Diffusion Wavelets. Ronald R. Coifman, Mauro Maggioni

Program in Applied Mathematics, Department of Mathematics, Yale University, New Haven, CT, U.S.A.

Abstract

We present a multiresolution construction for efficiently computing, compressing and applying large powers of operators that have high powers with low numerical rank. This allows the fast computation of functions of the operator, notably the associated Green's function, in compressed form, and their fast application. Classes of operators satisfying these conditions include diffusion-like operators, in any dimension, on manifolds, graphs, and in non-homogeneous media. In this case our construction can be viewed as a far-reaching generalization of Fast Multipole Methods, achieved through a different point of view, and of the non-standard wavelet representation of Calderón-Zygmund and pseudodifferential operators, achieved through a different multiresolution analysis adapted to the operator. We show how the dyadic powers of an operator can be used to induce a multiresolution analysis, as in classical Littlewood-Paley and wavelet theory, and we show how to construct, with fast and stable algorithms, scaling function and wavelet bases associated to this multiresolution analysis, together with the corresponding downsampling operators, and how to use them to compress the corresponding powers of the operator. This allows us to extend multiscale signal processing to general spaces (such as manifolds and graphs) in a very natural way, with corresponding fast algorithms.

Key words: Wavelets, Wavelets on Manifolds, Wavelets on Graphs, Heat Diffusion, Laplace-Beltrami Operator, Diffusion Semigroups, Fast Multipole Method, Matrix Compression, Spectral Graph Theory.

Email addresses: coifman@math.yale.edu (Ronald R. Coifman), mauro.maggioni@yale.edu (Mauro Maggioni).

Preprint submitted to Elsevier Science, 7 August 2004

1 Introduction

We introduce a multiresolution geometric construction for the efficient computation of high powers of local operators (in time O(n log n) for several classes of operators, where n is the cardinality of the space). The operators under consideration are positive contractions that have high powers with low numerical rank. This organization of the geometry of a matrix representing T enables fast computation of functions of the operator, notably the associated Green's function, in compressed form. Classes of operators satisfying these conditions include discretizations of differential operators, in any dimension, on manifolds, and in non-homogeneous media. Our construction can be viewed as an extension of Fast Multipole Methods [], and of the non-standard wavelet form for Calderón-Zygmund integral operators and pseudo-differential operators of []. Unlike the integral equation approach, we start from the generator T of the semigroup associated to a differential operator, rather than from the Green operator. For most of the examples above, the (dyadic) powers of this operator decrease in rank, suggesting the compression of the function (and geometric) space upon which each power acts.

The scheme we propose consists in the following: apply T to a space of test functions at the finest scale, compress the range via a local orthogonalization procedure, represent T in the compressed range, compute T², compress and orthogonalize, and so on: at scale j we obtain a compressed representation of T^{2^j}, acting on the range of T^{2^j − 1}, for which we have a (compressed) orthonormal basis; we then apply T^{2^j}, locally orthogonalize and compress the result, thus obtaining the next coarser subspace.

The computation of the inverse Laplacian (I − T)^{−1} (on the complement of the eigenspace corresponding to the eigenvalue 1, of course) can be carried out in compressed form via the Schultz method []: we have

(I − T)^{−1} f = Σ_{k=0}^{+∞} T^k f,

which implies, letting S_K = Σ_{k=0}^{2^K − 1} T^k, that

S_{K+1} f = S_K f + T^{2^K} S_K f = Π_{k=0}^{K} (I + T^{2^k}) f.   (1.1)

Since we can apply each factor T^{2^k} fast to any function f, and hence compute the product S_{K+1} f, we can apply (I − T)^{−1} to any function f fast, in time O(n log n). Moreover, since this construction adapts to the natural geometry induced by the operator, it is very promising in view of applications to the problem of homogenization of linear diffusion partial differential operators with non-constant coefficients.
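Since no code accompanies the text, the following is a minimal numpy sketch (ours) of the Schultz iteration (1.1); it uses dense matrices where the actual construction applies each dyadic power in compressed form, and the function name is our own.

import numpy as np

def apply_inverse_laplacian(T, f, K=20):
    """Apply (I - T)^{-1} to f via (1.1): S_{K+1} f = prod_k (I + T^{2^k}) f.
    Dense illustration only; assumes the eigenvalue-1 component has been
    projected out of f."""
    g = np.asarray(f, dtype=float).copy()
    P = np.asarray(T, dtype=float)      # P holds T^{2^k}
    for _ in range(K):
        g = g + P @ g                   # multiply by (I + T^{2^k})
        P = P @ P                       # square: T^{2^{k+1}}
    return g

After K steps this sums the first 2^K terms of the Neumann series, so K is chosen according to the spectral gap below 1.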

The interplay between the geometry of sets, function spaces on sets, and operators on sets is of course classical in Harmonic Analysis. Our construction views the columns of a matrix representing T as data points in Euclidean space, which are then viewed as lying on a manifold, for which the first few eigenvectors of T provide coordinates (see [3,4]). The spectral theory of T provides a Fourier Analysis on this set, relating our approach to multiscale geometric analysis and multiscale Fourier and wavelet analysis. The action of a given diffusion semigroup on the space of functions on the set is analyzed in a multiresolution fashion, where (dyadic powers of) the diffusion operator corresponds to dilations, and projections (in general non-orthogonal) correspond to downsampling. The localization of the scaling functions we construct then allows us to reinterpret these operations in function space in a geometric fashion. This mathematical construction has a numerical implementation which is fast and stable. We thus get a fast construction of multiresolution analyses on rather general spaces, with respect to the dilations induced by a diffusion semigroup, and fast algorithms for computing the corresponding transform.

The framework we consider in this work includes at once large classes of manifolds, graphs, and spaces of homogeneous type, and can be extended even further. Given a metric measure space and a symmetric diffusion semigroup on it, there is a natural multiresolution analysis associated to it, and an associated Littlewood-Paley theory for the decomposition of the natural function spaces on the metric space. These ideas are classical (see [5] for a specific instance in which the diffusion semigroup is center-stage; the literature on multiscale decompositions in general settings is vast, see e.g. [6] and references therein). Generalized Heisenberg principles exist [] that guarantee that eigenspaces are well approximated by scaling function spaces at the appropriate scale. Our work shows that scaling functions and wavelets can be constructed, and that fast numerical algorithms allowing the implementation of these ideas exist. In effect, this construction allows us to lift multiscale signal processing techniques to datasets, for compression and denoising of functions and operators on a dataset (and of the data set itself), and for approximation and learning. It also allows the introduction of dictionaries of diffusion wavelet packets [], with the corresponding fast algorithms for best basis selection.

Material related to this paper, such as examples, and Matlab scripts for their generation, will be made available on the web page at yale.edu/~mmm8/diffusionwavelets.html.

Note. Because of the scope of this article and because of space constraints, it is not possible to list all the references to all the works related to the motivations, techniques, and applications of the constructions and algorithms introduced in this paper. We provide references to the most relevant works, but many others had to be omitted.

2 The construction in the finite-dimensional discrete setting

In this section we present a particular case of our construction in a finite-dimensional, purely discrete setting, for which only finite-dimensional linear algebra is needed. We refer the reader to the diagram in Figure 2 for a visual representation of the scheme presented.

Fig. 1. Spectra of powers of T and corresponding multiscale eigenspace decomposition.

We consider a finite graph X and a symmetric, positive definite, positivity-preserving diffusion operator T on (functions on) X (for a precise definition and discussion of these hypotheses see Section 4). Without loss of generality, we can assume that ||T|| ≤ 1. For example:

(i) The graph could represent a metric space in which points are data (e.g. documents) and edges have weights (e.g. a function of the similarity between documents), and I − T could be a Laplacian on X that induces a natural diffusion process. T is then similar to a Markov matrix P representing the natural random walk on the graph [3].

(ii) X could represent the discretization of a domain and T = e^{−Δ}, where Δ is a partial differential operator such as a Laplacian on the domain, or on a manifold with smooth boundary.

Our main assumptions are that T is local, i.e. T(δ_k), where δ_k is the Dirac δ-function at k ∈ X, has small support, and that high powers of T have low numerical rank. Ideally there exists a γ < 1 such that for every j we have rk_ε(T^{2j}) < γ rk_ε(T^j), where rk_ε denotes the ε-numerical rank as in Definition 6.

We want to compute and describe efficiently the powers T^{2^j}, for j > 0, which describe the long-term behavior of the diffusion. This will allow the computation of functions of the operator in compressed form (notably of the Green's function (I − T)^{−1}), as well as the fast computation of the diffusion from any initial conditions. This is of course of great interest in the solution of discretized partial differential equations and of Markov chains, but also in learning and related classification problems.

Fig. 2. Diagram for downsampling, orthogonalization and operator compression. (All triangles are commutative by construction.)

The reason why one expects to be able to compress high powers of T is that they are low rank by assumption, so that it should be possible to represent them efficiently in an appropriate basis, at the appropriate resolution. From the analyst's perspective, these high powers are smooth functions with small gradient (even, in a sense, band-limited with small band), hence compressible.

We start by fixing a precision ε > 0, assume that T is self-adjoint and represented on the basis Φ_0 = {δ_k}_{k∈X}, and consider the columns of T, which can be interpreted as the set of functions Φ̃_1 = {Tδ_k}_{k∈X} on X. We use a local multiscale Gram-Schmidt procedure, which is a linear transformation we represent by a matrix G_0, to carefully orthonormalize these columns into a basis Φ_1 = {φ_{1,k}}_{k∈X_1} (X_1 is defined as this index set) for the range of T up to precision ε. This yields a subspace that we denote by V_1. Essentially Φ_1 is a basis for a subspace which is ε-close to the range of T, with basis elements that are well-localized. Moreover, the elements of Φ_1 are coarser than the elements of Φ_0, since they are the result of applying the dilation T once. Obviously |X_1| ≤ |X|, but the inequality may already be strict, since part of the numerical range of T may be below the precision ε. Whether this is the case or not, we then have a map M_0 from X to X_1, which is the composition of T with the orthonormalization by G_0. We can also represent T² in the basis Φ_1: we denote this matrix by T_1 and compute T_1 = M_0 M_0^T.

We proceed now by looking at the columns of T_1, which are Φ̃_2 = {T_1 δ_k}_{k∈X_1}, i.e., unravelling the bases on which this is happening, {T² φ_{1,k}}_{k∈X_1}, up to the precision ε. Again we can apply a local Gram-Schmidt procedure to orthonormalize this set: this yields a matrix G_1 and an orthonormal basis Φ_2 = {φ_{2,k}}_{k∈X_2} for the range of T_1 up to precision ε, and hence for the range of T³ up to precision 2ε. Moreover, depending on the decay of the spectrum of T, |X_2| is in general a fraction of |X_1|. The matrix M_1, which is the composition of G_1 with T_1, is then of size |X_2| × |X_1|, and T_2 = M_1 M_1^T is a representation of T⁴ acting on Φ_2.
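No implementation is given in the text, so the following dense sketch (ours) of this compression loop is purely illustrative: it truncates with an SVD where the paper uses the local multiscale Gram-Schmidt of Section 5, so the bases it produces are not localized, and the function name is our own.

import numpy as np

def build_diffusion_mra(T, n_levels, eps=1e-8):
    """Sketch of the loop of Section 2: at each level, orthonormalize the
    range of T_j up to precision eps, set M_j = G_j T_j, and form
    T_{j+1} = M_j M_j^T, a representation of T^{2^{j+1}}."""
    Tj = np.asarray(T, dtype=float)
    Ms, Ts = [], [Tj]
    for _ in range(n_levels):
        U, s, _ = np.linalg.svd(Tj)
        rank = int(np.sum(s > eps))     # eps-numerical rank of T_j
        if rank == 0:
            break
        G = U[:, :rank].T               # rows: orthonormal basis of ran(T_j)
        M = G @ Tj                      # M_j, of size |X_{j+1}| x |X_j|
        Tj = M @ M.T                    # T_{j+1} = M_j M_j^T
        Ms.append(M)
        Ts.append(Tj)
    return Ms, Ts

For a diffusion operator with fast spectral decay the sizes |X_j| shrink geometrically, mirroring the simultaneous compression of the operator and of the space.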

After j steps in this fashion, we will have a representation T_j of T^{2^j} on a basis Φ_j = {φ_{j,k}}_{k∈X_j}. Depending on the decay of the spectrum of T, we expect |X_j| << |X|; in fact, in the ideal situation, the spectrum of T decays fast enough so that there exists γ < 1 such that |X_j| < γ^{j+1} |X|. While Φ_j is naturally identified with the set of Dirac δ-functions on X_j, we can extend these functions, living on the compressed (or downsampled) graph X_j, to the whole initial graph X by writing Φ_j = M_{j−1} ··· M_0 Φ_0: since every function in Φ_0 is defined on X, so is every function in Φ_j. Hence any function on the compressed space X_j can be extended naturally to the whole of X. In particular, one can compute low-frequency eigenfunctions on X_j, and then extend them to the whole of X. The elements in Φ_j live at scale 2^{j+1}, and are much coarser, and smoother, than the initial elements in Φ_0, which is why they can be represented in compressed form. The projection of a function onto the subspace spanned by Φ_j will be, by definition, an approximation to that function at that particular scale.

There is an associated fast scaling function transform: suppose we are given f on X and want to compute <f, φ_{j,k}> for all scales j and corresponding translations k. Being given f means we are given (<f, φ_{0,k}>)_{k∈X}. Then we can compute (<f, φ_{1,k}>)_{k∈X_1} = M_0 (<f, φ_{0,k}>)_{k∈X}, and so on for all scales. The matrices M_j are sparse (since T_j and G_j are), so this computation is fast. This generalizes the classical scaling function transform. We will show later that wavelets can be constructed as well, and that a fast wavelet transform is also possible.

In the same way, any power of T can be applied fast to a function f. In particular, the Green's function (I − T)^{−1} can be applied fast to any function, since it can be represented as a product of dyadic powers of T as in (1.1), each of which can be applied fast.

We are at the same time compressing the powers of the operator T and the space X itself, at essentially the optimal rate at each scale, as dictated by the portion of the spectrum of the powers of T which is above the precision ε. Observe that each point in X_{j+1} can be considered as a local aggregation of points in X_j, which is completely dictated by the action of the operator T on functions on X: the operator itself is dictating the geometry with respect to which it should be analyzed, compressed, or applied to any vector.
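A minimal sketch (ours) of the scaling function transform cascade just described, assuming the list Ms = [M_0, M_1, ...] produced by a construction like the one sketched earlier:

import numpy as np

def scaling_function_transform(f0, Ms):
    """Given fine-scale coefficients f0 = (<f, phi_{0,k}>)_k and the
    (sparse, in practice) maps M_j, return coefficients at every scale."""
    coeffs = [np.asarray(f0, dtype=float)]
    for M in Ms:
        # (<f, phi_{j+1,k}>)_k = M_j (<f, phi_{j,k}>)_k
        coeffs.append(M @ coeffs[-1])
    return coeffs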

3 Notation and Definitions

To present our construction in some generality, we will need the following definitions and notation.

Definition 1 Let X be a set. A function d : X × X → [0, +∞) is called a quasi-metric if

(i) d(x, y) ≥ 0 for every x, y ∈ X, with equality if and only if x = y;
(ii) d(x, y) = d(y, x) for every x, y ∈ X;
(iii) there exists C > 0 such that d(x, y) ≤ C(d(x, z) + d(z, y)) for every x, y, z ∈ X (quasi-triangle inequality).

The pair (X, d) is called a quasi-metric space. (X, d) is a metric space if one can choose C = 1 in (iii).

Example 2 A weighted undirected connected graph (G, E, W), where W is the set of positive weights on the edges in E, is a metric space when the distance is defined by

d(x, y) = inf_{γ_{x,y}} Σ_{e ∈ γ_{x,y}} w_e,

where γ_{x,y} is a path connecting x and y; see the sketch after this section's definitions for a small computational illustration. Often in applications a measure on (the vertices of) G is either uniform or specified by some connectivity properties of each vertex (e.g. the sum of the weights of the edges concurrent in each vertex).

Let (X, d, µ) be a quasi-metric space with Borel measure µ. For x ∈ X, δ > 0, let B_δ(x) = {y ∈ X : d(x, y) < δ} be the open ball of radius δ around x. For a subset S of X, we will use the notation N_δ(S) = {x ∈ X : ∃ y ∈ S : d(x, y) < δ} for the δ-neighborhood of S and, for δ' < δ, we let N_{δ',δ}(S) = N_δ \ N_{δ'}.

Since we would like to handle finite, discrete and continuous situations in a unified framework, and since many finite-dimensional discrete problems arise from the discretization of continuous and infinite-dimensional problems, we introduce the following definitions.

Definition 3 (Coifman and Weiss) A quasi-metric measure space (X, d, µ) is said to be of homogeneous type [6] if µ is a non-negative Borel measure and there exists a constant C_X > 0 such that for every x ∈ X, δ > 0,

µ(B_{2δ}(x)) ≤ C_X µ(B_δ(x)).   (3.1)

We assume µ(B_δ(x)) < ∞ for all x ∈ X, δ > 0, and we will work on connected spaces X. In the continuous situation, we will assume that µ({x}) = 0 for every x ∈ X.
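To make Example 2 concrete, here is a small self-contained sketch (ours, using scipy's existing shortest-path routine) of the path metric d(x, y) on a weighted graph:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Symmetric weight matrix of a 4-vertex weighted graph; zero means no edge.
W = np.array([[0, 1, 0, 0],
              [1, 0, 2, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# d(x, y) = infimum over paths of the sum of edge weights (Example 2).
d = shortest_path(csr_matrix(W), method="D", directed=False)
print(d[0, 3])  # 4.0: the path 0-1-2-3 has total weight 1 + 2 + 1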

One can replace the quasi-metric d with an equivalent quasi-metric ρ so that the δ-balls in the ρ metric have measure approximately δ: it is enough to define ρ(x, y) as the measure of the smallest ball containing x and y. One can also assume some Hölder-smoothness for ρ, in the sense that

|ρ(x, y) − ρ(x', y)| ≤ c ρ(x, x')^β [ρ(x, y) + ρ(x', y)]^{1−β}

for some β ∈ (0, 1), c > 0; see [7,8,5,6].

Example 4 Examples of spaces of homogeneous type include:

(i) Euclidean spaces of any dimension, with isotropic or anisotropic metrics induced by positive-definite bilinear forms and their powers (see e.g. [7]).
(ii) Compact Riemannian manifolds with respect to the geodesic metric, or also with respect to metrics induced by certain classes of vector fields [8].
(iii) Graphs with bounded degree (the degree of a vertex is defined as d_x = Σ_{y∼x} w_{yx}, where y ∼ x means y is connected to x), in particular regular graphs.

Definition 5 A family of functions {φ_k}_k on (X, d) is called δ-local, for some δ > 0, if

sup_k diam(supp φ_k) < δ,

i.e. there exists a set {x_k}_k ⊆ X, called a supporting set for {φ_k}_k, such that for every k we have supp φ_k ⊆ B_δ(x_k).

Notation If V is a closed linear subspace of L²(X, µ), we denote the orthogonal projection onto V by P_V. The closure of any subspace V is taken in L²(X, µ), unless otherwise specified.

Definition 6 Two subspaces V and W are ε-close if there exists a linear isomorphism L with ||I − L|| < ε mapping V onto W. An ε-numerical span of a set of vectors {φ_k}_k in L² is defined as a closed subspace which is ε-close to the closure of the span of {φ_k}_k. An ε-numerical range of an operator T is a subspace which is ε-close to the range of T. The definition of numerical kernel is analogous.

Notation If L is a self-adjoint bounded operator on L²(X, µ), with spectrum σ_L and spectral decomposition

L = ∫_{σ_L} λ dE_λ,

we define

L_ε = ∫_{{λ ∈ σ_L : |λ| > ε}} λ dE_λ.

4 Multiresolution Analysis induced by symmetric diffusion semigroups

4.1 Symmetric diffusion semigroups

We start from the following definition from [5].

Definition 7 Let {T^t}_{t ∈ [0,+∞)} be a family of operators on (X, µ), each mapping L²(X, µ) into itself. Suppose this family is a semigroup, i.e. T^0 = I and T^{t_1+t_2} = T^{t_1} T^{t_2} for any t_1, t_2 ∈ [0, +∞), and lim_{t→0+} T^t f = f in L²(X, µ) for any f ∈ L²(X, µ). Such a semigroup is called a symmetric diffusion semigroup if it satisfies the following:

(i) ||T^t||_p ≤ 1 for every 1 ≤ p ≤ +∞ (contraction property).
(ii) Each T^t is self-adjoint (symmetry property).
(iii) T^t is positivity preserving: T^t f ≥ 0 for every f ≥ 0 in L²(X) (positivity property).
(iv) The semigroup has a generator Δ, so that T^t = e^{tΔ}.   (4.1)

While some of these assumptions are not strictly necessary, their adoption simplifies this presentation without reducing the types of applications we are interested in at this time.

We will denote by σ_T the spectrum of T (so that {λ^t}_{λ∈σ_T} is the spectrum of T^t), and by {ξ_λ}_{λ∈σ_T} the corresponding basis (orthogonal since T^t is normal) of eigenvectors, normalized in L² (without loss of generality). Observe that here and in all that follows we use an abuse of notation that implicitly accounts for possible multiplicities of the eigenvalues.

Remark 8 Observe that σ_T ⊆ [0, 1]: by the semigroup property and (ii), we have

T = T^{1/2} T^{1/2} = (T^{1/2})* T^{1/2},

so the spectrum is non-negative. The upper bound obviously follows from condition (i).
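As a numerical sanity check on Definition 7 and Remark 8, here is a small sketch (ours) using a graph heat semigroup T = e^{−L}, anticipating the graph examples in the list below:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = (W + W.T) / 2   # symmetric edge weights
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W              # graph Laplacian, L >= 0
T = expm(-L)                                # candidate diffusion operator

eigs = np.linalg.eigvalsh(T)
print(eigs.min() > 0, eigs.max() <= 1 + 1e-12)   # sigma_T inside (0, 1]
f = rng.random(6)
print((T @ f).min() >= 0)                        # positivity preserving on f >= 0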

Remark 9 There are classes of diffusion operators that are not self-adjoint but are very important in applications. The construction and the algorithm we propose do not depend on this hypothesis; however, the interpretation of many results does depend on the spectral decomposition. In the paper [9] we will address these and related issues in a broader framework.

Example 10 Examples include:

(a) The Poisson semigroup, for example on the circle or half-space, or anisotropic and/or higher-dimensional anisotropic versions (e.g. [7]).

(b) The random walk diffusion induced by a symmetric Markov chain (on a graph, a manifold, etc.).

(c) The semigroup T^t = e^{−tL} generated by a second-order differential operator on some interval (a, b) (a and/or b possibly infinite), of the form

L f = a_2(x) d²f/dx² + a_1(x) df/dx + a_0(x) f(x),

with a_2(x) > 0 and a_0(x) ≤ 0, acting on a subspace of L²((a, b), q(x)dx), where q is an appropriate weight function, given by imposing appropriate boundary conditions, so that L is (unbounded) self-adjoint. Conditions (i) to (iii) are satisfied, and (iv) is satisfied if a_0 = 0. This extends to R^n by considering elliptic partial differential operators of the form

L f = (1/w(x)) Σ_{i,j=1}^{n} ∂/∂x_i ( a_{ij}(x) ∂f/∂x_j ) + c(x) f,

where we assume c(x) ≤ 0 and w(x) > 0, and we consider this operator as defined on a smooth manifold. L, applied to functions with appropriate boundary conditions, is formally self-adjoint and generates a semigroup satisfying (i) to (iii), and (iv) is satisfied if c = 0. An important case is the Laplace-Beltrami operator on a compact smooth manifold (or even on a Lie group), and subordinated operators [5].

(f) If (X, d, µ) is derived from a graph (G, W) as described above in Example 2, one can define D_i = Σ_j W_{ij}; then the matrix D^{−1} W is a Markov matrix, which corresponds to the natural random walk on G. The operator L = D^{−1/2}(D − W)D^{−1/2} is the normalized Laplacian on graphs; it is self-adjoint, and I − L is a self-adjoint contraction on L²(G). This discrete setting is extremely useful in applications and is widely used in a number of fields such as data analysis (e.g. clustering, learning on manifolds, parametrization of data sets), computer vision (e.g. segmentation) and computer science (e.g. network design, distributed processing). We believe our construction will have applications in all these settings. As a starting point, see for example [3] for an overview, and [] for an application to image segmentation.

(g) One-parameter groups (continuous or discrete) of dilations in R^n, or other motion groups (e.g. the Galilei group or the Heisenberg-type groups), act on square-integrable functions on these spaces (with the appropriate measures) as isometries or contractions.

Continuous wavelets in some of these settings have been studied (see e.g. [] and references therein), but an efficient discretization of such transforms seems still an open problem (see however []).

(h) The random walk diffusion on a hypergroup (see e.g. [3] and references therein).

Definition 11 A positive diffusion semigroup of operators acts η-regularly, for some η > 1, if for every x and every smooth test function φ that is δ-local around x, the function Tφ is ηδ-local around x.

4.2 Diffusion metrics and embeddings induced by a diffusion semigroup

The material in this section and its applications are discussed in greater detail in [4,3] and references therein. From now on, in all that follows, we shall assume, mainly for simplicity, that T is compact. Being positive definite, T^t induces the diffusion metric

d^{(t)}(x, y)² = Σ_{λ∈σ_T} λ^t (ξ_λ(x) − ξ_λ(y))² = <δ_x − δ_y, T^t(δ_x − δ_y)> = ||T^{t/2} δ_x − T^{t/2} δ_y||².   (4.2)

Definition 12 The metric defined above is called the diffusion metric associated to T, at time t.

If the action of T^t on L²(X, µ) can be represented by a (symmetric) kernel K_t(x, y), then d^{(t)} can be written as

d^{(t)}(x, y)² = K_t(x, x) + K_t(y, y) − 2 K_t(x, y),   (4.3)

since the spectral decomposition of K is

K_t(x, y) = Σ_{λ∈σ_T} λ^t ξ_λ(x) ξ_λ(y).   (4.4)

For any subset σ'_T ⊆ σ_T we can consider the map of metric spaces

Φ^{(t)}_{σ'_T} : (X, d^{(t)}) → (R^{|σ'_T|}, d_{Euc}),   x ↦ (λ^{t/2} ξ_λ(x))_{λ∈σ'_T},   (4.5)

which in particular cases is called an eigenmap [5,6], and is a form of local multidimensional scaling (see also [3,4]). By the definition of d^{(t)}, this map is an isometry when σ'_T = σ_T, and an approximation to an isometry when σ'_T ⊊ σ_T. If σ'_T is the set of the first n top eigenvalues, and if ∫_X K(x, y) dµ = 1, then Φ_{σ'_T} is a minimum local distortion map from (X, d^{(t)}) to R^n, in the sense that it minimizes

tr(P_{{φ_λ}_{λ∈σ'_T}} K^{(t)} P_{{φ_λ}_{λ∈σ'_T}}) = ∫_X ∫_X Σ_{λ∈σ'_T} (φ_λ(x) − φ_λ(y))² K^{(t)}(x, y) dµ(x) dµ(y)   (4.6)

among all possible maps x ↦ (φ_1(x), ..., φ_n(x)) such that <φ_i, φ_j> = δ_{ij}. This is just a rephrasing, in our situation, of the well-known fact that the top k singular vectors span the best approximating k-dimensional subspace to the domain of a linear operator, in the L² sense. See e.g. [7] and references therein for a comparison between different dimensionality reduction techniques that can be cast in this framework. These ideas are related to various techniques used for nonlinear dimension reduction: see for example [8,3,4] and references therein. Observe that for any fixed precision ε, one can find σ'_T so that the eigenfunction expansion (4.2) of d^{(t)} can be truncated to σ'_T with approximation error smaller than ε.

Finally, we observe that the semigroup generates not only a family of metrics on X, but a family of measures as well. We simply consider the σ-algebra A^{(t)} generated by the family of sets

S^{(t)} = {B^{(t)}_δ(x) := supp T^t(χ_{B_δ(x)}), x ∈ X, δ > 0}

(where χ_A denotes the characteristic function of a set A), and define µ^{(t)}(B^{(t)}_δ(x)) = µ(B_δ(x)) on S^{(t)} and extend it to A^{(t)}.

Example 13 Suppose we have a symmetric Markov chain M on (X, µ), which we can assume irreducible, with transition matrix P. We assume P is positive definite (otherwise we could consider (1/2)(I + P)). Let {λ_l}_l be the set of eigenvalues of P, ordered by increasing value, and {ξ_l}_l be the corresponding set of right eigenvectors of P, which form an orthonormal basis. Then by functional calculus we have

P^t_{i,j} = Σ_l λ^t_l ξ_l(i) ξ_l(j),

and for an initial distribution f on X we define

P^t f = Σ_l λ^t_l <f, ξ_l> ξ_l.   (4.7)

Observe that if f is a probability distribution, so is P^t f for every t, since P is Markov. The diffusion map Φ_{σ'_T} embeds the Markov chain in Euclidean space R^{|σ'_T|} in such a way that the diffusion metric induced by M on X becomes the Euclidean distance in R^{|σ'_T|}.

4.3 Construction of scaling functions and wavelets

We can interpret T as a dilation operator acting on functions in L²(X, µ), and use it to define a multiresolution structure. As in [5] and in classical wavelet theory, we start by discretizing the semigroup at the increasing sequence of times

t_j = Σ_{l=0}^{j} 2^l = 2^{j+1} − 1.   (4.8)

Let {λ_i}_i be the spectrum of T, ordered by decreasing value, and {ξ_i}_i the corresponding eigenvectors. We can define low-pass portions of the spectrum by letting

σ_{T,j} = {λ ∈ σ_T : λ^{t_j} ≥ ε}.   (4.9)

For a fixed ε ∈ (0, 1) (which we may think of as our precision), we define the (finite-dimensional!) approximation spaces by

V_j = <{ξ_λ : λ ∈ σ_T, λ^{t_j} ≥ ε}> = <{ξ_λ : λ ∈ σ_{T,j}}>.   (4.10)

The set of subspaces {V_j}_{j∈Z} is a multiresolution analysis in the sense that it satisfies the following properties:

(i) lim_{j→−∞} V_j = Ran(T), lim_{j→+∞} V_j = <{ξ_i : λ_i = 1}>;
(ii) V_{j+1} ⊆ V_j for every j ∈ Z;
(iii) {ξ_λ : λ^{t_j} ≥ ε} is an orthonormal basis for V_j.

We can also define, for j ≥ 0, the subspaces W_j as the orthogonal complement of V_{j+1} in V_j, so that we have the familiar relation between approximation and detail subspaces of the classical wavelet multiresolution constructions:

V_j = V_{j+1} ⊕ W_j.   (4.11)
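A toy calculation (ours) of how quickly dim V_j = |σ_{T,j}| shrinks under (4.8)-(4.10), for a synthetic spectrum:

import numpy as np

eps = 1e-6
lam = np.linspace(1.0, 0.0, 1024, endpoint=False)   # toy spectrum of T in (0, 1]
for j in range(8):
    t_j = 2 ** (j + 1) - 1                    # the times (4.8)
    dim_Vj = int(np.sum(lam ** t_j >= eps))   # |sigma_{T,j}|, see (4.9)-(4.10)
    print(j, t_j, dim_Vj)

Only the eigenvalues very close to 1 survive raising to the power t_j, so the dimensions drop rapidly with j.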

The direct orthogonal sum

L²(X) = ⊕_{j≥0} W_j

is a wavelet decomposition of the space, induced by the diffusion semigroup, and related to the Littlewood-Paley decomposition studied in this setting by Stein [5]. While by definition (4.10) we already have an orthonormal basis of eigenfunctions of T for each subspace V_j (and for the subspaces W_j as well), these basis functions are in general highly non-localized, being generalized global Fourier modes of the operator. Our aim is to build localized bases for each of these subspaces, starting from a basis of the finest subspace V_0 and explicitly constructing a downsampling scheme that yields an orthonormal basis for each V_j, j > 0. This is motivated by generalized Heisenberg principles (see e.g. [33] for a setting similar to ours) that guarantee that eigenfunctions have a smoothness, or frequency content, or scale, determined by the corresponding eigenvalues, and can be reconstructed from maximally localized bump functions, or atoms, at that scale. Such ideas have many applications in numerical analysis, especially to problems motivated by physics (multigrid techniques [34,35], matrix compression [36-39], etc.), and are in general behind many techniques for multiscale compression. We avoid the computation of the eigenfunctions; nevertheless, the approximation spaces Ṽ_j that we build will be roughly ε-approximations of the true V_j's.

We propose two ways of constructing the scaling functions in the multiresolution analysis. One uses the following:

Proposition 14 (Multiscale local orthonormalization) Let Φ̃ be a δ-local set of functions on X spanning a subspace V, S a supporting set for Φ̃, and let ε > 0. Then there exists an orthonormal basis

Φ = {Φ_l}_{l=0,...,L} = {{φ_{l,k}}_{k∈K_l}}_{l=0,...,L}

of V such that:

(a) for each l ∈ {0,...,L}, the family {φ_{l,k}}_{k∈K_l} is (2^l+1)δ-local and has a supporting set S_l = {x_{l,k}}_{k∈K_l} ⊆ S that is a (2^l+1)δ-lattice. In particular |K_l| ≲ µ(X)/µ(B_{2^l δ});

(b) the linear map G from Φ̃ to Φ is local and sparse;

(c) there exists a monotonic function L(·) such that for each l, {Φ_{l'}}_{l'=0,...,L(l)} spans a subspace which is ε-close to the subspace spanned by the top l singular vectors of Φ̃.

We postpone the proof of this proposition, and comments on its numerical aspects and implementation, until Section 5. Here we would like to remark that, under rather general assumptions on Φ̃ and geometric assumptions on its supporting set S, one can prove that the constructed Φ have exponential decay.

For the second orthonormalization technique, which guarantees good bounds on the decay of the orthonormal functions we build, we will need the following definition from [6].

Definition 15 A matrix (B_{j,k})_{(j,k)∈J×J} is called η-accretive if there exists an η > 0 such that for every ξ ∈ l²(J) we have

Re Σ_{j,k} B_{jk} ξ_j ξ̄_k ≥ η ||ξ||²_{l²(J)}.

In [6, Proposition 3 in Section 2.4], Coifman and Meyer prove a result suggesting the following

Proposition 16 Let (X, d, µ) be a space of homogeneous type, and let Φ̃ = {φ̃_j}_{j∈J} be a countable, δ-local Riesz basis of L²(X, µ), with supporting set S = {s_j}_{j∈J}. If the Gramian matrix G_{ij} = <φ̃_i, φ̃_j> is η-accretive and there exist C, α > 0 such that

|G_{i,j}| ≤ C e^{−α d(s_i, s_j)},

then there exist C', β > 0 and an orthonormal basis {φ_j}_{j∈J} such that

φ_j = Σ_{k∈J} γ(j, k) φ̃_k,  with |γ(j, k)| ≤ C' e^{−β d(s_j, s_k)}.

The proof proceeds exactly as in [6], with the triangle inequality on Z^n replaced by the quasi-triangle inequality for the metric d.

Definition 17 Let (X, ρ, µ) be a space of homogeneous type, and γ > 0. A subset {x_i}_i of X is a γ-lattice if ρ(x_i, x_j) > cγ for all i ≠ j, and for every y ∈ X there exists i(y) such that ρ(y, x_{i(y)}) < cγ. Here c depends only on the constant C_X of the space of homogeneous type.

We first present our construction in general.

Theorem 18 (Construction of the Multiresolution Analysis) Let {T^t} be a symmetric diffusion semigroup acting regularly on X, and let Φ_0 = {φ_{0,k}}_{k∈K_0} be a countable orthonormal δ-local basis with supporting set S_0. Let Ṽ_0 = <Φ_0> and assume Ṽ_0 is ε-close to V_0, for some fixed ε > 0.

Then we can build a set of linear maps {G_j} and {M_j} as in Figure 2, and scaling functions with the following properties:

(I) For every j, let

Φ = {{{φ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j}}_{j=0,...,J},   Φ_j = {{φ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j},

and Ṽ_j = <Φ_j>. Then Φ_j is an orthonormal basis and Ṽ_j is (j+1)ε-close to V_j.

(II) For every j, the orthonormal basis Φ_j has the following layered structure:

(a) For every l ∈ {0,...,L_j}, |K_{j,l}| ≲ µ(X)/µ(B^{(t_j)}_{2^l δ}).

(b) For every l ≤ L_j, with respect to the metric d^{(t_j)}, the set {φ_{j,l,k}}_{k∈K_{j,l}} is (2^l+1)δ-local and admits a supporting set S_{j,l} that is a (2^l+1)δ-lattice. In particular, supp φ_{j,l,k} ∩ supp φ_{j,l,k'} = ∅ for k, k' ∈ K_{j,l}, k ≠ k'.

(III) The representation of T^{2^{j+1}} on Φ_j is faithful, in the sense that T^{2^{j+1}} P_{Φ_j} is (j+1)ε-close to T^{2^{j+1}} restricted to Ran(T^{2^{j+1}}).

(IV) The orthonormalization map G_j mapping Φ̃_{j+1} = T^{2^j} Φ_j onto Φ_{j+1}, and the transformation M_j = G_j T^{2^j} mapping Φ_j onto Φ_{j+1}, are local and sparse.

A localized orthonormal basis of wavelets spanning the subspace W_j, defined as in (4.11), can be constructed. Alternatively, by using the orthogonalization of Proposition 16, a family of scaling functions Φ with exponential decay for the same subspaces Ṽ_j can be constructed.

PROOF. We apply T to the given orthonormal basis Φ_0 (dilation step). This yields an ηδ-local family of functions {φ̃_{1,k}}_{k∈K_0} on X whose span is ε-distant from V_1, by definition of V_1, because Ṽ_0 was ε-close to V_0 and T is a contraction. We then apply Proposition 14 or 16 to {φ̃_{1,k}}_{k∈K_0} (downsampling step) to get the orthonormal basis

Φ_1 = {{φ_{1,l,k}}_{k∈K_{1,l}}}_{l=0,...,L_1}

for a subspace Ṽ_1 which is ε-close to <Φ̃_1>, and hence 2ε-close to V_1, with the properties claimed in Theorem 18. At this point we represent (T²)_ε on Φ_1. Observe that only the top L(ε) layers of Φ_1 are needed to express (T²)_ε up to precision ε, because of the properties of the algorithm used to build Φ_1.

Moreover, the subset {{φ_{1,l,k}}_{k∈K_{1,l}}}_{l=0,...,L(ε)} of Φ_1 is (η(2^{L(ε)}+1))δ-local.

By induction, once we have built Φ_j with all the properties claimed, we apply (T^{2^j})_ε to Φ_j, getting a set of functions Φ̃_{j+1}. As before, observe that only the first L(ε) layers of Φ̃_{j+1} will be necessary to express T^{2^{j+1}} to precision ε, and this subset is local. Applying Proposition 14 yields an orthonormal basis Φ_{j+1} with the properties claimed. The map G_j is constructed in the proof of Proposition 14, and is a local orthogonalization map that is sparse, being a multiscale orthogonalization of local functions. The orthogonalization produces an error term of order ε in the distance between Ṽ_j and V_j.

Observe that each orthonormalization could also have been achieved by applying Proposition 16, since at each level we are applying a self-adjoint, strictly positive operator to an orthonormal basis: it is easy to see that the resulting system has an accretive Gramian matrix.

4.4 Downsampling and compression of the powers of T

Let us summarize this construction in terms of linear operators (and corresponding matrix representations), to show how this leads not only to the fast construction of the multiscale analysis and the orthonormal bases of scaling functions, but also to a multi-level compression of the operator at different scales. Figure 2 is again a visual summary of what follows.

By induction, suppose we have constructed the orthonormal scaling functions Φ_j = {{φ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j} at scale j, written on the orthonormal basis Φ_{j−1} = {{φ_{j−1,l,k}}_{k∈K_{j−1,l}}}_{l=0,...,L_{j−1}} at scale j−1, and that we compress T^{2^j} by representing it on these two bases in the domain and range respectively. Then we perform two operations: we compute Φ̃_{j+1} = {φ̃_{j+1,l,k}} = {T^{2^j} φ_{j,l,k}} for every k ∈ K_{j,l}, l ∈ {0,...,L_j} (this operation is fast because T^{2^j} is compressed and well-localized after compression), and then use a fast local modified Gram-Schmidt procedure to find the numerical span of {φ̃_{j+1,l,k}} and at the same time orthogonalize them. This is a linear map G_j that yields the orthonormal basis Φ_{j+1} = {φ_{j+1,l,k}}_{k∈K_{j+1,l}}, l ∈ {0,...,L_{j+1}}, naturally represented on the bases Φ_j (in the domain) and Φ_{j+1} (in the range), such that <Φ_{j+1}> is ε-close to <Φ̃_{j+1}>. The map M_j is the composition G_j T^{2^j}, and is thus naturally represented on the bases Φ_j (in the domain) and Φ_{j+1} (in the range).

Observe that for j = 0 we can assume that the operator T is given represented on the basis {φ_{0,k}}_{k∈K_0}; we then represent T^{2^{j+1}} on the basis Φ_{j+1} simply by computing

T_{j+1} = Φ_{j+1} T^{2^{j+1}} Φ_{j+1}^T = M_j M_j^T.   (4.12)

Φ_{j+1} is of size Σ_l |K_{j+1,l}| × Σ_l |K_{j,l}|, so T_{j+1} is square of the corresponding size, and is a compressed version of the operator T^{2^{j+1}} acting on Ṽ_{j+1}.

We can interpret this as having downsampled the subspace V_j at the critical rate for representing, up to precision ε, the action of T^{2^j} on it. This is related to the Heisenberg principle and sampling theory for these operators [], which implies that low-frequency eigenfunctions are not well localized and can be synthesized from coarse (depending on the scale) bump functions. One then expects the existence of quadrature formulas determined by very few samples: we consider this problem of great importance, and it is currently being investigated. Most of the functions {φ_{j,l,k}} are essentially as well localized as is compatible with their being in V_j.

Moreover, because of this localization property, we can interpret this downsampling in function space geometrically. We have the identifications

X_j := {x_{l,k} : x_{l,k} is the center of φ_{j,l,k}} ↔ ∪_l {φ_{j,l,k}}_{k∈K_{j,l}}.   (4.13)

The natural metric on X_j is d^{(t_j)}, which is, by (4.2), the distance between the φ_{j,l,k} in L²(X, µ), and can be computed recursively in each X_j by combining (4.2) with (4.12). This allows us to interpret X_j as a space representing the metric d^{(t_j)} in compressed form.

In our construction we only compute φ_{j,l,k} expanded on the basis {φ_{j−1,l,k}}_{k∈K_{j−1,l}} (this is the role of M_{j−1}); in other words, up to the identifications above, we know φ_{j,l,k} on the downsampled space X_{j−1} only. However, we can extend these functions to X_{j−2}, and recursively all the way down to X_0 = X, just by using the multiscale relations:

φ_{j,l,k}(x) = M_{j−1} φ_{j−1,l,k}(x), x ∈ X_{j−1}
            = M_{j−1} M_{j−2} ··· M_0 φ_{0,l,k}(x) = (Π_{l'=0}^{j−1} M_{l'}) φ_{0,l,k}(x), x ∈ X.   (4.14)

This is of course completely analogous to the standard construction of scaling functions in the Euclidean setting [4]. This formula also immediately generalizes to arbitrary functions in V_j, extending them from X_j to the whole original space X (see for example Figure 6).

The matrix M_j is sparse, since T^{2^j} is expected to be local (in the compressed space X_j) and G_j is sparse (as we show in the Appendix), so that M_j has essentially |X_j| log |X_j| elements above precision (in fact, in many cases of practical interest, only about |X_j|!), at least for j not too large.
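A short sketch (ours) of the extension (4.14): coefficients of a function on the compressed space X_j are pushed back to the original space X with the transposes of the M_l.

import numpy as np

def extend_to_X(c_j, Ms):
    """Extend a function given by coefficients c_j on the compressed space
    X_j to the original space X, via (4.14). Ms = [M_0, ..., M_{j-1}],
    where M_l has shape |X_{l+1}| x |X_l|; each row of M_{j-1} ... M_0
    expresses one scaling function phi_{j,l,k} on X."""
    values = np.asarray(c_j, dtype=float)
    for M in reversed(Ms):
        values = M.T @ values   # from X_{l+1} back down to X_l
    return values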

When j is large, the operator is in general not local anymore, but the space X_j on which it is compressed is expected to be very small. For interesting classes of operators we actually expect to have only O(|X_j|) entries in M_j which are bigger than ε, and thus the algorithms presented will have order O(n). This follows from the observation that the basis functions {{φ_{l,k}}_{k∈K_l}}_{l=0,...,L} by construction essentially span the subspace corresponding to the top singular vectors of Φ̃_{j+1} = T^{2^j} Φ_j, which correspond to the (fast-)decaying singular values of T^{2^j}.

Remark 19 We could have started from Ṽ_0 and then defined Ṽ_j as the result of j steps of our algorithm: in this way we can do without the spectral decomposition of the semigroup. This permits the application of our whole construction and algorithms to the non-self-adjoint case.

Remark 20 (Biorthogonal bases) While at this point we are mainly concerned with the construction of orthonormal bases for the approximation spaces V_j, well-conditioned bases would be just as good for most purposes, and would lead to the construction of stable biorthogonal scaling function bases. This could follow the ideas of dual operators in Coifman's formula on spaces of homogeneous type [43,6]; also, exactly as in the classical biorthogonal wavelet construction [44], we would have two ladders of approximation subspaces, with wavelet subspaces giving the oblique projection onto their corresponding duals. This will be reported in the separate work [9].

4.5 Wavelets

We would like to construct bases {ψ_{j,l,k}}_{k,l} for the spaces W_j, j ≥ 0, such that V_{j+1} ⊕ W_j = V_j. To achieve this, after having built {{φ_{j,l,k}}_{k∈K_{j,l}}}_l and {{φ_{j+1,l,k}}_{k∈K_{j+1,l}}}_l, we can apply our modified Gram-Schmidt procedure with geometric pivoting to the set of functions {(P_j − P_{j+1}) φ_{j,l,k}}_{k∈K_{j,l}}, which will yield an orthonormal basis Ψ_j of wavelets for the orthogonal complement W_j of V_{j+1} in V_j. Moreover, one can easily prove upper bounds for the diameters of the supports of the wavelets so obtained, and for their decay.

4.6 Vanishing moments for the scaling functions and wavelets

In Euclidean settings, vanishing moments are usually defined via orthogonality relations to subspaces of polynomials up to a certain degree. In our setting, the natural subspaces with respect to which to measure orthogonality are the sets of eigenfunctions ξ_λ of T.

So the number of vanishing moments of an orthonormal scaling function φ_{j,l,k} can be defined as the number of eigenfunctions, corresponding to eigenvalues in σ_{T,j} \ σ_{T,j+1} (as defined in (4.9)), to which T^{2^j} φ_{j,l,k} is orthogonal up to precision ε. Observe that this is comparable to defining the number of vanishing moments based on T^{2^{j+1}} φ_{j,l,k}, which evokes classical estimates in the multiscale theory of Calderón-Zygmund operators. In Section 5 we will see that there is an approximately monotonic relationship between the index l of a layer of scaling functions and the number of eigenfunctions to which the scaling functions are orthogonal, i.e. the number of moments of the scaling functions. In particular, the mean of the scaling functions is typically roughly constant for the first few layers, and then quickly drops below precision as l grows: this would allow us to split each scaling function space into two subspaces, one with scaling functions of non-zero mean, and one of scaling functions with zero mean, which in fact could appropriately be called wavelets.

5 The orthogonalization step: computational and numerical considerations

In this section we discuss details of the implementation, and comment on the computational complexity of the algorithms and on their numerical stability.

Suppose we are given a δ-local family Φ̃ = {φ̃_k}_{k∈K} of positive functions on X, and let X̃ = {x̃_k}_{k∈K} be a supporting set for Φ̃. With obvious meaning, we will sometimes write φ̃_{x_k} for φ̃_k. Let Ṽ = <{φ̃_k}_{k∈K}>.

We want to build an orthonormal basis Φ = {φ_k}_k whose span is ε-close to Ṽ. Out of the many possible solutions to this standard problem (see e.g. [45,46] and references therein as a starting point), we seek one for which the φ_k's have small support (ideally of the same order as the support of the φ̃_k's). Standard orthonormalization may in general completely destroy the size of the support of (most of) the φ̃_k's. We suggest two algorithms that try to remedy this problem: the first uses geometric information on the set (hence something outside the scope of linear algebra) to derive a pivoting rule that guarantees that several orthogonal functions have small support; the second uses only the matrix representing T in a local basis, and a nonlinear pivoting rule mixing L² and L^p norms, p < 2.

The computational cost of constructing the multiresolution analysis described in Theorem 18 can be as high as O(n³) in general, where n is the cardinality of X.

However, there are two distinct types of properties of T that allow one to dramatically improve the time necessary for the construction:

I. The decay of σ(T): the faster the spectrum decays, the smaller the numerical rank of the powers of T and the more these can be compressed, and hence the faster the construction of Φ_j for large j.

II. If each basis Φ_j of scaling functions that we build is such that the elements with large support (large layer index l) are in a subspace spanned by eigenvectors of T corresponding to very small eigenvalues (depending on l), then these basis functions of large support will not be needed to compute the next (dyadic) power of T. This implies that the matrices representing all the basis transformations will be uniformly sparse.

A combination of these two hypotheses, which are in fact quite natural and are verified by many operators arising in practice, allows the construction of the whole multiresolution analysis described in Theorem 18 in time O(n log n), or even O(n).

Item (I.) above can only be an assumption. Item (II.), however, is a question about the algorithms from linear algebra used to construct the bases of scaling functions. We want to comment here on several possible choices of algorithms to be used.

(a) The first is called modified Gram-Schmidt with pivoting; it is classical in numerical analysis (see for example [46]).

(b) The second is also based on finite-dimensional linear algebra, and is a refinement of the first. It constructs so-called Rank Revealing QR factorizations. Being purely based on linear algebra, it holds under minimal assumptions on T, and this comes at the expense of computational speed.

(c) The third is based on analysis and geometry, and applies at least if T looks like a Laplacian, in a sense to be made precise below. In this case there are generalized Heisenberg principles, or sampling theorems, that give information on the critical sampling rate necessary to approximate eigenvectors of T, depending on the corresponding eigenvalues. One can then construct the scaling function bases using this information, and in this way guarantee that property (II) is satisfied.

5.1 Modified Gram-Schmidt with pivoting

Given a set of functions Φ̃ on X, and ε > 0, the algorithm Modified Gram-Schmidt with Pivoting computes an orthonormal basis Φ for a subspace ε-close to <Φ̃>, as follows.

1. Let Φ̃_0 = Φ̃, k = 0.
2. Let φ̃ be an element of Φ̃_k with largest L² norm. If ||φ̃||_2 < ε/|Φ̃|^{1/2}, stop; otherwise add φ̃/||φ̃||_2 to Φ.
3. Orthogonalize all the remaining elements of Φ̃_k against φ̃, obtaining a new set Φ̃_{k+1}. Increment k by 1. Go to step 2.

When the algorithm stops, we have obtained an orthonormal basis Φ whose span is ε-close to <Φ̃>. This follows from the fact that the singular values of the residual family Φ̃_k, all of whose elements have L² norm smaller than ε/|Φ̃|^{1/2}, cannot be larger than ε.

This procedure can be shown to be numerically stable, at least when Φ̃ is not too ill-conditioned [46]. When the problem is very ill-conditioned, a loss of orthogonality in Φ may occur while running the algorithm. Re-orthogonalizing the elements already selected in Φ is enough to achieve stability; see [45] for details and references.

The main drawback of this procedure is that the supports of the functions in Φ can be arbitrarily large, even if the supports of the functions in Φ̃ are small and there exists another orthonormal basis for <Φ̃> made of well-localized functions. Since the sizes of the supports of the functions in the new basis are crucial for the speed of the algorithm, we suggest two modifications of modified Gram-Schmidt that seek to obtain orthonormal bases of functions with smaller support.

5.2 Multiscale modified Gram-Schmidt with geometric pivoting

We suggest here to use a fast local Gram-Schmidt orthogonalization with geometric pivoting, which provides some guarantees on the size of the support of the orthonormal functions it constructs, besides having complexity O(n log n), where n is the cardinality of X, when applied to a local family. Here we are assuming that the approximate (up to precision) search for the ε-neighbors of any point in X has complexity at most log n. To achieve this, X may need to be pre-processed; state-of-the-art fast nearest neighbor algorithms achieve this in time O(n^{1+η} log n), with a constant exponential in the dimension of X, defined in a reasonable way. The literature on fast approximate nearest neighbors and range searches is vast; see for example [47] as a starting point. Here we limit ourselves to observing that since T is local and is given represented in a δ-local basis, we already have a very valuable amount of information about the nearest neighbors of each point, which can be exploited.
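A dense sketch (ours) of the pivoted modified Gram-Schmidt of Section 5.1, operating on the columns of an array; a production version would exploit the sparsity and locality discussed above:

import numpy as np

def mgs_with_pivoting(Phi, eps):
    """Columns of Phi are the functions to orthonormalize; returns an
    orthonormal set whose span is eps-close to the span of Phi."""
    Phi = np.array(Phi, dtype=float, copy=True)
    m = Phi.shape[1]
    basis = []
    for _ in range(m):
        norms = np.linalg.norm(Phi, axis=0)
        k = int(np.argmax(norms))          # pivot: largest L^2 norm
        if norms[k] < eps / np.sqrt(m):    # stopping rule of step 2
            break
        q = Phi[:, k] / norms[k]
        basis.append(q)
        Phi -= np.outer(q, q @ Phi)        # orthogonalize the rest against q
    return np.array(basis).T               # columns form the basis Phi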

In the following subsection we will present an algorithm that does not use the geometry of the supports and does not need knowledge of nearest neighbors. The Gram-Schmidt procedure here is completely local, in the sense that it is organized so that each of the constructed functions needs to be orthonormalized only against nearby functions, across scales, at the appropriate scale.

PROOF. [Construction of Proposition 14, greedy version] We start by building K_0 so that {x_{0,k}}_{k∈K_0} is a δ-lattice, following [] (see also [48]). We pick x_{0,1} ∈ S arbitrarily; then we find an x_{0,2} ∈ S \ (S ∩ B_δ(x_{0,1})) closest to x_{0,1} and such that φ̃_{x_{0,2}} is δ-local (when there is more than one, choose one randomly), and so on: when x_{0,1}, ..., x_{0,k} ∈ S have been picked, pick x_{0,k+1} ∈ S \ (S ∩ N_δ({x_{0,1},...,x_{0,k}})) closest to N_δ({x_{0,1},...,x_{0,k}}), and such that φ̃_{x_{0,k+1}} is δ-local, until no such point can be found. The set {x_{0,k}}_{k∈K_0} is a δ-lattice by construction, and the functions φ_{0,k} := φ̃_{x_{0,k}}, k ∈ K_0, are orthogonal because they are δ-local and hence have disjoint compact supports. Of course, we normalize them in L² before proceeding. Now we locally orthogonalize all the remaining φ̃_k to the selected {φ_{0,k}}_{k∈K_0} (this operation is local as well): the resulting set of residual functions Φ̃¹ is (2+1)δ-local. We can repeat the procedure above on Φ̃¹ by building a (2+1)δ-lattice K_1. In the actual numerical implementation, we pick x_{1,k} such that the corresponding residual function φ̃_{x_{1,k}} has the largest L² norm among the remaining ones (as in pivoted Gram-Schmidt).

The thesis follows by induction: after having built Φ_0, ..., Φ_l = {{φ_{l',k}}_{k∈K_{l'}}}_{l'=0,...,l}, we orthogonalize the remaining φ̃_k to <Φ_0, ..., Φ_l>, thus getting a set Φ̃^{l+1}. Each function in Φ̃^{l+1} is (2^{l+1}+1)δ-local: we find a (2^{l+1}+1)δ-lattice K_{l+1} and normalize the family {φ̃_k}_{k∈K_{l+1}} of functions with disjoint supports to get {φ_{l+1,k}}_{k∈K_{l+1}}. As before, in the implementation we choose the φ_{l+1,k} with maximum norm among the residual functions in Φ̃^{l+1}.

In the non-greedy version of the algorithm, at each level, rather than choosing a closest scaling function in Φ̃^{l+1} with support disjoint from the ones already chosen at that level, we choose the scaling function with largest L² norm among the ones in the corona N_{(2^l+1)δ, (2^{l+1}+1)δ}: any of these is guaranteed to have support disjoint from the previous ones (but not too far away from at least one of them), and choosing the one with largest L² norm is the natural pivoting choice to maintain numerical stability. We stop the construction when no element φ̃_k has L² norm greater than ε.
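The greedy δ-lattice selection driving this proof can be sketched in a few lines (our own simplification, for a finite point set with precomputed pairwise distances):

import numpy as np

def greedy_delta_lattice(D, delta):
    """Greedy delta-lattice (Definition 17) on a finite set: selected points
    are pairwise at least delta apart, and every point lies within delta of
    a selected one. D is the symmetric matrix of pairwise distances."""
    selected = [0]                    # arbitrary first center
    dist_to_sel = D[0].copy()         # distance of each point to the lattice
    while True:
        outside = np.where(dist_to_sel >= delta)[0]
        if outside.size == 0:
            break
        # among the uncovered points, pick the one closest to the lattice,
        # mirroring the greedy step in the proof above
        k = int(outside[np.argmin(dist_to_sel[outside])])
        selected.append(k)
        dist_to_sel = np.minimum(dist_to_sel, D[k])
    return selected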

With these modifications, this construction inherits all the stability properties of modified Gram-Schmidt with pivoting, at least in the case in which ||φ̃_x|| is between two close constants, since in this case we are simply rearranging the pivoting based on geometric information, resulting in better packings, localization of the orthogonalization computations, and the promised computational cost.

Remark 21 When the initial system Φ̃ is very ill-conditioned and orthogonality of Φ is crucial, at each level l we need to re-orthogonalize the system built so far: see [49,45] for details and references (and observe that this only about doubles the overall cost).

At least when X is finite, the construction above is a particular, geometry-driven, QR decomposition of the matrix representing the change of basis from Φ̃ to Φ. More precisely, let (Φ̃)_{kj} = φ̃_k(x_j), with k ∈ K and j ∈ X (with the identification between j and x_j), and similarly let (Φ)_{kj} = φ_k(x_j), with k running through the indices in K_0, ..., K_L in this order. Assume that Φ̃ is δ-local. Then we have, up to a permutation of the columns,

Φ = [ D_{1,1}                           ]
    [ E_{2,1}   D_{2,2}                 ]  Φ̃,   (5.1)
    [   ...       ...      ...          ]
    [ E_{J,1}   E_{J,2}   ...   D_{J,J} ]

where D_{l,l} is a |K_l| × |K_l| diagonal matrix, and E_{i,j}, which orthogonalizes the residual functions Φ̃^i to the subspace spanned by the first j levels, i.e. <Φ_1, ..., Φ_j>, is of size |K_i| × |K_j| and sparse. In fact, it is important to realize that by construction each row of E_{i,j} contains about µ(B_{2^i δ})/µ(B_{2^j δ}) ≲ C_X^{i−j} non-zero entries.

Hence in the whole matrix there are about

Σ_{i=1}^{log|X|} Σ_{j=1}^{i} |K_i| C_X^{i−j} ≲ Σ_{i=1}^{log|X|} |K_i| C_X^{i} ≲ Σ_{i=1}^{log|X|} (|X| / C_X^{i}) C_X^{i} ≈ |X| log|X|

non-zero entries. Of these, for large classes of useful operators, the number of entries above precision is actually of the order of |X|.

It is not by chance that the form of the matrix above is very similar to the ones for wavelet-compressed Calderón-Zygmund operators as in []. There the wavelets used for the compression of the operator were fixed and universal; here they are constructed in arbitrary geometries and adapted to the operator.

Remark 22 The construction in this proposition applies in particular to the fast construction of local orthonormal (multi-)scaling functions in R^n, which we think is of independent interest. As an example, this construction can be applied to spline functions of any order, yielding an orthonormal basis of splines different from the standard one due to Strömberg [5].

Remark 23 The cost of running the above algorithm is O(n log n) times the cost of finding (approximate) r-neighbors, in addition to the cost of any pre-processing that may be necessary to perform these searches fast. This follows from the fact that all the orthogonalization steps are local. Observe that because of the geometric assumptions on Φ̃ and T, essentially all the information needed regarding nearest neighbors is encoded in the matrix representing T.

5.3 Modified Gram-Schmidt with mixed L²-L^p pivoting

In this section we propose a second algorithm for computing an orthogonal basis spanning the same subspace as a given δ-local family Φ̃. The main motivation for the algorithm in the previous section was the observation that modified Gram-Schmidt with pivoting can in general generate basis functions with large supports, while it is crucial in our setting to find orthonormal bases with rather small support. We would like to introduce a term in modified Gram-Schmidt that prefers functions with smaller support over functions with larger support. On the unit ball of L², the functions with concentrated support are those in L¹ or, more generally, in L^p, for some p < 2. We can then modify the modified Gram-Schmidt with pivoting algorithm of Section 5.1 as follows.

1. Let Φ̃_0 = Φ̃, k = 0, λ > 0, p ∈ [1, 2).
2. Let φ̃ be the element of Φ̃_k with largest L² norm, say N_k = ||φ̃||_2. Among all the elements of Φ̃_k with L² norm at least λN_k, pick the one with the smallest L^p norm: call it φ̃_k. If ||φ̃_k||_2 < ε/|Φ̃|^{1/2}, stop; otherwise add φ̃_k/||φ̃_k||_2 to Φ.
3. Orthogonalize all the remaining elements of Φ̃_k against φ̃_k, obtaining a new set Φ̃_{k+1}. Increment k by 1. Go to step 2.

Choosing the element by bounding its L² norm from below is important for numerical stability, but relaxing the condition of picking the element with the largest L² norm allows us to pick an element with smaller L^p norm, yielding potentially much better localized bases. The parameter λ controls this slack, and in practice choosing it between 3/4 and 1 gave good results. It is easy to construct examples in which the standard modified Gram-Schmidt with L² pivoting (which corresponds to λ = 1) leads to bases with most elements having large support, while our modification with mixed norms yields much better localized bases.

Remark 24 Observe that this algorithm does not require knowledge of the nearest neighbors. In the implementation, to achieve maximum speed, the matrix representing T should be in sparse form, and queries of non-zero elements by rows and by columns should be fast. We have not investigated the stability of this algorithm theoretically. We simply observed that it worked very well in the examples we tried, with stability comparable to the standard modified Gram-Schmidt with pivoting, and very often much better localization in the resulting basis functions.

5.4 Rank Revealing QR factorizations

We refer the reader to [5], and references therein, for details regarding the numerical and algorithmic aspects related to the rank-revealing factorizations discussed in this section.

Definition 25 A partial QR factorization of a matrix M ∈ Mat(n, n),

M Π = Q R = Q [ A_k  B_k ]
              [  0   C_k ],   (5.2)

where Q is orthogonal, A_k ∈ Mat(k, k) is upper-triangular with non-negative diagonal elements, B_k ∈ Mat(k, n−k), C_k ∈ Mat(n−k, n−k), and Π is

a permutation matrix, is a strong rank-revealing QR factorization if

σ_min(A_k) ≥ σ_k(M) / p_1(k, n)  and  σ_j(C_k) ≤ σ_{k+j}(M) p_1(k, n)   (5.3)

and

||A_k^{−1} B_k|| ≤ p_2(k, n),   (5.4)

where p_1(k, n) and p_2(k, n) are functions bounded by a low-degree polynomial in k and n, and σ_max(A) and σ_min(A) denote the largest and smallest singular values of a matrix A.

Modified Gram-Schmidt with pivoting, as described in Section 5.1, actually yields a decomposition satisfying (5.3), but with p_1 depending exponentially on n [5]. Let SRRQR(M, ε) be an algorithm that computes a strong rank-revealing QR factorization of M, with σ_{k+1} ≤ ε. Gu and Eisenstat present such an algorithm, which satisfies (5.3) and (5.4) with p_1(k, n) = 1 + nk(n−k) and p_2(k, n) = n. Moreover, this algorithm requires in general O(n³) operations, but faster algorithms exploiting the sparsity of the matrices involved can be devised.

We can then proceed in the construction of the multiresolution analysis described in Theorem 18 as follows. We replace each orthogonalization step G_j by the following two steps. First we apply SRRQR(T_j, ε) to get a rank-revealing QR factorization of T_j, represented on the basis Φ_j, which we write as

T_j Π_j = Q_j R_j = Q_j [ A_{j,k}  B_{j,k} ]
                        [   0      C_{j,k} ].   (5.5)

The columns of Q_j span Ṽ_{j+1} up to error ε p_1(k, n), where k = max{i : λ_i^{2^{j+1}} ≥ ε} = dim(Ṽ_{j+1}). In fact, the strong rank-revealing QR decomposition above shows that the first k columns of T_j Π_j are a well-conditioned basis that ε p_1(k, n)-spans Ṽ_{j+1}. Now, this basis is a well-selected subset of bump functions T^{2^j} Φ_j, and can be orthogonalized with Proposition 14, with estimates on the supports exactly as claimed in Theorem 18.
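In practice, an off-the-shelf column-pivoted QR (weaker than the strong RRQR of Definition 25, but readily available) is often enough to expose the ε-numerical rank. A sketch (ours) using scipy:

import numpy as np
from scipy.linalg import qr

def rank_revealing_step(Tj, eps):
    """Column-pivoted QR of T_j: the leading columns of Q span an
    approximation to V_{j+1}; the diagonal of R estimates the singular
    values, so we truncate where it falls below eps."""
    Q, R, piv = qr(Tj, pivoting=True)
    k = int(np.sum(np.abs(np.diag(R)) > eps))   # eps-numerical rank of T_j
    return Q[:, :k], piv[:k]                    # basis, and chosen columns of T_j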

28 5.5 Computation of the Wavelets In the computation of wavelets as described in section 4.5, numerically one has to make sure that, at every orthogonalization step, the construction of the wavelets is made on vectors that are numerically orthogonal to the scaling functions at the previous scale. This can be attained, in a numerically stable way, by repeatedly orthogonalizing the wavelets to the scaling functions. Observe that this orthogonalization is again a local operation, and hence can be computed fast. This is also necessary to guarantee that the construction of wavelets will stop exactly at the exhaustion of the wavelet subspace, without spilling into the scaling spaces. 6 Fast Algorithms Assume that both (I) and (II) in section 5 are satisfied or, more generally, that each orthonormalization step and change of basis can be computed in time O(n log n). Then there are fast algorithms for constructing the whole Multi-Resolution Analysis: Proposition 6 The construction of all the scaling functions, and linear transformations {G j } and {M j },forj =,...,J, as described in Theorem 8, can be done in time O(n log n). Corollary 7 (The Fast Scaling Function Transform) Let the scaling function transform of a function f L (X, µ), X = n, be the set of all coefficients {< f,φ j >} j=,...,j. All these coefficients can be computed in time O(n log n). Remark 8 For various interesting classes of operators arising in applications, the computational complexity of the transform is actually O(n) instead of O(n log n) since the number of entries in M j above precision is n instead of log n. The following is a simple consequence of Corollary 7: Corollary 9 (The Fast Wavelet Transform) Let the wavelet transform of a function f L (X, µ) be the set of all coefficients {< f,ψ j >} j=,...,j. All these coefficients can be computed in time O(n log n). 8

29 Remark 3 Remark 8 applies here as well. 6. Fast eigenfunction computation The computation of approximations to the eigenfunctions ξ λ corresponding to λ>λ can be computed efficiently in V j,wherej is min{j : λ t j >ɛ}, and then the result can be extended to the whole space X by using the extension formula (4.4). These eigenfunctions can also be used to embed the coarsened space X j into R n, along the lines of section 4., or can be extended to X and used to embed the whole space X. See [5,35] for results of multigrid techniques applied to fast eigenfunction computations. 6. Fast inversion of Laplacian-like operators Given a fast method like ours to compute the powers of T, the Laplacian (I T ) can be inverted (on the orthogonal complement of the eigenspace corresponding to the eigenvalue ) via the Schultz method []: since and, if S K = K k= T k,wehave (I T ) f = S K+ = S K + T K S K = + k= T k f K ( I + T k) f. Since we can apply fast T k to any function f, and hence the product S K+, we can apply (I T ) to any function f fast, in time O(Knlog n). The value of K depends on the gap between and the first eigenvalue smaller than, which essentially regulates the speed of convergence of a distribution to its asymptotic distribution. Observe that we never construct the full matrix representing (I T ) (which is general full!), but we only keep a compressed multiscale representation of it, which we can use to compute the action of the operator on any function. k= 9

30 6.3 Relationship with eigenmap embeddings In Section 4. we discussed metrics induced by a symmetric diffusion semigroup {T t } acting on L (X, µ), and how eigenfunctions of the generator of the semigroup can be used to embed X in Euclidean space. Similar embeddings can be obtained by using diffusion scaling functions and wavelets, since they span subspaces localized around spectral bands. So for example the top diffusion scaling functions in V j approximate well, by the arguments in this Section, the top eigenfunctions. See for example Figure 6. 7 Natural multiresolution on sampled Riemannian manifolds and on graphs The natural construction of Brownian motion and of the associated Laplace- Beltrami operator for compact Riemannian manifolds can be approximated on a finite number of points which are realizations of a random variable taking values on the manifold according to a certain probability distribution as in [4,3]. The construction starts with assuming that we have a data set Γ which is obtained by drawing according to some unknown distribution p, and whose range is a smooth manifold Γ. Given a kernel that satisfies some natural mild conditions, for example in the standard Gaussian form x y K(x, y) =e ( δ ) for some scale factor δ>, is normalized twice as follows before computing its eigenfunctions. The main observation is that any integration on the empirical data set Γ of the kernel against a function is in the form K(x, y i )f(y i ) Γ and thus it a Riemann sum associated to the integral K(x, y)f(y)p(y)dy. Γ Hence to capture only the geometric content of the manifold it is necessary to get rid of the measure p on the dataset. This can be estimated at scale δ for example by convolving with some smooth kernel (for example K itself), thus obtaining an estimated probability density p δ. One then considers the kernel K(x, y) = K(x, y) p δ (y) 3

31 as new kernel. This kernel is further normalized so that it becomes averaging on the data set, yielding K(x, y) = K(x, y) Γ K(x, y)p(y)dy. It is shown in [3] that with this normalization, in the limit Γ + and δ, the kernel K thus obtained is the one associated with the Laplace- Beltrami operator on Γ. Applying our construction to this natural kernel, we obtain the natural multiresolution analysis associated to the Laplace-Beltrami operator on a compact Riemannian manifold. This also leads to compressed representation of the heat flow and of functions of it. On weighted graphs, the canonical random walk is associated with the matrix of transition probabilities P = D W where W is the symmetric matrix of weights and D is the diagonal matrix defined by D ii = j W ij. P is in general not symmetric, but it is conjugate, via D to the symmetric matrix D WD which is a contraction on L. Our construction allows the construction of a multiresolution on the graph, together with downsampled graphs, adapted to the random walk [3]. 8 Examples and applications 8. Orthogonal spline multiresolution Spline multiresolution analyses are well-studied in the wavelet literature (see for example [5,44,53 55] and references therein, for multiwavelets see for example [56,57] and references therein) and used in several applications, image analysis, finite elements methods for PDEs and others. The classical spline functions of order L on the real line, with knots on the integer grid, are piecewise polynomials of degree L on each interval of the form [k, k +], k Z, and globally of class C L. The basic spline function of order L is defined by ( ) +e iξ L+ ˆϕ(ξ) = so that ϕ(x) =( ) L+ χ [,] (x). The space of splines with knots on the integer grid is spanned by {ϕ( k)} k Z, and in fact this is a Riesz basis. To obtain an orthonormal basis these functions are usually orthonormalized via a translation-invariant Gram-Schmidt 3

32 V V V 3 V 4 V V V V V V V. 555 V. 555 V V V Fig. 3. Multiresolution Analysis on the circle. We consider 56 points on the unit circle, start with ϕ,k = δ k and with the standard diffusion. We plot several scaling functions in each approximation space V j Fig. 4. Multiresolution Analysis on the circle: on the left we plot the compressed matrices representing powers of the diffusion operator, on the right we plot the entries of the same matrices which are above working precision procedure [5,58] yielding the so called Franklin scaling functions and the corresponding wavelets, which have support on the whole real line, but still are exponentially decaying. To preserve the compactness of the support, [44] built biorthogonal splines. We can apply our construction to this situation, by letting X = R (or Z, or perturbed versions of Z!) with standard distance and measure, and the initial family of scaling functions being the family of splines of order L, {ϕ L ( k)} k Z (restricted to Z if X = Z).. Our Gram-Schmidt orthogonalization with geometric pivoting yields infinitely many different scaling functions, and scaling functions at substrate l having support of size comparable to l diam.supp. ϕ L. 3

33 V V V3 V4 V5 V6 V7 V8 V9 V V W.5 W.5 W3 W4 W5 W6 W7 W8 W W.5.5 W.5 x V V V3 V4 V5 V6 V7 V8 V9 V V x 5 W x 4 W x 4 W x 3 W x 3 W W6. W W8 W W W Fig. 5. Multiresolution Analysis on the circle. In the same setting as for Figure 3, we compute the multiscale transform of a periodic signal on the circle, contaminated by two δ-impulses (top) and of windowed chirp (bottom). In the first column we plot the projections onto coarser and coarser scaling spaces, in the second column we plot the projection on the corresponding wavelet subspaces. Computations here were done to 5 digits of precision. 8. A simple homogenization problem We consider the non-homogeneous heat equation on the circle u t = ( c(x) ) x x u (8.) where c(x) is quite non-uniform, and we want to represent the large scale/large time behavior of the solution by compressing powers of the operator ( ) representing the discretization of the spatial differential operator x c(x) x.the spatial operator is of course one of the simplest Sturm-Liouville operators in 33

34 c(x) 3 5 φ Fig. 6. Non homogeneous medium with non-constant diffusion coefficient (plotted top left): the scaling functions for V 5 (top right). In the middle row: an eigenfunction (the 35th) of the diffusion operator (left), and the same eigenfunction reconstructed by extending the corresponding eigenvector of the compressed T (right): the L -error is of order 5. The entries above precision of the matrix T 8 compressed on V 4 (bottom left). We plot bottom right {(ϕ 8, (x i ),ϕ 8, (x i ))} i,which are an approximation to the eigenmap corresponding to the first two eigenfunctions, since V 8 has dimension 3. 34

Diffusion Wavelets and Applications

Diffusion Wavelets and Applications Diffusion Wavelets and Applications J.C. Bremer, R.R. Coifman, P.W. Jones, S. Lafon, M. Mohlenkamp, MM, R. Schul, A.D. Szlam Demos, web pages and preprints available at: S.Lafon: www.math.yale.edu/~sl349

More information

Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of large data sets.

Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of large data sets. Diffusion Geometries, Diffusion Wavelets and Harmonic Analysis of large data sets. R.R. Coifman, S. Lafon, MM Mathematics Department Program of Applied Mathematics. Yale University Motivations The main

More information

A Multiscale Framework for Markov Decision Processes using Diffusion Wavelets

A Multiscale Framework for Markov Decision Processes using Diffusion Wavelets A Multiscale Framework for Markov Decision Processes using Diffusion Wavelets Mauro Maggioni Program in Applied Mathematics Department of Mathematics Yale University New Haven, CT 6 mauro.maggioni@yale.edu

More information

Contribution from: Springer Verlag Berlin Heidelberg 2005 ISBN

Contribution from: Springer Verlag Berlin Heidelberg 2005 ISBN Contribution from: Mathematical Physics Studies Vol. 7 Perspectives in Analysis Essays in Honor of Lennart Carleson s 75th Birthday Michael Benedicks, Peter W. Jones, Stanislav Smirnov (Eds.) Springer

More information

Diffusion Geometries, Global and Multiscale

Diffusion Geometries, Global and Multiscale Diffusion Geometries, Global and Multiscale R.R. Coifman, S. Lafon, MM, J.C. Bremer Jr., A.D. Szlam, P.W. Jones, R.Schul Papers, talks, other materials available at: www.math.yale.edu/~mmm82 Data and functions

More information

Fast Direct Policy Evaluation using Multiscale Analysis of Markov Diffusion Processes

Fast Direct Policy Evaluation using Multiscale Analysis of Markov Diffusion Processes Fast Direct Policy Evaluation using Multiscale Analysis of Markov Diffusion Processes Mauro Maggioni mauro.maggioni@yale.edu Department of Mathematics, Yale University, P.O. Box 88, New Haven, CT,, U.S.A.

More information

Spectral Processing. Misha Kazhdan

Spectral Processing. Misha Kazhdan Spectral Processing Misha Kazhdan [Taubin, 1995] A Signal Processing Approach to Fair Surface Design [Desbrun, et al., 1999] Implicit Fairing of Arbitrary Meshes [Vallet and Levy, 2008] Spectral Geometry

More information

Chapter One. The Calderón-Zygmund Theory I: Ellipticity

Chapter One. The Calderón-Zygmund Theory I: Ellipticity Chapter One The Calderón-Zygmund Theory I: Ellipticity Our story begins with a classical situation: convolution with homogeneous, Calderón- Zygmund ( kernels on R n. Let S n 1 R n denote the unit sphere

More information

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University

March 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels March 13, 2008 Paper: R.R. Coifman, S. Lafon, maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels Figure: Example Application from [LafonWWW] meaningful geometric

More information

Multiscale Analysis and Diffusion Semigroups With Applications

Multiscale Analysis and Diffusion Semigroups With Applications Multiscale Analysis and Diffusion Semigroups With Applications Karamatou Yacoubou Djima Advisor: Wojciech Czaja Norbert Wiener Center Department of Mathematics University of Maryland, College Park http://www.norbertwiener.umd.edu

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions Sridhar Mahadevan Department of Computer Science University of Massachusetts Amherst, MA 13 mahadeva@cs.umass.edu Mauro

More information

The following definition is fundamental.

The following definition is fundamental. 1. Some Basics from Linear Algebra With these notes, I will try and clarify certain topics that I only quickly mention in class. First and foremost, I will assume that you are familiar with many basic

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Yongsheng Han, Ji Li, and Guozhen Lu Department of Mathematics Vanderbilt University Nashville, TN Internet Analysis Seminar 2012

More information

Algebras of singular integral operators with kernels controlled by multiple norms

Algebras of singular integral operators with kernels controlled by multiple norms Algebras of singular integral operators with kernels controlled by multiple norms Alexander Nagel Conference in Harmonic Analysis in Honor of Michael Christ This is a report on joint work with Fulvio Ricci,

More information

A new class of pseudodifferential operators with mixed homogenities

A new class of pseudodifferential operators with mixed homogenities A new class of pseudodifferential operators with mixed homogenities Po-Lam Yung University of Oxford Jan 20, 2014 Introduction Given a smooth distribution of hyperplanes on R N (or more generally on a

More information

Review and problem list for Applied Math I

Review and problem list for Applied Math I Review and problem list for Applied Math I (This is a first version of a serious review sheet; it may contain errors and it certainly omits a number of topic which were covered in the course. Let me know

More information

On a class of pseudodifferential operators with mixed homogeneities

On a class of pseudodifferential operators with mixed homogeneities On a class of pseudodifferential operators with mixed homogeneities Po-Lam Yung University of Oxford July 25, 2014 Introduction Joint work with E. Stein (and an outgrowth of work of Nagel-Ricci-Stein-Wainger,

More information

Spectral Continuity Properties of Graph Laplacians

Spectral Continuity Properties of Graph Laplacians Spectral Continuity Properties of Graph Laplacians David Jekel May 24, 2017 Overview Spectral invariants of the graph Laplacian depend continuously on the graph. We consider triples (G, x, T ), where G

More information

LECTURE NOTE #11 PROF. ALAN YUILLE

LECTURE NOTE #11 PROF. ALAN YUILLE LECTURE NOTE #11 PROF. ALAN YUILLE 1. NonLinear Dimension Reduction Spectral Methods. The basic idea is to assume that the data lies on a manifold/surface in D-dimensional space, see figure (1) Perform

More information

Data-dependent representations: Laplacian Eigenmaps

Data-dependent representations: Laplacian Eigenmaps Data-dependent representations: Laplacian Eigenmaps November 4, 2015 Data Organization and Manifold Learning There are many techniques for Data Organization and Manifold Learning, e.g., Principal Component

More information

Beyond Scalar Affinities for Network Analysis or Vector Diffusion Maps and the Connection Laplacian

Beyond Scalar Affinities for Network Analysis or Vector Diffusion Maps and the Connection Laplacian Beyond Scalar Affinities for Network Analysis or Vector Diffusion Maps and the Connection Laplacian Amit Singer Princeton University Department of Mathematics and Program in Applied and Computational Mathematics

More information

Analysis Preliminary Exam Workshop: Hilbert Spaces

Analysis Preliminary Exam Workshop: Hilbert Spaces Analysis Preliminary Exam Workshop: Hilbert Spaces 1. Hilbert spaces A Hilbert space H is a complete real or complex inner product space. Consider complex Hilbert spaces for definiteness. If (, ) : H H

More information

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M =

CALCULUS ON MANIFOLDS. 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = CALCULUS ON MANIFOLDS 1. Riemannian manifolds Recall that for any smooth manifold M, dim M = n, the union T M = a M T am, called the tangent bundle, is itself a smooth manifold, dim T M = 2n. Example 1.

More information

Lecture: Some Practical Considerations (3 of 4)

Lecture: Some Practical Considerations (3 of 4) Stat260/CS294: Spectral Graph Methods Lecture 14-03/10/2015 Lecture: Some Practical Considerations (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough.

More information

Spectral Geometry of Riemann Surfaces

Spectral Geometry of Riemann Surfaces Spectral Geometry of Riemann Surfaces These are rough notes on Spectral Geometry and their application to hyperbolic riemann surfaces. They are based on Buser s text Geometry and Spectra of Compact Riemann

More information

Learning gradients: prescriptive models

Learning gradients: prescriptive models Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

l(y j ) = 0 for all y j (1)

l(y j ) = 0 for all y j (1) Problem 1. The closed linear span of a subset {y j } of a normed vector space is defined as the intersection of all closed subspaces containing all y j and thus the smallest such subspace. 1 Show that

More information

Wavelets, Multiresolution Analysis and Fast Numerical Algorithms. G. Beylkin 1

Wavelets, Multiresolution Analysis and Fast Numerical Algorithms. G. Beylkin 1 A draft of INRIA lectures, May 1991 Wavelets, Multiresolution Analysis Fast Numerical Algorithms G. Beylkin 1 I Introduction These lectures are devoted to fast numerical algorithms their relation to a

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS TSOGTGEREL GANTUMUR Abstract. After establishing discrete spectra for a large class of elliptic operators, we present some fundamental spectral properties

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Diffusion Wavelets for multiscale analysis on manifolds and graphs: constructions and applications

Diffusion Wavelets for multiscale analysis on manifolds and graphs: constructions and applications Diffusion Wavelets for multiscale analysis on manifolds and graphs: constructions and applications Mauro Maggioni EPFL, Lausanne Dec. 19th, 5 EPFL Multiscale Analysis and Diffusion Wavelets - Mauro Maggioni,

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

Topics in Harmonic Analysis Lecture 6: Pseudodifferential calculus and almost orthogonality

Topics in Harmonic Analysis Lecture 6: Pseudodifferential calculus and almost orthogonality Topics in Harmonic Analysis Lecture 6: Pseudodifferential calculus and almost orthogonality Po-Lam Yung The Chinese University of Hong Kong Introduction While multiplier operators are very useful in studying

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS

L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS Rend. Sem. Mat. Univ. Pol. Torino Vol. 57, 1999) L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS Abstract. We use an abstract framework to obtain a multilevel decomposition of a variety

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication

are Banach algebras. f(x)g(x) max Example 7.4. Similarly, A = L and A = l with the pointwise multiplication 7. Banach algebras Definition 7.1. A is called a Banach algebra (with unit) if: (1) A is a Banach space; (2) There is a multiplication A A A that has the following properties: (xy)z = x(yz), (x + y)z =

More information

Physics 202 Laboratory 5. Linear Algebra 1. Laboratory 5. Physics 202 Laboratory

Physics 202 Laboratory 5. Linear Algebra 1. Laboratory 5. Physics 202 Laboratory Physics 202 Laboratory 5 Linear Algebra Laboratory 5 Physics 202 Laboratory We close our whirlwind tour of numerical methods by advertising some elements of (numerical) linear algebra. There are three

More information

JUHA KINNUNEN. Harmonic Analysis

JUHA KINNUNEN. Harmonic Analysis JUHA KINNUNEN Harmonic Analysis Department of Mathematics and Systems Analysis, Aalto University 27 Contents Calderón-Zygmund decomposition. Dyadic subcubes of a cube.........................2 Dyadic cubes

More information

homogeneous 71 hyperplane 10 hyperplane 34 hyperplane 69 identity map 171 identity map 186 identity map 206 identity matrix 110 identity matrix 45

homogeneous 71 hyperplane 10 hyperplane 34 hyperplane 69 identity map 171 identity map 186 identity map 206 identity matrix 110 identity matrix 45 address 12 adjoint matrix 118 alternating 112 alternating 203 angle 159 angle 33 angle 60 area 120 associative 180 augmented matrix 11 axes 5 Axiom of Choice 153 basis 178 basis 210 basis 74 basis test

More information

Maxwell s equations in Carnot groups

Maxwell s equations in Carnot groups Maxwell s equations in Carnot groups B. Franchi (U. Bologna) INDAM Meeting on Geometric Control and sub-riemannian Geometry Cortona, May 21-25, 2012 in honor of Andrey Agrachev s 60th birthday Researches

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality (October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are

More information

Multiscale Frame-based Kernels for Image Registration

Multiscale Frame-based Kernels for Image Registration Multiscale Frame-based Kernels for Image Registration Ming Zhen, Tan National University of Singapore 22 July, 16 Ming Zhen, Tan (National University of Singapore) Multiscale Frame-based Kernels for Image

More information

Spectral Theory, with an Introduction to Operator Means. William L. Green

Spectral Theory, with an Introduction to Operator Means. William L. Green Spectral Theory, with an Introduction to Operator Means William L. Green January 30, 2008 Contents Introduction............................... 1 Hilbert Space.............................. 4 Linear Maps

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

Wavelet Bases of the Interval: A New Approach

Wavelet Bases of the Interval: A New Approach Int. Journal of Math. Analysis, Vol. 1, 2007, no. 21, 1019-1030 Wavelet Bases of the Interval: A New Approach Khaled Melkemi Department of Mathematics University of Biskra, Algeria kmelkemi@yahoo.fr Zouhir

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Analysis in weighted spaces : preliminary version

Analysis in weighted spaces : preliminary version Analysis in weighted spaces : preliminary version Frank Pacard To cite this version: Frank Pacard. Analysis in weighted spaces : preliminary version. 3rd cycle. Téhéran (Iran, 2006, pp.75.

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Determinant lines and determinant line bundles

Determinant lines and determinant line bundles CHAPTER Determinant lines and determinant line bundles This appendix is an exposition of G. Segal s work sketched in [?] on determinant line bundles over the moduli spaces of Riemann surfaces with parametrized

More information

Data dependent operators for the spatial-spectral fusion problem

Data dependent operators for the spatial-spectral fusion problem Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.

More information

Short note on compact operators - Monday 24 th March, Sylvester Eriksson-Bique

Short note on compact operators - Monday 24 th March, Sylvester Eriksson-Bique Short note on compact operators - Monday 24 th March, 2014 Sylvester Eriksson-Bique 1 Introduction In this note I will give a short outline about the structure theory of compact operators. I restrict attention

More information

Where is matrix multiplication locally open?

Where is matrix multiplication locally open? Linear Algebra and its Applications 517 (2017) 167 176 Contents lists available at ScienceDirect Linear Algebra and its Applications www.elsevier.com/locate/laa Where is matrix multiplication locally open?

More information

Lattice Theory Lecture 4. Non-distributive lattices

Lattice Theory Lecture 4. Non-distributive lattices Lattice Theory Lecture 4 Non-distributive lattices John Harding New Mexico State University www.math.nmsu.edu/ JohnHarding.html jharding@nmsu.edu Toulouse, July 2017 Introduction Here we mostly consider

More information

The heat kernel meets Approximation theory. theory in Dirichlet spaces

The heat kernel meets Approximation theory. theory in Dirichlet spaces The heat kernel meets Approximation theory in Dirichlet spaces University of South Carolina with Thierry Coulhon and Gerard Kerkyacharian Paris - June, 2012 Outline 1. Motivation and objectives 2. The

More information

Filtering via a Reference Set. A.Haddad, D. Kushnir, R.R. Coifman Technical Report YALEU/DCS/TR-1441 February 21, 2011

Filtering via a Reference Set. A.Haddad, D. Kushnir, R.R. Coifman Technical Report YALEU/DCS/TR-1441 February 21, 2011 Patch-based de-noising algorithms and patch manifold smoothing have emerged as efficient de-noising methods. This paper provides a new insight on these methods, such as the Non Local Means or the image

More information

The spectral zeta function

The spectral zeta function The spectral zeta function Bernd Ammann June 4, 215 Abstract In this talk we introduce spectral zeta functions. The spectral zeta function of the Laplace-Beltrami operator was already introduced by Minakshisundaram

More information

COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999

COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE. Nigel Higson. Unpublished Note, 1999 COUNTEREXAMPLES TO THE COARSE BAUM-CONNES CONJECTURE Nigel Higson Unpublished Note, 1999 1. Introduction Let X be a discrete, bounded geometry metric space. 1 Associated to X is a C -algebra C (X) which

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

Myopic Models of Population Dynamics on Infinite Networks

Myopic Models of Population Dynamics on Infinite Networks Myopic Models of Population Dynamics on Infinite Networks Robert Carlson Department of Mathematics University of Colorado at Colorado Springs rcarlson@uccs.edu June 30, 2014 Outline Reaction-diffusion

More information

Chapter 8 Integral Operators

Chapter 8 Integral Operators Chapter 8 Integral Operators In our development of metrics, norms, inner products, and operator theory in Chapters 1 7 we only tangentially considered topics that involved the use of Lebesgue measure,

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

WAVELETS WITH COMPOSITE DILATIONS

WAVELETS WITH COMPOSITE DILATIONS ELECTRONIC RESEARCH ANNOUNCEMENTS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 00, Pages 000 000 (Xxxx XX, XXXX S 1079-6762(XX0000-0 WAVELETS WITH COMPOSITE DILATIONS KANGHUI GUO, DEMETRIO LABATE, WANG-Q

More information

Vectors. January 13, 2013

Vectors. January 13, 2013 Vectors January 13, 2013 The simplest tensors are scalars, which are the measurable quantities of a theory, left invariant by symmetry transformations. By far the most common non-scalars are the vectors,

More information

Boolean Inner-Product Spaces and Boolean Matrices

Boolean Inner-Product Spaces and Boolean Matrices Boolean Inner-Product Spaces and Boolean Matrices Stan Gudder Department of Mathematics, University of Denver, Denver CO 80208 Frédéric Latrémolière Department of Mathematics, University of Denver, Denver

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Compact operators on Banach spaces

Compact operators on Banach spaces Compact operators on Banach spaces Jordan Bell jordan.bell@gmail.com Department of Mathematics, University of Toronto November 12, 2017 1 Introduction In this note I prove several things about compact

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information

Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings

Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings Data Analysis and Manifold Learning Lecture 3: Graphs, Graph Matrices, and Graph Embeddings Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

Discrete quantum random walks

Discrete quantum random walks Quantum Information and Computation: Report Edin Husić edin.husic@ens-lyon.fr Discrete quantum random walks Abstract In this report, we present the ideas behind the notion of quantum random walks. We further

More information

The oblique derivative problem for general elliptic systems in Lipschitz domains

The oblique derivative problem for general elliptic systems in Lipschitz domains M. MITREA The oblique derivative problem for general elliptic systems in Lipschitz domains Let M be a smooth, oriented, connected, compact, boundaryless manifold of real dimension m, and let T M and T

More information

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms.

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. Vector Spaces Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. For each two vectors a, b ν there exists a summation procedure: a +

More information

Multiscale bi-harmonic Analysis of Digital Data Bases and Earth moving distances.

Multiscale bi-harmonic Analysis of Digital Data Bases and Earth moving distances. Multiscale bi-harmonic Analysis of Digital Data Bases and Earth moving distances. R. Coifman, Department of Mathematics, program of Applied Mathematics Yale University Joint work with M. Gavish and W.

More information

A SHORT PROOF OF THE COIFMAN-MEYER MULTILINEAR THEOREM

A SHORT PROOF OF THE COIFMAN-MEYER MULTILINEAR THEOREM A SHORT PROOF OF THE COIFMAN-MEYER MULTILINEAR THEOREM CAMIL MUSCALU, JILL PIPHER, TERENCE TAO, AND CHRISTOPH THIELE Abstract. We give a short proof of the well known Coifman-Meyer theorem on multilinear

More information

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 While these notes are under construction, I expect there will be many typos. The main reference for this is volume 1 of Hörmander, The analysis of liner

More information

10. Smooth Varieties. 82 Andreas Gathmann

10. Smooth Varieties. 82 Andreas Gathmann 82 Andreas Gathmann 10. Smooth Varieties Let a be a point on a variety X. In the last chapter we have introduced the tangent cone C a X as a way to study X locally around a (see Construction 9.20). It

More information

Diameters and eigenvalues

Diameters and eigenvalues CHAPTER 3 Diameters and eigenvalues 31 The diameter of a graph In a graph G, the distance between two vertices u and v, denoted by d(u, v), is defined to be the length of a shortest path joining u and

More information

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional 15. Perturbations by compact operators In this chapter, we study the stability (or lack thereof) of various spectral properties under small perturbations. Here s the type of situation we have in mind:

More information

A characterization of the two weight norm inequality for the Hilbert transform

A characterization of the two weight norm inequality for the Hilbert transform A characterization of the two weight norm inequality for the Hilbert transform Ignacio Uriarte-Tuero (joint works with Michael T. Lacey, Eric T. Sawyer, and Chun-Yen Shen) September 10, 2011 A p condition

More information

Eigenvalues and eigenfunctions of the Laplacian. Andrew Hassell

Eigenvalues and eigenfunctions of the Laplacian. Andrew Hassell Eigenvalues and eigenfunctions of the Laplacian Andrew Hassell 1 2 The setting In this talk I will consider the Laplace operator,, on various geometric spaces M. Here, M will be either a bounded Euclidean

More information

Chapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves

Chapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves Chapter 3 Riemannian Manifolds - I The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves embedded in Riemannian manifolds. A Riemannian manifold is an abstraction

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

Lebesgue Measure on R n

Lebesgue Measure on R n 8 CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Elements of Convex Optimization Theory

Elements of Convex Optimization Theory Elements of Convex Optimization Theory Costis Skiadas August 2015 This is a revised and extended version of Appendix A of Skiadas (2009), providing a self-contained overview of elements of convex optimization

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Journal of Mathematical Analysis and Applications. Properties of oblique dual frames in shift-invariant systems

Journal of Mathematical Analysis and Applications. Properties of oblique dual frames in shift-invariant systems J. Math. Anal. Appl. 356 (2009) 346 354 Contents lists available at ScienceDirect Journal of Mathematical Analysis and Applications www.elsevier.com/locate/jmaa Properties of oblique dual frames in shift-invariant

More information

DEVELOPMENT OF MORSE THEORY

DEVELOPMENT OF MORSE THEORY DEVELOPMENT OF MORSE THEORY MATTHEW STEED Abstract. In this paper, we develop Morse theory, which allows us to determine topological information about manifolds using certain real-valued functions defined

More information