The Nyström Extension and Spectral Methods in Learning

1. The Nyström Extension and Spectral Methods in Learning
New bounds and algorithms for high-dimensional data sets
Patrick J. Wolfe (joint work with Mohamed-Ali Belabbas)
Statistics and Information Sciences Laboratory, School of Engineering and Applied Sciences, and Department of Statistics, Harvard University
Future Directions in High-Dimensional Data Analysis, Newton Institute Workshop, Cambridge University, 26 June 2008

2. Outline
1. Introduction: spectral methods and statistical learning; approximating a positive semi-definite kernel; discriminating between data and information
2. Nyström Approximation and Multi-Index Selection: the Nyström extension as an approximation method; randomized multi-index selection by weighted sampling; deterministic multi-index selection by sorting
3. Numerical Results and Algorithmic Implementation: approximate sampling; low-rank kernel approximation; methods for nonlinear embeddings

5. Spectral Methods in Learning: the discrepancy between data and information
- What role do spectral methods play in statistical learning?
- Goal: get relevant information about very large data sets in very high-dimensional spaces (image segmentation, low-dimensional embeddings, ...).
- What is the relevant information contained in the data set?
- Spectral methods reduce this question to finding a low-rank approximation to a symmetric, positive semi-definite (SPSD) kernel, equivalently a quadratic form or positive kernel.
- They can be quite effective, and see wide use:
  - Older methods: principal components analysis (1901), multidimensional scaling (1958), ...
  - Newer methods: Isomap, Laplacian eigenmaps, Hessian eigenmaps, diffusion maps, ...

6. Application of Low-Rank Approximations to Learning: inner and outer characteristics of the point cloud
Let {x_1, ..., x_n} be a collection of data points in R^m. Spectral methods can be classified according to whether they rely on:
- Outer characteristics of the point cloud (PCA, discriminants). Here we work directly in the ambient space; these require spectral analysis of a positive-definite kernel of dimension m, the extrinsic dimensionality of the data.
- Inner characteristics of the point cloud (MDS, extensions). Embedding requires the spectral analysis of a kernel of dimension n, the cardinality of the point cloud.
In either case, the spectral analysis task typically consists of finding a rank-k approximation to a symmetric, positive semi-definite matrix; a small sketch of the dimension distinction follows.
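To make the m-versus-n distinction concrete, here is a minimal numpy sketch (not from the talk; the random point cloud and all variable names are illustrative assumptions) contrasting the m x m covariance kernel analyzed by outer methods with the n x n Gram kernel analyzed by inner methods.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 20                      # n points in ambient dimension m (hypothetical)
X = rng.standard_normal((n, m))     # one data point per row
Xc = X - X.mean(axis=0)             # center the point cloud

# Outer characteristics (PCA-style): an m x m covariance kernel.
C = Xc.T @ Xc / n                   # shape (m, m): the extrinsic dimensionality
print(C.shape)

# Inner characteristics (MDS-style): an n x n Gram kernel.
G = Xc @ Xc.T                       # shape (n, n): the cardinality of the cloud
print(G.shape)

# Both kernels are SPSD and share their nonzero eigenvalues
# (up to the 1/n scaling used above); only their sizes differ.
evC = np.linalg.eigvalsh(C)[::-1]
evG = np.linalg.eigvalsh(G)[::-1] / n
print(np.allclose(evC[:m], evG[:m]))
```

The two kernels carry the same spectral information, but the cost of their spectral analysis scales with m in the outer case and with n in the inner case.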

7. How to Approximate a Positive Kernel, in Theory? Finding a low-rank approximation is easy...
- An SPSD matrix Q can be written in spectral coordinates as Q = U Λ U^T, where U is orthogonal and Λ = diag(λ_1, ..., λ_n) is diagonal.
- The λ_i are the eigenvalues of Q, ordered so that λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0, and the u_i are the corresponding eigenvectors.
- For any unitarily invariant matrix norm, we have that
    argmin_{Q̃ : rank(Q̃) = k} ‖Q − Q̃‖ = U Λ_k U^T =: Q_k,
  where Λ_k = diag(λ_1, λ_2, ..., λ_k, 0, ..., 0).
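As a concrete illustration, the following numpy sketch (an illustrative addition, not code from the talk) forms the optimal rank-k approximant by truncating the eigendecomposition and checks the resulting Frobenius error against the discarded eigenvalues.

```python
import numpy as np

def best_rank_k(Q, k):
    """Optimal rank-k approximation of an SPSD matrix Q (for any unitarily
    invariant norm): keep the k largest eigenvalues and their eigenvectors."""
    lam, U = np.linalg.eigh(Q)          # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]      # reorder so that lam[0] >= lam[1] >= ...
    return (U[:, :k] * lam[:k]) @ U[:, :k].T

# Small demonstration on a random SPSD kernel.
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))
Q = A @ A.T
k = 10
Qk = best_rank_k(Q, k)
lam = np.sort(np.linalg.eigvalsh(Q))[::-1]
# The Frobenius error of the optimal rank-k approximant is the root of the sum
# of the squared discarded eigenvalues; the trace-norm error is their plain sum.
print(np.linalg.norm(Q - Qk, "fro"), np.sqrt(np.sum(lam[k:] ** 2)))
```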

8. How to Approximate a Positive Kernel, in Practice? Finding a low-rank approximation is hard!
- Changing to spectral coordinates is done using the singular value decomposition of Q, which requires O(n^3) operations.
- On a Pentium IV 3 GHz desktop PC with 1 GB RAM and 512 KB cache, extrapolating to n = 10^6, factoring Q takes more than 4 months.
- As n increases, Q also quickly becomes too large to be stored in memory.

9. Approximating Very Large Kernels: how to discriminate between data and information?
- This presents a practical problem for large data sets!
- A commonly used trick is to sparsify the kernel: fix ε > 0 and, whenever Q_ij ≤ ε, set Q_ij = 0. Questions: how do we choose ε, and how accurate is the result?
- An alternative approach is to discard some of the data. How can we construct a low-rank approximation using just some of the data? The Nyström extension provides an answer.
- The idea behind the Nyström extension is as follows: write Q = X^T X, so that Q is a Gram matrix for column vectors x_1, ..., x_n. Choose a subset J of the vectors x_i, together with their correlations with all the other vectors, to find a low-rank approximation Q̃.

10. A Provably Good Low-Rank Approximation: our main result on approximating quadratic forms
- How should we choose J, with |J| = k, so as to minimize ‖Q − Q̃‖? This is equivalent to asking how to choose the most informative part of our data set, "most informative" being conditioned on our reconstruction scheme.
- There are (n choose k) such subsets, so there is no hope of enumerating them.
- Instead, we define a distribution on k-subsets by p_{Q,k}(J) ∝ det(Q_{J,J}), for principal submatrices Q_{J,J}.
- Let Q have eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0. Our main result will be to show that, for any unitarily invariant norm,
    E‖Q − Q̃‖ ≤ (k + 1)(λ_{k+1} + λ_{k+2} + ... + λ_n).

12. The Nyström Extension: simplify the problem
- Historically, the Nyström extension was introduced to obtain numerical solutions to integral equations.
- Let k : [0,1] × [0,1] → R be an SPD kernel and let (f_i, λ_i), i ∈ N, denote its pairs of eigenfunctions and eigenvalues:
    ∫_0^1 k(x, y) f_i(y) dy = λ_i f_i(x),   i ∈ N.
- The Nyström extension approximates the eigenvectors of k(x, y) based on a discretization of the interval [0,1].
- Define the M + 1 points x_m by x_m = x_{m-1} + 1/M with x_0 = 0, so that the x_m are evenly spaced along the interval [0,1].
- Form the Gram matrix K_{mn} := k(x_m, x_n) and decompose it.

13. The Nyström Extension: extend the solution
- We now solve a finite-dimensional problem:
    (1/(M+1)) Σ_n K_{mn} v_i(n) = λ_i^v v_i(m),   i = 0, 1, ..., M,
  where (v_i, λ_i^v) are the eigenvector-eigenvalue pairs associated with K.
- What do we do with these eigenvectors? We extend them to give an estimate f̂_i of the i-th eigenfunction as follows:
    f̂_i(x) = (1/((M+1) λ_i^v)) Σ_m k(x, x_m) v_i(m).
- In essence: use only partial information about the kernel to solve a simpler eigenvalue problem, and then extend the solution using complete knowledge of the kernel.
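The following numpy sketch (an illustrative addition; the Gaussian kernel and its bandwidth are assumptions, not taken from the slides) carries out exactly this recipe on [0,1]: solve the discrete eigenproblem for K/(M+1), then extend each eigenvector to an eigenfunction estimate.

```python
import numpy as np

# Nystrom extension for an integral-operator eigenproblem, sketched with a
# Gaussian kernel on [0,1] (the kernel choice is an illustrative assumption).
def kern(x, y, sigma=0.2):
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

M = 50
x = np.arange(M + 1) / M                      # M+1 evenly spaced points on [0,1]
K = kern(x[:, None], x[None, :])              # Gram matrix K_{mn} = k(x_m, x_n)

lam_v, V = np.linalg.eigh(K / (M + 1))        # discrete eigenpairs (ascending)
lam_v, V = lam_v[::-1], V[:, ::-1]            # largest eigenvalue first

def f_hat(i, xs):
    """Nystrom estimate of the i-th eigenfunction, evaluated at new points xs."""
    return kern(xs[:, None], x[None, :]) @ V[:, i] / ((M + 1) * lam_v[i])

# At the original sample points the extension reproduces the discrete eigenvector,
# a quick sanity check on the 1/((M+1) * lambda) scaling in the formula above.
print(np.allclose(f_hat(0, x), V[:, 0]))
```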

14. The Nyström Extension: in finite dimensions
- How do we translate this to a finite-dimensional setting? We approximate k eigenvectors of an SPD matrix Q by decomposing and extending a k × k principal submatrix of Q.
- We partition Q as follows:
    Q = [ Q_J   Y
          Y^T   Z ],
  with Q_J ∈ R^{k×k}; we say that this partition corresponds to the multi-index J = {1, 2, ..., k}.
- Define the spectral decompositions Q = U Λ U^T and Q_J = U_J Λ_J U_J^T.

15. The Nyström Extension: the approximation error
- The Nyström extension then provides an approximation for k eigenvectors in U as
    Ũ := [ U_J
           Y^T U_J Λ_J^{-1} ],   where Q_J = U_J Λ_J U_J^T.
- In turn, the approximations Ũ ≈ U and Λ_J ≈ Λ may be composed to yield an approximation Q̃ ≈ Q according to
    Q̃ := Ũ Λ_J Ũ^T = [ Q_J   Y
                        Y^T   Y^T Q_J^{-1} Y ].
- The resultant approximation error is ‖Q − Q̃‖ = ‖Z − Y^T Q_J^{-1} Y‖, the norm of the Schur complement of Q_J in Q.
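A compact way to compute this approximation in practice uses the equivalent outer-product form Q̃ = Q_{:,J} Q_J^{-1} Q_{J,:}. The sketch below (illustrative, not from the talk, and using a pseudoinverse in place of Q_J^{-1} for numerical safety) forms Q̃ for a fixed multi-index J and confirms that the error equals the Schur complement of Q_J in Q.

```python
import numpy as np

def nystrom(Q, J):
    """Nystrom approximation of an SPSD matrix Q from the multi-index J:
    Q_tilde = Q[:, J] @ pinv(Q[J, J]) @ Q[J, :]."""
    J = np.asarray(J)
    C = Q[:, J]                       # columns indexed by J (contains Q_J and Y)
    W = Q[np.ix_(J, J)]               # principal submatrix Q_J
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 30))
Q = A @ A.T                           # a rank-30 SPSD kernel
J = list(range(10))
Qt = nystrom(Q, J)

# The only nonzero error block is the Schur complement of Q_J in Q.
Jc = np.setdiff1d(np.arange(Q.shape[0]), J)
S = Q[np.ix_(Jc, Jc)] - Q[np.ix_(Jc, J)] @ np.linalg.pinv(Q[np.ix_(J, J)]) @ Q[np.ix_(J, Jc)]
print(np.allclose(np.linalg.norm(Q - Qt, "fro"), np.linalg.norm(S, "fro")))
```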

16. The Nyström Extension: another point of view
- What problem does the Nyström extension solve? Our answer requires the partial order on the cone of SPSD matrices: write Q ⪰ 0 if Q is SPSD, and Q̃ ⪯ Q if Q − Q̃ ⪰ 0.
- The Nyström extension Q̃ is the minimal (with respect to ⪯) positive-definite projection of Q onto the range of Q_J.
- Is there a more statistical interpretation? Think linear models:
  - Recall that any Q ⪰ 0 admits a Gram representation Q = X^T X.
  - Imagine selecting the first n − 1 columns of X and regressing the final column x_n onto these.
  - Our error term is then nothing more than the residual sum of squares. It is zero if and only if x_n lies in the span of x_1, ..., x_{n-1}, implying singular Q and perfect reconstruction; a small numerical check follows.
  - More generally, the Schur complement S_C(Q_J) is recognizable as the covariance of the best linear unbiased estimator.
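Here is a small numerical check of the regression interpretation (an illustrative addition, not from the talk): regressing the final column of X onto the others leaves a residual sum of squares equal to the 1 x 1 Schur complement of Q_J in Q for J = {1, ..., n-1}.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 8
X = rng.standard_normal((m, n))       # columns x_1, ..., x_n
Q = X.T @ X                           # Gram matrix

# Regress x_n onto x_1, ..., x_{n-1}; the RSS is the 1x1 Schur complement.
XJ, xn = X[:, :-1], X[:, -1]
beta, *_ = np.linalg.lstsq(XJ, xn, rcond=None)
rss = np.sum((xn - XJ @ beta) ** 2)

QJ = Q[:-1, :-1]
y = Q[:-1, -1]
schur = Q[-1, -1] - y @ np.linalg.solve(QJ, y)
print(np.allclose(rss, schur))
```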

17. Adjusting Computational Load vs. Approximation Error: from eigenanalysis to partitioning
- Back to our kernel approximation problem: on what fronts do we gain by using the Nyström extension?
- What is required is the spectral analysis of a kernel of size k ≪ n, a gain in both space and time complexity.
- But we have introduced another problem: how to partition Q? In other words, we have shifted the computational load from eigenanalysis to the determination of a good partition.
- The latter problem is more amenable to approximation; we give two algorithms to solve it, along with error bounds.

18. The Nyström Extension: a combinatorial problem
We now introduce the problem formally, with notation:
- J, K ⊆ {1, ..., n} are multi-indices of respective cardinalities k and l, containing pairwise distinct elements of {1, ..., n}.
- We write J = {j_1, ..., j_k} and K = {k_1, ..., k_l}, and denote by J̄ the complement of J in {1, ..., n}.
- Write Q_{J,K} for the k × l matrix whose (p, q)-th entry is given by (Q_{J,K})_{pq} = Q_{j_p k_q}, and abbreviate Q_J for Q_{J,J}.
- The partitioning problem is equivalent to selecting a multi-index J such that the error
    ‖Q − Q̃‖ = ‖Q_{J̄} − Q_{J̄,J} Q_J^{-1} Q_{J,J̄}‖ = ‖S_C(Q_J)‖
  is minimized for some unitarily invariant norm.

19. The Nyström Method and Exact Reconstruction: recovery when rank(Q_J) = rank(Q) = k
- When does the Nyström method admit exact reconstruction?
- If we take for J the entire set {1, 2, ..., n}, then the Nyström extension yields Q̃ = Q trivially.
- If Q is of rank k < n, then there exist J with |J| = k such that the Nyström method yields exact reconstruction; these J are precisely those for which rank(Q_J) = rank(Q) = k.
- Intuition: express Q as a Gram matrix whose entries comprise the inner products of n vectors in R^k. Knowing the correlations of these n vectors with a subset of k linearly independent vectors allows us to recover them. The information contained in Q_J and in the correlations with the remaining vectors is thus sufficient to reconstruct Q, and the Nyström extension performs the reconstruction.
- To verify this, we introduce our first lemma.

20. Verifying the Perfect Reconstruction Property: characterizing Schur complements as ratios of determinants
Lemma (Crabtree-Haynsworth). Let Q_J be a nonsingular principal submatrix of some SPSD matrix Q. The Schur complement of Q_J in Q is given element-wise by
    (S_C(Q_J))_{ij} = det(Q_{J∪{i}, J∪{j}}) / det(Q_J).   (1)
- This implies that for J such that rank(Q_J) = rank(Q) = k,
    S_C(Q_J) = Q_{J̄} − Q_{J̄,J} Q_J^{-1} Q_{J,J̄} = 0.
- Indeed, if rank(Q) = k = |J|, then (1) implies that diag(S_C(Q_J)) = 0; since Q ⪰ 0 implies S_C(Q_J) ⪰ 0 for any multi-index J, we may conclude that S_C(Q_J) is identically zero.
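The lemma is easy to check numerically; the sketch below (illustrative, not from the talk) compares each entry of the Schur complement against the corresponding ratio of determinants for a random positive-definite Q.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 7, 3
A = rng.standard_normal((n, n))
Q = A @ A.T                                   # generically positive definite

J = list(range(k))
Jc = [i for i in range(n) if i not in J]
# Schur complement of Q_J in Q, computed directly.
S = Q[np.ix_(Jc, Jc)] - Q[np.ix_(Jc, J)] @ np.linalg.solve(Q[np.ix_(J, J)], Q[np.ix_(J, Jc)])

# Crabtree-Haynsworth: each entry is a ratio of determinants.
detQJ = np.linalg.det(Q[np.ix_(J, J)])
ok = True
for a, i in enumerate(Jc):
    for b, j in enumerate(Jc):
        rows, cols = J + [i], J + [j]
        ratio = np.linalg.det(Q[np.ix_(rows, cols)]) / detQJ
        ok &= np.isclose(S[a, b], ratio)
print(ok)
```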

21. A Connection to Compound Matrices: characterizing Schur complements as ratios of determinants
- By Crabtree-Haynsworth we may write
    tr(S_C(Q_J)) = det(Q_J)^{-1} Σ_{i∉J} det(Q_{J∪{i}}).
- This is in fact the trace norm of S_C(Q_J); it is unitarily invariant and dominates all unitarily invariant matrix norms.
- Controlling the trace norm ‖S_C(Q_J)‖_tr = ‖Q − Q̃‖_tr is hence key to bounding the Nyström approximation error.
- Eigenvalue majorization results tell us about worst-case behavior but little else. Ideally we would like a relative-error approximation that is invariant to rotation.
- Recalling the optimal approximation via the SVD, the best possible bound would be multiplicative in the minimal trace-norm error Σ_{i=k+1}^n λ_i(Q).

22. A Connection to Compound Matrices: characterizing Schur complements as ratios of determinants
- Crabtree-Haynsworth provides an entry-wise characterization of the Schur complement in terms of ratios of determinants.
- To exploit this, we require the notion of exterior algebras by way of compound matrices. Specifically, the k-th compound matrix Q^(k) of Q is comprised of all determinants det(Q_{J,K}) of k-subsets J, K.
- The following important result allows us to express tr(Q^(k)) as a sum of k-fold products of eigenvalues of Q:
Theorem (Cauchy-Binet). Let Q admit the spectral decomposition Q = U Λ U^T. Then, if Q^(k) denotes the k-th compound matrix of Q, we have that
    (U Λ U^T)^(k) = U^(k) Λ^(k) (U^T)^(k).
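Since tr(Q^(k)) is the sum of all principal k-minors of Q, Cauchy-Binet says it equals the sum of all k-fold products of eigenvalues. The brute-force sketch below (illustrative, not from the talk, and feasible only for small n) verifies this identity.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, k = 6, 3
A = rng.standard_normal((n, n))
Q = A @ A.T

# tr(Q^(k)) = sum of all principal k-minors of Q ...
tr_compound = sum(np.linalg.det(Q[np.ix_(J, J)]) for J in combinations(range(n), k))

# ... which, by Cauchy-Binet, equals the sum of all k-fold products of eigenvalues.
lam = np.linalg.eigvalsh(Q)
e_k = sum(np.prod([lam[i] for i in S]) for S in combinations(range(n), k))

print(np.isclose(tr_compound, e_k))
```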

23. Randomized Multi-Index Selection by Weighted Sampling: statement of the main result
- Since Q ⪰ 0, we may use it to induce a probability distribution on the set of all J with |J| = k as follows: p_{Q,k}(J) ∝ det(Q_J).
- Thus, we may first sample J ~ p_{Q,k}(J), and then perform the Nyström extension on the chosen multi-index:
Theorem (Randomized Multi-Index Selection). Let Q ⪰ 0 have eigenvalues λ_1 ≥ ... ≥ λ_n ≥ 0, and let Q̃ be the Nyström approximation to Q corresponding to J, with J ~ p_{Q,k}(J) ∝ det(Q_J). Then, for any unitarily invariant matrix norm,
    E‖Q − Q̃‖ ≤ (k + 1) Σ_{i=k+1}^n λ_i.   (2)
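For small n the distribution p_{Q,k} can be enumerated exactly, which gives a direct check of the theorem. The sketch below (illustrative, not from the talk) computes the exact expected trace-norm error under determinantal sampling and compares it with the bound (2).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n, k = 8, 3
A = rng.standard_normal((n, n))
Q = A @ A.T

def schur_trace(Q, J):
    """Trace-norm error of the Nystrom approximation with multi-index J
    (the error matrix is PSD, so its trace norm is just its trace)."""
    J = list(J)
    Jc = [i for i in range(Q.shape[0]) if i not in J]
    S = Q[np.ix_(Jc, Jc)] - Q[np.ix_(Jc, J)] @ np.linalg.solve(Q[np.ix_(J, J)], Q[np.ix_(J, Jc)])
    return np.trace(S)

subsets = list(combinations(range(n), k))
weights = np.array([np.linalg.det(Q[np.ix_(J, J)]) for J in subsets])
probs = weights / weights.sum()                       # p_{Q,k}(J) proportional to det(Q_J)

expected_err = sum(p * schur_trace(Q, J) for p, J in zip(probs, subsets))
lam = np.sort(np.linalg.eigvalsh(Q))[::-1]
bound = (k + 1) * lam[k:].sum()
print(expected_err <= bound + 1e-9, expected_err, bound)
```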

24. Proof of the Randomized Multi-Index Result I: randomized algorithm for multi-index selection
Proof. We seek to bound
    E‖Q − Q̃‖_tr = (1 / tr(Q^(k))) Σ_{J : |J| = k} det(Q_J) ‖S_C(Q_J)‖_tr.
The Crabtree-Haynsworth Lemma yields
    tr(S_C(Q_J)) = Σ_{i∉J} det(Q_{J∪{i}}) / det(Q_J),
and thus
    E‖Q − Q̃‖_tr = (1 / tr(Q^(k))) Σ_{J : |J| = k} Σ_{i∉J} det(Q_{J∪{i}}).   (3)

25. Proof of the Randomized Multi-Index Result II: randomized algorithm for multi-index selection
Proof (continued). Every multi-index of cardinality k + 1 appears exactly k + 1 times in the double sum of (3) above, whence
    E‖Q − Q̃‖_tr = ((k + 1) / tr(Q^(k))) Σ_{J : |J| = k+1} det(Q_J) = (k + 1) tr(Q^(k+1)) / tr(Q^(k)).   (4)
By Cauchy-Binet, the sum of the principal (k+1)-minors of Q can be expressed as the sum of (k+1)-fold products of its ordered eigenvalues, and so, factoring, we obtain
    E‖Q − Q̃‖_tr = (k + 1) tr(Q^(k+1)) / tr(Q^(k)) ≤ (k + 1) (Σ_{i=k+1}^n λ_i(Q)) tr(Q^(k)) / tr(Q^(k)).

26. Deterministic Low-Rank Kernel Approximation: deterministic multi-index selection by sorting
- We now present a low-complexity deterministic multi-index selection algorithm and provide a bound on its worst-case error.
- Let J contain the indices of the k largest diagonal elements of Q, and then implement the Nyström extension. Then we have:
Theorem (Deterministic Multi-Index Selection). Let Q be a positive-definite kernel, let J contain the indices of its k largest diagonal elements, and let Q̃ be the corresponding Nyström approximation. Then, for any unitarily invariant matrix norm,
    ‖Q − Q̃‖ ≤ Σ_{i∉J} Q_ii.   (5)
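A sketch of the sorting-based selection rule follows (illustrative, not from the talk; a pseudoinverse is used in place of Q_J^{-1} for numerical safety), together with a check of the diagonal-sum bound (5) in the trace norm.

```python
import numpy as np

def nystrom_by_sorting(Q, k):
    """Deterministic multi-index selection: take the indices of the k largest
    diagonal entries, then form the Nystrom approximation."""
    J = np.argsort(np.diag(Q))[::-1][:k]
    C = Q[:, J]
    return C @ np.linalg.pinv(Q[np.ix_(J, J)]) @ C.T, J

rng = np.random.default_rng(7)
A = rng.standard_normal((300, 40))
Q = A @ A.T
k = 15
Qt, J = nystrom_by_sorting(Q, k)

err_tr = np.trace(Q - Qt)                               # trace-norm error (error matrix is PSD)
bound = np.sort(np.diag(Q))[: Q.shape[0] - k].sum()     # sum of the discarded diagonal entries
print(err_tr <= bound + 1e-6, err_tr, bound)
```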

27. Proof of the Deterministic Multi-Index Result I: deterministic algorithm for multi-index selection
- We have sacrificed some power to obtain gains in the deterministic nature of the result and in computational efficiency:
    ‖Q − Q̃‖ ≤ Σ_{i=k+1}^n Q_ii  (sorting, with the diagonal entries ordered decreasingly)
  versus
    E‖Q − Q̃‖ ≤ (k + 1) Σ_{i=k+1}^n λ_i  (sampling).
- The proof of this theorem is straightforward once we have the following generalization of the Hadamard inequality:
Lemma (Fischer's Lemma). If Q is a positive-definite matrix and Q_J a nonsingular principal submatrix, then det(Q_{J∪{i}}) ≤ det(Q_J) Q_ii.

28. Proof of the Deterministic Multi-Index Result II: deterministic algorithm for multi-index selection
Proof of the Theorem. We have from our earlier proof that ‖Q − Q̃‖ ≤ tr(S_C(Q_J)); applying Crabtree-Haynsworth in turn gives
    ‖Q − Q̃‖ ≤ (1 / det(Q_J)) Σ_{i∉J} det(Q_{J∪{i}}),
after which Fischer's Lemma yields ‖Q − Q̃‖ ≤ Σ_{i∉J} Q_ii.
- Though we cannot say more without formally adopting a measure on the cone of SPSD matrices, we have observed that this simple algorithm appears to work well in a variety of practical settings.
- Other stepwise-greedy and sampling approaches to multi-index selection are also available through the notion of pinchings.

29. Remarks and Discussion: key points
- Determinantal sampling exploits the full power of Cauchy-Binet and Crabtree-Haynsworth, yielding a relative-error bound that depends only on the spectrum of Q.
- Assigning zero mass to singular subsets is key; we pay O(k^3) for this.
- Maximizing det(Q_J) is not in general superior, and is NP-hard.
- Tridiagonal restrictions of the subsets Q_J interpolate between our two results in a natural way, both in terms of bounds and of algorithmic complexity; Fischer's Lemma is an extreme example. In general, you pay more to get more.
- Majorization results can be used to obtain quick estimates of the eigenvalues of Q, and hence of the best or expected error of the Nyström approximation.
- If the spectrum of Q does not exhibit fast decay, one should be cautious in approximating it!

30. Remarks and Discussion: comparison to known results
- Drineas et al. (2005) proposed to choose row/column subsets by sampling indices independently and with replacement, in proportion to the elements of {Q_ii^2}_{i=1}^n, and showed only an additive-error bound of the form
    E‖Q − Q̃‖_F ≤ ‖Q − Q_k‖_F + ε Σ_{i=1}^n Q_ii^2.
- Our randomized approach yields a relative-error bound; its algorithmic complexity is O(k^3 + (n − k) k^2).
- Our deterministic approach offers an improvement if tr(Q) ≪ n; its complexity is O(k^3 + (n − k) k^2 + n log k).
- There are connections to the recently introduced notion of volume sampling in theoretical computer science.

32. Implementation of the Sampling Scheme: sampling from p_{Q,k}
- Sampling directly from p_{Q,k}(J) ∝ det(Q_J) is infeasible; simulation methods provide an appealing alternative.
- We employed the Metropolis algorithm to simulate an ergodic Markov chain admitting p_{Q,k}(J) as its equilibrium distribution.
- The proposal distribution is straightforward: exchange one index from J with one index from J̄, uniformly at random. Variants such as mixture kernels are of course possible; we made no attempt to optimize this choice, as its performance in practice was observed to be satisfactory.

33. A Metropolis Algorithm for Sampling from p_{Q,k}: implementation of the sampling scheme
Implementation of the Metropolis sampler is straightforward and intuitive:
- Begin with data X = {x_1, x_2, ..., x_n} ⊂ R^m
- Initialize (in any desired manner) a multi-index J^(0) of cardinality k
- Compute the sub-kernel Q^(0) from X and J^(0)
- After T iterations, return J ~ p_{Q,k} (approximately)

INPUT: data X, 0 ≤ k ≤ n, T > 0, k × k sub-kernel Q^(0) with indices J^(0)
OUTPUT: sampled k-multi-index J
for t = 1 to T do
    pick s ∈ {1, 2, ..., k} uniformly at random, and let j'_s denote the s-th element of J^(t-1)
    pick j_s ∈ {1, 2, ..., n} \ J^(t-1) at random
    Q' ← UpdateKernel(Q^(t-1), X, s, j_s)
    with probability min(1, det(Q') / det(Q^(t-1))) do
        Q^(t) ← Q';   J^(t) ← (J^(t-1) \ {j'_s}) ∪ {j_s}
    otherwise
        Q^(t) ← Q^(t-1);   J^(t) ← J^(t-1)
    end do
end for
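Below is a runnable Python version of this sampler (an illustrative sketch, not the authors' forthcoming code). For simplicity it recomputes det(Q_J) from scratch at each proposal rather than performing the incremental update suggested by UpdateKernel, and all function and variable names are assumptions.

```python
import numpy as np

def metropolis_pqk(Q, k, T, seed=None):
    """Metropolis sampler targeting p_{Q,k}(J) proportional to det(Q_J).
    Each step exchanges one index of J with one index outside J and accepts
    the move with probability min(1, det(Q_J') / det(Q_J))."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    J = list(rng.choice(n, size=k, replace=False))          # arbitrary initialization
    log_det = np.linalg.slogdet(Q[np.ix_(J, J)])[1]
    for _ in range(T):
        s = rng.integers(k)                                  # position in J to swap out
        outside = np.setdiff1d(np.arange(n), J)
        j_new = rng.choice(outside)                          # candidate index to swap in
        J_prop = J.copy()
        J_prop[s] = j_new
        log_det_prop = np.linalg.slogdet(Q[np.ix_(J_prop, J_prop)])[1]
        if np.log(rng.random()) < log_det_prop - log_det:    # accept with prob min(1, ratio)
            J, log_det = J_prop, log_det_prop
    return sorted(J)

# Example: sample a multi-index for a random SPSD kernel; the result can then
# be handed to the Nystrom extension of the earlier slides.
rng = np.random.default_rng(8)
A = rng.standard_normal((200, 200))
Q = A @ A.T
print(metropolis_pqk(Q, k=10, T=2000, seed=0))
```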

34. Simulation Results: experimental setup
- First, we compare the different randomized algorithms for multi-index selection with one another, and with the method of Drineas et al. (2005).
- We use three different settings for approximation-error evaluation: principal components, diffusion maps, and Laplacian eigenmaps.
- We draw kernels at random from ensembles relevant to the test setting, and then average (though the results do not imply a measure on the input space).
- For each kernel drawn, we further average over many runs of the randomized algorithm.

35. Approximate Principal Components Analysis: randomized multi-index selection
[Figure: relative error (dB) versus approximant rank, for uniform sampling, sampling proportional to G_ii^2, and sampling from p_{G,k}.]
We drew SPD matrices of rank 12 from a Wishart ensemble, and show the error of several algorithms used to perform a low-rank approximation (outputs averaged over 250 trials).

36. Approximate Diffusion Maps: randomized multi-index selection
[Figure: relative error (dB) versus approximant rank, for uniform sampling, sampling proportional to G_ii^2, sampling from p_{G,k}, and the optimal rank-k approximant G_k.]
We sample 500 points uniformly on a circle, and use the diffusion maps algorithm to define an appropriate kernel for embedding. We measured the resultant approximation error, averaged over 100 data sets and over 100 trials per set.

37. Approximate Diffusion Maps: deterministic multi-index selection
[Figure: observed cumulative probability of relative error, for uniform sampling, sampling proportional to G_ii^2, and sampling from p_{G,k}.]
At left, we plot the distribution of approximation error for fixed rank k = 8. The leftmost vertical line marks the optimum; the rightmost vertical line marks the deterministic algorithm. The worst-case error bound of our deterministic algorithm can be clearly seen.

38. Laplacian Eigenmaps Example: embedding a massive dataset
We used the Laplacian eigenmaps algorithm to embed the "fishbowl" data set.
[Figure: a point realization of the "fishbowl" data set, together with embeddings obtained by sampling uniformly and by sampling from p_{G,k}.]

39. Summary: Nyström & Spectral Methods in Learning (work supported by NSF-DMS and DARPA)
- Two alternative strategies for the approximate spectral decomposition of large kernels were presented: randomized multi-index selection (sampling) and deterministic multi-index selection (sorting).
- Simulation studies demonstrated applicability to machine learning tasks, with measurable improvements in performance: low-rank kernel approximation and methods for nonlinear embeddings.
- Related techniques, coupled with the Nyström method, can be applied to approximate the product of two matrices; see the earlier Newton Web Seminar of 03 June 2008, and the SISL website: Matlab and R code forthcoming!
