Low Rank Matrix Approximation


John T. Svadlenka
Ph.D. Program in Computer Science
The Graduate Center of the City University of New York, New York, NY, USA

Abstract

Low rank approximation is a fundamental computation across a broad range of applications where matrix dimension reduction is required. This survey paper on the low rank approximation (LRA) of matrices provides a broad overview of recent progress in the field from both the Theoretical Computer Science (TCS) and the Numerical Linear Algebra (NLA) points of view. While the traditional application areas of LRA come from scientific, engineering, and statistical disciplines, a great deal of recent activity has arisen in image processing, machine learning, and climate informatics, to name just a few of the emerging modern technologies. All of these disciplines and technologies are increasingly challenged by the scale of modern massive data sets (MMDS) collected from a variety of sources, including sensor measurements, computational models, and the Internet. At the same time, applied mathematicians and computer scientists have been seeking alternatives to the standard NLA algorithms, which are fundamentally incapable of handling the sheer size of these MMDSs. Against the backdrop of the disparity between increasing processing power and disk storage capacity on the one hand and memory bandwidth limitations on the other, a further research goal is to find flexible new LRA approaches that leverage the strengths of modern hardware architectures while limiting exposure to data communication bottlenecks. Central to new approaches for MMDSs is the approximation of a matrix by one of much smaller dimension, facilitated by randomization techniques. The term "low rank" in the context of LRA refers to the inherent dimension of the matrix, which may be much smaller than both the actual number of rows and columns of the matrix. Thus, a matrix may be approximated, if not exactly represented, by another matrix containing significantly fewer rows and columns while still preserving the salient characteristics of the original. Recent results have shown that it is possible to attain this dimension reduction using alternative approximation and probabilistic techniques in which randomization plays a key role. This paper surveys classical and recent results in LRA and presents practical algorithms representative of recent research progress. A broad understanding of potential future research directions in LRA should also be evident to readers from theoretical, algorithmic, and computational backgrounds.

Keywords: Low rank approximation, Modern Massive Data Sets, random sketches, random projections, dimension reduction, QR factorization, Singular Value Decomposition, CUR, parallelism

1 Introduction

Numerical Linear Algebra (NLA) provides the theoretical underpinnings and formal framework for analyzing matrices and the various operations on matrices. The results in NLA have typically been shared across a diverse spectrum of areas ranging from the physical and life sciences and engineering to data analysis. More recently, new computational disciplines and application areas have begun to test the limits of traditional NLA algorithms. Although existing algorithms have been rigorously developed, analyzed, and refined over many years, certain inherent limitations of these algorithms have been increasingly exposed in both new and existing areas of application. A confluence of diverse factors, as well as increased interest in NLA by researchers in Theoretical Computer Science, has contributed to recent developments in NLA. Perhaps more importantly, new perspectives on approximation and randomization as applied to NLA have generated new opportunities to further the progress of the field. A prominent factor has been the sheer size of today's MMDSs, such as those encountered in data mining [1], which challenge conventional NLA algorithms. It can be shown that if one is willing to relax the requirement of high-precision results in exchange for faster algorithms, then alternative algorithmic approaches with improved arithmetic complexity are available [2]. These alternative approaches create an approximation of the input matrix that is much smaller in size than the original. The approximate matrix is commonly known as a sketch, owing to the LRA randomization strategies employed to obtain the approximation. This random sketch of much smaller size replaces the original matrix in the application of interest in order to realize the computational savings. The question that naturally arises from this methodology is: how expensive is the LRA algorithm? While traditional LRA algorithms can yield a rank-k approximation of an m × n matrix in O(mnk) time, a randomized algorithm can reduce the asymptotic complexity to O(mn log k) [1], [2].

Another aspect of the approximation trade-off that justifies the approach lies within the context of data sets with inexact content [1]. Moreover, it can be argued that all numerical computation is itself approximate up to machine precision considerations, so an introduction of similarly inexact algorithms (up to a user-specified tolerance) is not an unreasonable proposition. If one also recognizes that other deficiencies in classical algorithms besides computational complexity may be mitigated with alternative methods, then opportunities to realize further computational gains are possible. As an example, arithmetic complexity does not consider the effects of data movement to and from a computer's random access memory (RAM). Floating point operation speed continues to exceed memory bandwidth [31], so that data transfer is a major performance bottleneck with regard to processing MMDSs. The problem is clearly magnified for out-of-core data sets. Algorithms that can significantly reduce the number of passes over the data set (pass efficiency) will reduce the wall-clock time of an algorithm. Similarly, good data and temporal locality contributes to the likelihood of a more favorable performance profile. Data locality refers to the characteristic of an algorithm whereby a segment of data in memory is likely to be processed next if it is located close to the data currently being processed. Likewise, temporal locality is the algorithmic property that a set of operations on adjacent sets of the data occurs in close time-wise succession. The implication is that there is a lower likelihood of having to swap components of the program data set among the various levels of the memory hierarchy multiple times. As a concrete example, consider the QR factorization, which is a standard algorithm for LRA. Matrix-matrix multiplication itself may be executed faster than a QR algorithm [2] due to block operations.

Owing to better use of memory hierarchies, matrix-matrix kernels generally perform better than matrix-vector kernels [33, 29], such as those encountered in QR processing. Therefore, an LRA algorithmic alternative that is based on the former kernels rather than the latter may be more favorably disposed to computational gains. Additionally, matrix-matrix multiplication is embarrassingly parallel [1], and it more generally motivates the search for new LRA strategies that can more fully leverage the parallelism widely supported by modern hardware architectures. It should be noted from the above discussion that scalability improvements in NLA can be addressed in a multifaceted manner, that is, from the theoretical, algorithmic, and computational perspectives. This survey paper examines hybrid algorithms which combine existing elements of conventional deterministic NLA algorithms with randomization and approximation schemes to offer new algorithms of reduced asymptotic complexity. The result is the capacity to process much larger data sets than is possible with conventional algorithms.

The outline of this survey paper after the current introductory section is as follows. We review classical results from the literature in Section 2 to provide some broad background on standard factorizations and algorithms for LRA. Subsequently, the more recent research concerning approximation and probabilistic results is given in Section 3. The results of these two sections form the theoretical basis for the presentation of the randomized hybrid LRA algorithms of Section 4, where we also discuss the strategies and benefits associated with them. Open problems are discussed in Section 5 and concluding remarks follow in Section 6.

2 Classical Results

2.1 Rank k Factorization

In the previous section we mentioned that LRA is concerned with the study of matrix sketches that significantly reduce the number of rows and columns of a matrix. LRA yields two significant benefits: a savings in memory storage and a decrease in the number of arithmetic computations. To see how this is possible, consider that the storage requirement of a dense m × n matrix A is mn memory cells. An LRA of A of rank k, for k ≪ min(m, n), is defined by the factorization

A ≈ BC    (2.1)

such that B ∈ R^{m×k} and C ∈ R^{k×n}. Therefore, the memory storage cost of the LRA of A is O((m + n)k), and (m + n)k ≪ mn. Consider a matrix-vector product involving the original matrix A of the form

x = Av    (2.2)

where x and v are m- and n-dimensional vectors, respectively. This operation requires mn multiplications and m(n − 1) additions, so the number of arithmetic operations is roughly 2mn. A matrix-vector product formulation with the LRA of A is

x ≈ BCv    (2.3)
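To make the savings of (2.1)–(2.3) concrete, here is a minimal sketch (assuming NumPy; matrix sizes and the helper name factored_matvec are illustrative) that compares the dense product Av with the two-stage product B(Cv):

```python
import numpy as np

def factored_matvec(B, C, v):
    """Compute x ~ A v using the rank-k factors A ~ B C.

    Cost: about k*n + m*k multiply-adds instead of m*n for the dense product."""
    y = C @ v          # k x n times n-vector: O(kn)
    return B @ y       # m x k times k-vector: O(mk)

if __name__ == "__main__":
    m, n, k = 2000, 1500, 20
    rng = np.random.default_rng(0)
    # Build a matrix that is exactly rank k so the comparison is exact.
    B = rng.standard_normal((m, k))
    C = rng.standard_normal((k, n))
    A = B @ C
    v = rng.standard_normal(n)

    x_dense = A @ v                        # ~ 2*m*n flops
    x_lowrank = factored_matvec(B, C, v)   # ~ 2*k*(m+n) flops
    print("storage (dense vs factored):", m * n, "vs", (m + n) * k)
    print("max abs difference:", np.max(np.abs(x_dense - x_lowrank)))
```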

We can see that the product y = Cv requires kn + k(n − 1) operations, while the product x = By uses mk + m(k − 1) operations. The overall arithmetic complexity is O(k(m + n)), which offers a significant savings when k ≪ min(m, n). If the matrix-vector product can be left in the rank factorization form

x ≈ By    (2.4)

then the number of operations is further reduced. The metric by which we prefer to measure the accuracy of a rank-k LRA Â_k of A is the (1 + ε) relative-error bound, for small positive ε, of the following form in both the spectral and Frobenius norms:

‖A − Â_k‖ ≤ (1 + ε) ‖A − A_k‖    (2.5)

The relative-error norm bound is a particular example of a multiplicative error bound. A_k is the theoretically best rank-k approximation of A, given by the Singular Value Decomposition, which we discuss next.

2.2 Singular Value Decomposition (SVD)

Though the origins of the Singular Value Decomposition (SVD) can be traced back to the late 1800s, it was not until the 20th century that it eventually evolved into its current and most general form. The SVD exists for any m × n matrix A regardless of whether its entries are real or complex. Let A be an m × n matrix with r = rank(A) whose elements may be complex. Then there exist two unitary matrices U and V such that

A = UΣV*    (2.6)

where U and V are m × m and n × n, respectively, and Σ is an m × n diagonal matrix with nonnegative elements σ_i such that σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 and σ_j = 0 for j > r. We may also write a truncated form of the SVD in which U_r consists of the r left-most columns of U, V_r similarly consists of the r left-most columns of V, and Σ_r = diag(σ_1, ..., σ_r). We write this truncated form as follows:

A = U_r Σ_r V_r*    (2.7)

In this truncated form the columns of U_r and V_r form orthonormal bases for the column spaces of A and A*, respectively. The values σ_1, ..., σ_r are commonly referred to as the singular values of A. More importantly, these singular values indicate the lower bounds on the error of any rank-k approximation of A in the spectral and Frobenius norms. For A_k = U_k Σ_k V_k* we have that

‖A − A_k‖_2 = σ_{k+1}    (2.8)

‖A − A_k‖_F = ( Σ_{j=k+1}^{min(m,n)} σ_j² )^{1/2}    (2.9)
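As a quick numerical check of (2.8) and (2.9), the following sketch (assuming NumPy; sizes are arbitrary) forms the truncated SVD A_k and compares its spectral and Frobenius errors against the trailing singular values:

```python
import numpy as np

def truncated_svd(A, k):
    """Return the best rank-k approximation A_k = U_k Sigma_k V_k^* and all singular values."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :], s

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((300, 200))
    k = 10
    A_k, s = truncated_svd(A, k)

    spec_err = np.linalg.norm(A - A_k, 2)
    frob_err = np.linalg.norm(A - A_k, "fro")
    print("spectral error:", spec_err, " sigma_{k+1}:", s[k])                     # eq. (2.8)
    print("Frobenius error:", frob_err,
          " sqrt(sum of trailing sigma^2):", np.sqrt(np.sum(s[k:] ** 2)))          # eq. (2.9)
```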

The SVD plays a dual role with regard to matrix approximation. Firstly, algorithms exist to compute the decomposition with asymptotic cost O((m + n)mn) [6], from which we may obtain a rank-k approximation for any k = 1, ..., r − 1. For the special case k = 0 we have ‖A‖_2 = σ_1. Secondly, the rank-k approximation from the SVD is utilized as the optimal rank-k approximation for evaluating and comparing decompositions and the approximation algorithms which produce them. It is indeed the high cost of producing the SVD factorization for MMDSs that has motivated the search for new LRA techniques. We also note that from an SVD representation of A we may write its low rank format by first forming the product Σ_k V_k*:

A_k = U_k (Σ_k V_k*)    (2.10)

Alternatively, we may also write:

A_k = (U_k Σ_k) V_k*    (2.11)

Starting with the QR factorization, we present other important classical decompositions in subsequent sections, along with the algorithms that generate them. The SVD may also be obtained from the QR factorization with an additional post-processing step appended to the QR algorithm.

2.3 QR Decomposition

The QR decomposition is a factorization of a matrix A into the product of two matrices: a unitary matrix Q providing an orthogonal basis for A, and an upper triangular matrix R. The significance of this decomposition is evident from its use as a preliminary step in determining a rank-k SVD decomposition of an m × n matrix A. The QR decomposition itself can be obtained faster than the SVD, in O(mn min(m, n)) time. More formally, let A be an m × n matrix with m ≥ n whose elements may be complex. Then there exist an m × n matrix Q and an n × n matrix R such that

A = QR    (2.12)

where the columns of Q are orthonormal and R is upper triangular. Column i of A is a linear combination of the columns of Q with the coefficients given by column i of R. In particular, by the upper triangular form of R, it is clear that column i of A is determined by the first i columns of Q. The existence of the QR factorization can be proven in a variety of ways. We present here a proof using the Gram-Schmidt procedure.

Suppose (a_1, a_2, ..., a_n) is a linearly independent list of vectors in an inner product space V. Then there is an orthonormal list of vectors (q_1, q_2, ..., q_n) such that

span(a_1, a_2, ..., a_n) = span(q_1, q_2, ..., q_n).    (2.13)

Proof: Let proj(r, s) := (⟨r, s⟩ / ⟨s, s⟩) s denote the projection of r onto s, and apply the following steps to obtain the orthonormal list of vectors:

w_1 := a_1
w_2 := a_2 − proj(a_2, w_1)
...
w_n := a_n − proj(a_n, w_1) − proj(a_n, w_2) − ... − proj(a_n, w_{n−1})

q_1 := w_1 / ‖w_1‖,  q_2 := w_2 / ‖w_2‖,  ...,  q_n := w_n / ‖w_n‖

Rearranging the equations for w_1, w_2, ..., w_n so that a_1, a_2, ..., a_n appear on the left-hand side and replacing each w_i with q_i gives A = QR, where A = [a_1, a_2, ..., a_n], Q = [q_1, q_2, ..., q_n], and

R = [ ⟨q_1, a_1⟩  ⟨q_1, a_2⟩  ⟨q_1, a_3⟩  ...  ⟨q_1, a_n⟩ ]
    [     0       ⟨q_2, a_2⟩  ⟨q_2, a_3⟩  ...  ⟨q_2, a_n⟩ ]
    [     0           0       ⟨q_3, a_3⟩  ...  ⟨q_3, a_n⟩ ]
    [    ...         ...          ...     ...      ...    ]
    [     0           0           0       ...  ⟨q_n, a_n⟩ ]

A problem with the practical application of the Gram-Schmidt procedure occurs in the case that rank(A) < n. It is then necessary to determine a permutation of the columns of A such that the first rank(A) columns of Q are orthonormal. Let P denote the n × n matrix representing this column permutation. We have the QRP formulation:

A = QRP    (2.14)

Enhancements to the basic QR algorithm that obtain both the QR factorization and the permutation matrix are known as QR with column pivoting. A further improvement discerns the rank of the matrix A, though at a higher cost. The goal is to find the permutation matrix P for the construction of R such that

R = [ R_11  R_12 ]
    [  0    R_22 ]    (2.15)

If ‖R_22‖ is small and R_11 is r × r, it can be shown that σ_{r+1}(A) ≤ ‖R_22‖_2, so that the numerical rank of A is r. These constructions are known as rank-revealing QR (RRQR) factorizations and are the most commonly used forms of the QR algorithm in use today. A deterministic RRQR algorithm was given by Gu and Eisenstat in their seminal paper [5]; it finds a k-column subset C of the input matrix A such that the projection of A onto C has error relative to the best rank-k approximation of A as follows:

‖A − CC⁺A‖_2 ≤ √(1 + k(n − k)) ‖A − A_k‖_2    (2.16)

The above result matches the classical existence result of Ruston [7]. The reader is referred to [5] for more information on RRQR factorizations.
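The Gram-Schmidt construction above, together with the column-pivoting idea behind RRQR, can be sketched as follows (assuming NumPy and SciPy; this is an illustration, not a robust rank-revealing implementation):

```python
import numpy as np
from scipy.linalg import qr as scipy_qr

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of a full-column-rank matrix A (m >= n)."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for i in range(n):
        w = A[:, i].copy()
        for j in range(i):
            R[j, i] = Q[:, j] @ A[:, i]   # <q_j, a_i>
            w -= R[j, i] * Q[:, j]        # subtract proj(a_i, q_j)
        R[i, i] = np.linalg.norm(w)
        Q[:, i] = w / R[i, i]
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((8, 5))
    Q, R = gram_schmidt_qr(A)
    print("||A - QR|| =", np.linalg.norm(A - Q @ R))
    print("||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - np.eye(5)))

    # Column-pivoted QR: A[:, piv] = Qp @ Rp, the basis for rank-revealing variants.
    Qp, Rp, piv = scipy_qr(A, pivoting=True)
    print("pivot order:", piv, " |diag(R)|:", np.abs(np.diag(Rp)))
```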

2.4 Skeleton (CUR) Decomposition

A different approach to LRA is one that selects a subset of actual rows and columns of the input matrix as factors in an approximation, instead of an orthogonal matrix. The CUR decomposition consists of a matrix C of a subset of columns of the original matrix A and a matrix R containing a subset of rows of A; U is a suitably chosen matrix that completes the decomposition. The problem described in this section is the submatrix selection problem.

Let A be an m × n matrix of real elements with rank(A) = r. Then there exists a nonsingular r × r submatrix Â of A. Moreover, let I and J be the sets of row and column indices of A, respectively, appearing in Â, so that C = A(1..m, J) and R = A(I, 1..n). For U = Â^{-1}, we have that

A = CUR    (2.17)

Therefore, it is clear that a subset of r columns and r rows captures A's column and row spaces, respectively. This skeleton stands in contrast to the SVD's left and right singular vectors, which are unitary. While it is NP-hard to find optimal row and column subsets, an advantage of this representation is that its content is conducive to being understood in application terms and with domain knowledge. Moreover, the CUR decomposition may preserve structural properties of the original matrix (such as sparsity or nonnegativity) that would otherwise be lost in the admittedly somewhat abstract decompositional reduction to unitary matrices. On the other hand, it is not guaranteed that Â is well-conditioned. The CUR decomposition requires O((m + n + r)r) memory space and may be simplified to a rank factorization format by writing GH = CUR, where G = CU and H = R, or G = C and H = UR.

We shall see in a later section that researchers from both Numerical Linear Algebra (NLA) and Theoretical Computer Science (TCS) have provided different algorithmic approaches to LRA employing the CUR decomposition. NLA algorithms have focused on the particular choice of an Â that maximizes the absolute value of its determinant, while TCS favors column and row sampling strategies based on sampling probabilities derived from the Euclidean norms of either the matrix's singular vectors or of its actual rows and columns. It is the construction of the sampling probabilities from singular vectors, commonly known as leverage scores, that is responsible for the computational complexity bound in the TCS approach. In the NLA algorithms the absolute value of the determinant is a proxy for quantifying the orthogonality of the columns of a matrix. This topic is discussed in more detail in a later section.

A variation of the above CUR strategy is to obtain a rank-k matrix C from columns of the original matrix A and project A onto C. This approximate decomposition is given by A ≈ CC⁺A and is known as a CX decomposition, where X := C⁺A. The key idea is to project the matrix A onto a rank-k subspace of A as given by C. Thus, a rank-k factorization may be given by GH = CC⁺A, where G = C and H = C⁺A. In the next section we present a related form of decomposition known as the Interpolative Decomposition.
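As a small illustration of the CX idea, the sketch below (assuming NumPy; the column indices are chosen arbitrarily here rather than by any of the sampling schemes discussed in Section 3) projects A onto a subset of its own columns via CC⁺A:

```python
import numpy as np

def cx_approximation(A, cols):
    """Project A onto the span of the chosen columns: A ~ C (C^+ A)."""
    C = A[:, cols]                 # m x c subset of actual columns of A
    X = np.linalg.pinv(C) @ A      # c x n coefficient matrix, X = C^+ A
    return C, X

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # Low-rank-plus-noise test matrix.
    A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))
    A += 0.01 * rng.standard_normal(A.shape)

    cols = [0, 5, 10, 17, 23, 31, 40, 55]   # illustrative choice of 8 columns
    C, X = cx_approximation(A, cols)
    err = np.linalg.norm(A - C @ X, "fro") / np.linalg.norm(A, "fro")
    print("relative Frobenius error of CX:", err)
```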

2.5 Interpolative Decomposition (ID)

The intuition motivating the ID is that if an m × n matrix A has rank k, then it is reasonable to expect to be able to use some representative subset of k columns of A (call this column subset B) to represent all n columns of A. In effect, the columns of B serve as a basis for A. Consequently, we only need to construct a k × n matrix P to express each column i of A, for i = 1, ..., n, as a linear combination of the columns of B. This intuition leads us to the Interpolative Decomposition Lemma:

Suppose A is an m × n matrix of rank k whose elements may be complex. Then there exist an m × k matrix B consisting of a subset of columns of A and a k × n matrix P such that:
1. A = BP
2. The identity matrix I_k appears in some column subset of P
3. |p_ij| ≤ 1 for all i and j

Finding the subset of k columns from a choice of n columns is NP-hard, and algorithms based on the above conditions can be expensive. But the computation of the ID is made easier [1] by relaxing the requirement |p_ij| ≤ 1 to |p_ij| ≤ 2. The B factor of the ID, as with the C and R matrices of the CUR decomposition, facilitates data analysis, and it inherits properties of the matrix A. We may ask whether the ID can be extended into the form of a two-sided ID decomposition in which a subset of the rows of B also forms a basis for the row space of A. The existence of such a decomposition is given in [8] with the Two-sided Interpolative Decomposition Theorem:

Let A be an m × n matrix and k ≤ min(m, n). Then there exists a decomposition

A = P_L [I_k; S] A_S [I_k  T] P_R + X    (2.18)

where [I_k; S] denotes the m × k matrix obtained by stacking I_k above S and [I_k  T] the k × n matrix obtained by placing I_k beside T, such that P_L and P_R are permutation matrices, S ∈ C^{(m−k)×k}, T ∈ C^{k×(n−k)}, and S, T, and X satisfy:

‖S‖_F ≤ √(k(m − k))    (2.19)

‖T‖_F ≤ √(k(n − k))    (2.20)

‖X‖_2 ≤ σ_{k+1}(A) √(1 + k(min(m, n) − k))    (2.21)

In the above formulation A_S is a k × k submatrix of A. Though we will not investigate this decomposition any further in this survey, we mention it here to point out that this CUR-like decomposition includes a residual term X that is bounded by the (k + 1)-st singular value of A. To some extent we may infer an increased difficulty of the submatrix selection problem compared with column subset selection alone in the one-sided Interpolative Decomposition.
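A one-sided ID of the kind described above is often computed from a column-pivoted QR factorization; the following sketch (assuming NumPy and SciPy) illustrates that route, though it does not enforce the |p_ij| ≤ 2 guarantee of the lemma:

```python
import numpy as np
from scipy.linalg import qr

def interpolative_decomposition(A, k):
    """Approximate A ~ B P with B = k selected columns of A (via pivoted QR)."""
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    B = A[:, piv[:k]]                          # m x k column skeleton of A
    # Coefficients expressing the remaining (pivoted) columns in terms of the first k:
    T = np.linalg.solve(R[:k, :k], R[:k, k:])  # k x (n - k)
    P = np.zeros((k, A.shape[1]))
    P[:, piv[:k]] = np.eye(k)                  # identity block on the chosen columns
    P[:, piv[k:]] = T
    return B, P

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    A = rng.standard_normal((80, 12)) @ rng.standard_normal((12, 50))
    B, P = interpolative_decomposition(A, 12)
    print("||A - B P||_F =", np.linalg.norm(A - B @ P, "fro"))
    print("max |P_ij| =", np.abs(P).max())
```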

2.6 QR Conventional Algorithm and Complexity Cost

Having described some of the more prominent decompositions in the prior sections, we turn our attention to the conventional algorithms commonly utilized to produce them. We will limit our presentation to the QR and SVD factorizations for a couple of reasons. First, these are two of the most ubiquitous factorizations in practice, and much effort has been invested over the years to enhance the performance and functionality of their algorithms. Moreover, an analysis of the techniques used in their production is sufficient to convey the limitations and inflexibilities that motivate the search for more robust algorithmic approaches.

Let us first present a method for generating the Q and R factors of A which addresses computational issues arising from the Gram-Schmidt procedure. As an alternative to QR via Gram-Schmidt, consider an orthogonal matrix product Q_1 Q_2 ... Q_n that transforms A to upper triangular form R:

(Q_n ... Q_2 Q_1) A = R

Multiplying both sides by (Q_n ... Q_2 Q_1)^{-1} yields:

(Q_n ... Q_2 Q_1)^{-1} (Q_n ... Q_2 Q_1) A = (Q_n ... Q_2 Q_1)^{-1} R
A = Q_1 Q_2 ... Q_n R

Note that a product of orthogonal matrices is also orthogonal, so allowing for column pivoting we have that

AΠ = Q_1 Q_2 ... Q_n R    (2.22)

A Householder reflection matrix is used for each Q_i, i = 1, 2, ..., n, to transform A to R column by column. More formally, the Householder matrix-vector multiplication Hx = (I − 2vv^T)x reflects a vector x across the hyperplane normal to v. The unit vector v is constructed for each Householder matrix Q_i so that the entries of column i below the diagonal of A vanish:

1. x = (a_{ii}, a_{i+1,i}, ..., a_{mi}), the subdiagonal part of column i
2. v depends upon x and the standard basis vector e_i
3. The matrix product Q_i A is applied
4. The above steps are repeated for each column of A

The impact on the QR algorithm is that Householder matrices improve numerical stability through multiplication by orthogonal matrices. The chain of Q_i's is not collapsed entirely together in a manner that would result in just one matrix-matrix multiplication operation, and it is also possible that the matrix product Q_i A is implemented as a series of matrix-vector multiplications instead. While parallelized deployments of the QR algorithm are utilized, the parallelization on a massive scale that we seek for MMDSs is not well suited to the algorithmic enhancements described above. We mention here that there also exists a variation of the QR algorithm that uses another type of orthogonal transformation, the Givens rotation, though the Householder version is more commonly used. A more detailed discussion of these QR algorithms may be found in one of the popular linear algebra textbooks such as [6]. We next turn our attention to algorithms for computing an SVD decomposition.
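Before doing so, the column-by-column Householder elimination just described might be sketched as follows (assuming NumPy; this unblocked form mirrors the matrix-vector-rich structure discussed above rather than a high-performance blocked code):

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: returns Q (m x m) and R (m x n) with A = Q R."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for i in range(min(m, n)):
        x = R[i:, i]                                   # subdiagonal part of column i
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        norm_v = np.linalg.norm(v)
        if norm_v == 0:
            continue
        v /= norm_v                                    # unit reflector
        R[i:, :] -= 2.0 * np.outer(v, v @ R[i:, :])    # apply H_i = I - 2 v v^T on the left
        Q[:, i:] -= 2.0 * np.outer(Q[:, i:] @ v, v)    # accumulate Q := Q H_i
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    A = rng.standard_normal((7, 5))
    Q, R = householder_qr(A)
    print("||A - QR|| =", np.linalg.norm(A - Q @ R))
    print("||Q^T Q - I|| =", np.linalg.norm(Q.T @ Q - np.eye(7)))
    print("R is upper triangular:", np.allclose(np.tril(R, -1), 0))
```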

2.7 SVD Deterministic Algorithm and Complexity Cost

The standard SVD algorithm is based on the work of Golub and Reinsch [3], though we will also review an alternative algorithm that uses a QR algorithm with post-processing. The SVD decomposition A = UΣV* is computed in two distinct steps.

First step: use two sequences of Householder transformations to reduce A to upper bidiagonal form:
1. B = Q_n ... Q_2 Q_1 A P_1 P_2 ... P_{n−2}
2. Therefore, we have that A = Q_1 Q_2 ... Q_n B P_{n−2} ... P_2 P_1

Second step: use two sequences of Givens rotations (orthogonal transformations) to reduce B to diagonal form Σ:
1. Σ = G_{n−1} ... G_2 G_1 B F_1 F_2 ... F_{n−1}
2. Likewise, we have that B = G_1 G_2 ... G_{n−1} Σ F_{n−1} ... F_2 F_1
3. Set U := Q_1 Q_2 ... Q_n G_1 G_2 ... G_{n−1}
4. Set V := (F_1 F_2 ... F_{n−1})* (P_1 P_2 ... P_{n−2})*

A truncated SVD yielding a rank-k approximation may be obtained by running this algorithm, though there is no savings available in the arithmetic complexity. According to [28], the first step of bidiagonal reduction (BRD) can consume at least seventy percent of the time of the SVD algorithm. In practice, BRD consists of the repeated construction of Householder reflectors and updates of the matrix using those reflectors. Two matrix-vector multiplications during each reflector construction involve the remaining subdiagonal portion of the matrix being reduced. Depending upon the implementation, if the sequence of matrix updates and matrix-vector multiplications results in frequent data transfers across the memory hierarchy, a memory bottleneck may result. The situation may be exacerbated for matrices that are larger than the available cache. It should be clear that this has implications for processing MMDSs and motivates, in part, the search for new LRA strategies.

There exists an alternative to the above standard SVD algorithm which relies on first obtaining a rank-k QR factorization. In this case we have that

A = Q_1 Q_2 ... Q_n RΠ + E    (2.23)

where E is a residual error term. The SVD algorithm described above may be applied to the product RΠ with the result

RΠ = XΣV*    (2.24)

Letting Q = Q_1 Q_2 ... Q_n and noting that the product U = QX is also orthogonal, we have the following rank-k SVD decomposition of A:

A = UΣV* + E    (2.25)

While we have used the SVD algorithm, it is only applied to a matrix of typically much smaller dimension than A.
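The QR-then-SVD route of (2.23)–(2.25) can be sketched as follows (assuming NumPy and SciPy; the truncation rank k is a user choice and the residual E is simply whatever the discarded columns leave behind):

```python
import numpy as np
from scipy.linalg import qr

def svd_via_pivoted_qr(A, k):
    """Rank-k SVD approximation via a truncated pivoted QR plus a small SVD."""
    Q, R, piv = qr(A, mode="economic", pivoting=True)
    Qk, Rk = Q[:, :k], R[:k, :]                 # keep the leading k columns/rows
    # Undo the column permutation so that A ~ Qk @ Rk_unperm.
    Rk_unperm = np.empty_like(Rk)
    Rk_unperm[:, piv] = Rk
    X, s, Vh = np.linalg.svd(Rk_unperm, full_matrices=False)   # SVD of a small k x n matrix
    U = Qk @ X                                  # product of orthonormal factors is orthonormal
    return U, s, Vh

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    A = rng.standard_normal((500, 80)) @ rng.standard_normal((80, 300))
    U, s, Vh = svd_via_pivoted_qr(A, k=80)
    err = np.linalg.norm(A - U @ np.diag(s) @ Vh) / np.linalg.norm(A)
    print("relative error of QR-based SVD:", err)
```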

3 Approximation and Probabilistic Results

In recent years a number of theoretical results have appeared in the literature concerning the approximation of a matrix by one of a smaller size in terms of rows and/or columns. Three broad strategies have been identified by which we may seek to reduce matrix size. The first of these is dimension reduction, in which the goal is to approximate a matrix by one of much smaller rank than the original matrix. A second approximation strategy is to choose a subset of columns (or rows) from the original matrix which are most representative of it, thereby preserving the salient characteristics of the original matrix in the approximation. In the third strategy, a submatrix consisting of a subset of both rows and columns of the original matrix is chosen to formulate an approximate matrix of smaller size. Each of these strategies has received considerable attention from researchers, more often than not from both the Theoretical Computer Science and the Numerical Linear Algebra perspectives. An informative and in-depth comparison of these two points of view and their cultural differences may be found in [2]. This section covers the three broad strategies and, in particular, the random multiplier matrices utilized in dimension reduction.

3.1 Dimension Reduction

It is instructive to begin the discussion of dimension reduction, as it applies to matrices, with a related topic concerning points in Euclidean space. A seminal paper by Johnson and Lindenstrauss [9] proved that, given a set of n points of dimension d, it is possible to approximately preserve the distances between any two points in O(log n)-dimensional space. Let X_1, X_2, ..., X_n ∈ R^d. Then for ε ∈ (0, 1) there exists Φ ∈ R^{k×d}, with k = O(ε^{-2} log n), such that

(1 − ε) ‖X_i − X_j‖_2 ≤ ‖ΦX_i − ΦX_j‖_2 ≤ (1 + ε) ‖X_i − X_j‖_2    (3.1)

That is, the target dimension depends only upon the number of points. The immediate consequences of this result were obvious and significant for the Nearest Neighbor problem, though not readily apparent to the NLA community. The mapping matrix Φ (the J-L transform) can be constructed as a random Gaussian k × d matrix. Alternatively, Achlioptas [10] demonstrated that a matrix of Bernoulli random {+1, −1} entries could also be used. Perhaps even more important was his finding that sparsity could be introduced into the random matrix, thereby reducing the matrix-vector multiplication complexity. In this case the values {+1, −1} are each chosen with probability 1/6 and zero otherwise.
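A minimal illustration of the Johnson-Lindenstrauss embedding (3.1) follows (assuming NumPy; the constant in the target dimension k = O(ε^{-2} log n) is arbitrary here):

```python
import numpy as np

def jl_embed(X, eps, rng):
    """Embed the rows of X (n points in R^d) into R^k, k = O(eps^-2 log n)."""
    n, d = X.shape
    k = int(np.ceil(4.0 * np.log(n) / eps**2))      # constant 4 is illustrative
    Phi = rng.standard_normal((k, d)) / np.sqrt(k)  # scaled Gaussian J-L transform
    return X @ Phi.T

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    n, d, eps = 200, 5000, 0.25
    X = rng.standard_normal((n, d))
    Y = jl_embed(X, eps, rng)

    # Check pairwise-distance distortion on a few random pairs.
    worst = 0.0
    for _ in range(1000):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
        worst = max(worst, abs(ratio - 1.0))
    print("embedded dimension k =", Y.shape[1], " worst observed distortion:", worst)
```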

Sarlos [11] utilized the Johnson-Lindenstrauss lemma to provide the first relative-error approximation in terms of the Frobenius norm for LRA in a constant number of passes over the matrix, building on Achlioptas' result. If A ∈ R^{m×n} and B is an r × n J-L transform with i.i.d. zero-mean entries in {−1, +1}, with r = Θ(k/ε + k log k) and ε ∈ (0, 1), then with probability at least 1/2 we have that

‖A − Proj_{AB^T,k}(A)‖_F ≤ (1 + ε) ‖A − A_k‖_F    (3.2)

where Proj_{AB^T,k}(A) is the best rank-k approximation of the projection of A onto the column space of AB^T. This result extended the preservation of distance metrics for vectors to the preservation of the actual matrix subspace structure under J-L transforms. Moreover, it suggests a general two-step strategy for dimension reduction. In the first step, a random subspace is created from the application of the J-L transform to the matrix A. A rank-k approximation of A is then obtained in the second step after projecting A onto the subspace generated in the first step. Thus, the randomization in step 1 for the J-L transform construction enables a new approach to SVD approximation with constant probability. We cover the important topic of random multiplier matrices given by J-L transforms in more detail in the next section. Another important implication is that to arrive at a rank-k approximation of A, we must form r > k random linear combinations of columns of A.

Sarlos' result was actually preceded in the NLA literature by Papadimitriou et al. [12], who initially proposed using random projections in Latent Semantic Indexing (LSI) applications. Briefly, LSI is concerned with information retrieval and the evaluation of the spectral properties of term-document matrices, which capture documents along one matrix dimension and the terms found in those documents along the other. Each entry in the matrix contains a count of the occurrences of the particular term in a given document. Papadimitriou's result provides a weaker additive error bound than Sarlos' in terms of a rank-2k approximation Â_{2k} of a matrix A:

‖A − Â_{2k}‖_F² ≤ ‖A − A_k‖_F² + 2ε ‖A‖_F²    (3.3)

The weakness of the additive error bound is due to the second term on the right-hand side, because ‖A‖_F can be arbitrarily large. Nonetheless, Papadimitriou provided mathematical rigor in explaining why LRA can be used to capture the salient features of term-document matrices. Eventually, a relative-error bound in the spectral norm was given by Halko et al. [1], which relies on a power iteration to obtain the following result.

Let A ∈ R^{m×n}. If B is an n × 2k Gaussian matrix and Y = (AA*)^q AB, where q is a small non-negative integer, 2k is the target rank of the approximation, and 2 ≤ k ≤ 0.5 min{m, n}, then

E ‖A − Proj_{Y,2k}(A)‖_2 ≤ [1 + 4 √(2 min(m, n) / (k − 1))]^{1/(2q+1)} ‖A − A_k‖_2    (3.4)

The power iteration factor (AA*)^q appears in Y to address any case of slow decay in the singular values of A that might otherwise negatively affect the LRA accuracy. Thus, the accuracy of the approximation can be refined by a larger choice of q. It can be shown that the SVD of (AA*)^q A preserves the left and right singular vectors of A; the singular value matrix of (AA*)^q A is Σ^{2q+1}, where Σ is the singular value matrix of A. In practice, the approach employed in [1] for most input matrices does not utilize a power iteration (i.e., q = 0). Instead, an oversampling parameter p, a small positive integer, is added to the desired rank-k value to specify the size of an n × (k + p) random multiplier matrix. The choice of value for p involves a number of factors; please see [1] for more details. An improvement to their proof was given by Woodruff [13] that realizes an actual rank-k approximation. Woodruff also refined the proof using results for bounds on the maximum and minimum singular values of Gaussian random matrices [14].
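The two-step strategy described above, together with the power iteration and oversampling of [1], might be sketched as follows (assuming NumPy; the oversampling p and power q values are illustrative tuning choices, not prescribed ones):

```python
import numpy as np

def randomized_lra(A, k, p=10, q=1, rng=None):
    """Rank-k approximation via Y = (A A^T)^q A B with B Gaussian of size n x (k + p)."""
    rng = rng or np.random.default_rng()
    m, n = A.shape
    B = rng.standard_normal((n, k + p))      # random multiplier (sketching matrix)
    Y = A @ B
    for _ in range(q):                       # power iteration sharpens spectral decay
        Y = A @ (A.T @ Y)                    # (re-orthonormalize here for extra stability)
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sampled subspace
    S = Q.T @ A                              # small (k + p) x n projected matrix
    Uh, s, Vh = np.linalg.svd(S, full_matrices=False)
    U = Q @ Uh[:, :k]
    return U, s[:k], Vh[:k, :]

if __name__ == "__main__":
    rng = np.random.default_rng(8)
    A = rng.standard_normal((1000, 40)) @ rng.standard_normal((40, 800))
    A += 1e-3 * rng.standard_normal(A.shape)
    U, s, Vh = randomized_lra(A, k=40, rng=rng)
    err = np.linalg.norm(A - U @ np.diag(s) @ Vh) / np.linalg.norm(A)
    print("relative Frobenius error:", err)
```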

From the relative-error bound for a matrix A, we have seen that given a sample of r random linear combinations of the columns of A, we may obtain a rank-k approximation of A with k < r. Perhaps a more insightful explanation is to recognize the following:
1. Multiplying A by a random vector x yields a vector y ∈ colspace(A).
2. With high probability, a set of r such y's is linearly independent.
3. A new approximate basis Â for A consisting of the y's has dimension r.
4. If A is projected onto Â, a rank-k matrix decomposition of this projection approximates the truncated rank-k SVD decomposition of A.

The most expensive aspect of the random projection approach is the multiplication of A by a random multiplier matrix. On the one hand, matrix multiplication is an embarrassingly parallel operation, but the concern of memory bottlenecks arises with MMDSs, and the type of random multiplier involved can have an adverse impact on wall-clock performance. We shall see in the next section that we may use structured random multipliers besides Gaussian matrices. Structured matrices are beneficial in reducing the number of floating point operations (FLOPs) as compared to Gaussian matrices, as well as in the amount of storage that they require.

3.2 Subspace Projections with Random Multiplier Matrices

In the prior subsection, the discussion of J-L transforms described Gaussian random matrices or matrices of Bernoulli random {+1, −1} entries. Unfortunately, matrix-vector and matrix-matrix multiplication using such dense matrices is expensive. In the case of matrix-vector multiplication with a dense matrix in R^{k×d}, O(kd) arithmetic operations are required for each vector X ∈ R^d. One alternative involves the sparsification of the J-L transform, first proposed by Achlioptas [10]. In this sparse variant of the J-L transform, each element is chosen from a probability distribution in which {+1, −1} are each chosen with probability 1/6 and zero is selected with probability 2/3. A scaling constant √(3/k) completes the definition of this J-L transform. While this approach can be effective for dense vectors, it is problematic when the vector itself is also sparse. While researchers have focused attention in recent years on J-L transforms for the specific case of sparse vectors, we concern ourselves for the remainder of this section with the general case.
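The sparse J-L transform of Achlioptas just described (entries ±1 with probability 1/6 each, zero with probability 2/3, scaled by √(3/k)) can be generated as in the following sketch (assuming NumPy; a dense representation is used for clarity only):

```python
import numpy as np

def achlioptas_transform(k, d, rng):
    """Sparse J-L multiplier: entries sqrt(3/k) * {+1, 0, -1} with prob 1/6, 2/3, 1/6."""
    entries = rng.choice([1.0, 0.0, -1.0], size=(k, d), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / k) * entries

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    d, k = 10000, 600
    Phi = achlioptas_transform(k, d, rng)
    print("fraction of nonzeros:", np.count_nonzero(Phi) / Phi.size)   # about 1/3

    x = rng.standard_normal(d)
    print("||Phi x|| / ||x|| =", np.linalg.norm(Phi @ x) / np.linalg.norm(x))  # near 1
```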

The next significant result was the introduction of the Fast Johnson-Lindenstrauss Transform (FJLT) by Ailon and Chazelle [15]. Their efforts addressed the limitations of processing sparse vectors while reducing the complexity of dense matrix-vector multiplication. They introduced a transform, a random structured matrix, defined as the product of three matrix factors in which two of the matrices are randomized and the third is the Hadamard matrix. Let the FJLT be Φ = PHD with d = 2^l, such that:

P ∈ R^{k×d} and H, D ∈ R^{d×d};
P_ij ~ N(0, q^{-1}) with probability q, and P_ij = 0 with probability 1 − q, where q = min(Θ(log² n / d), 1);
H is the d × d Hadamard matrix normalized so that its entries are ±d^{-1/2}, defined recursively for d = 2^h, h = 0, 1, ..., l;
D is diagonal with the D_ii drawn independently from {+1, −1}, each with probability 1/2.

Then we have, with probability at least 2/3, for each X_i ∈ R^d:

(1 − ε) k ‖X_i‖_2 ≤ ‖ΦX_i‖_2 ≤ (1 + ε) k ‖X_i‖_2    (3.5)

Building Φ requires O(d log d + min(dε^{-2} log n, ε^{-2} log³ n)) operations. Moreover, the complexity of matrix-vector multiplication is O(d log d + |P|), where |P| denotes the number of nonzero entries of P. The motivation for the FJLT is to have a matrix with sparsity as proposed by Achlioptas that also avoids the sparse-vector scenario while reducing the arithmetic complexity of multiplication. The H and D matrices are orthogonal; therefore, matrix-vector multiplication with them preserves vector norms and the distances between vectors. According to [15], H densifies sparse vectors while D provides enough randomization to prevent dense vectors from becoming sparse. The P matrix provides the sparsity of the transform, similarly as in [16]. It is the structure inherent in the FJLT, as given by the recursive definition of H, that provides the improved matrix-vector multiplication complexity, rather than the sparsity given in P.

Woolfe et al. [16] subsequently applied the FJLT to LRA by formulating the subsampled random Fourier transform (SRFT) in the complex case. Let the SRFT be Φ = √(n/l) DFS, such that:

D ∈ C^{n×n} is a diagonal matrix whose entries are i.i.d. random variables distributed uniformly on the unit circle;
F is the n × n Discrete Fourier Transform (DFT) matrix;
S is an n × l matrix whose columns are sampled uniformly from the columns of the n × n identity matrix.

Therefore, a random subspace may be created from an m × n matrix using the SRFT as a random multiplier. In comparison to a Gaussian multiplier, which requires nl random entries, the SRFT requires only n + l random entries: n entries for the matrix D and l for the matrix S. The SRFT matrix-matrix multiplication is performed using O(mn log l) flops. In their algorithm, the accuracy of a rank-k approximation Â_k of A ∈ C^{m×n}, for real α, β > 1 such that m ≥ l ≥ α²β(α − 1)^{-2}(2k)², holds with probability at least 1 − 3β^{-1}:

‖A − Â_k‖_2 ≤ 2(√(2α − 1) + 1) √(α max(m, n)) ‖A − A_k‖_2    (3.6)

A similar random structured matrix may be formed with the DFT matrix replaced by a Hadamard matrix and the diagonal matrix D consisting of entries randomly chosen from {+1, −1}, as in the SRFT case.
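An SRFT of the form Φ = √(n/l) DFS can be applied implicitly with the FFT, as in the sketch below (assuming NumPy; FFT normalization conventions differ across libraries, so the constant factors here are illustrative):

```python
import numpy as np

def srft_sketch(A, l, rng):
    """Return the m x l complex sketch A @ Phi with Phi built from D, F, and S."""
    m, n = A.shape
    d = np.exp(2j * np.pi * rng.random(n))          # D: random unit-modulus diagonal
    cols = rng.choice(n, size=l, replace=False)     # S: uniform column subsample
    AD = A * d                                      # apply D (broadcast over rows)
    ADF = np.fft.fft(AD, axis=1) / np.sqrt(n)       # apply a unitary DFT row-wise
    return np.sqrt(n / l) * ADF[:, cols]            # subsample and rescale

if __name__ == "__main__":
    rng = np.random.default_rng(10)
    A = rng.standard_normal((400, 30)) @ rng.standard_normal((30, 1024))
    Y = srft_sketch(A, l=60, rng=rng)               # random subspace of the range of A
    Q, _ = np.linalg.qr(Y)                          # orthonormal basis, as in Section 3.1
    err = np.linalg.norm(A - Q @ (Q.conj().T @ A)) / np.linalg.norm(A)
    print("relative error of projection onto the SRFT subspace:", err)
```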

The primary drawback of SRFTs compared to Gaussian matrices is a theoretically higher probability of failure [1]. For Gaussian matrices the probability of failure with oversampling parameter p is on the order of e^{−p}, while for SRFTs and a rank-k approximation it increases to roughly 1/k. We may ask what other types of random structured matrices can be used that are perhaps faster than the SRFT for dimension reduction. The key to answering this question lies in a change in the definitions of the input matrix A and the random Gaussian multiplier B used to generate a random subspace in the first step of dimension reduction. According to the Dual Theorem in Pan et al. [17], assume that A ∈ R^{m×n} is an average input matrix with numerical rank at most r under the Gaussian probability distribution and that B ∈ R^{n×l} has numerical rank l. If l ≥ r, then dimension reduction using B succeeds in outputting a rank-l approximation to A. This implies that a unitary matrix B, or a matrix that is both full-rank and well-conditioned (with reasonably bounded condition number), is sufficient. Recall that the condition number of a unitary matrix is equal to one. This result suggests that we can expect success with random multipliers that are formed using structured and sparse orthogonal matrices. Moreover, as concerns MMDSs, we also benefit from the use of structured multipliers by reducing the memory space needed for their storage. Therefore, the possibility exists to find more efficient random multipliers, that is, ones having a lower complexity bound than the O(mn log l) flops for matrix-matrix multiplication with the SRFT. One such possibility is to employ Hadamard and Fourier matrices that are defined up to only a few recursive levels, thus inducing a sparse and orthogonal matrix. Indeed, the numerical experiments in [17] show that random multipliers using these abridged and orthogonal Hadamard matrices in place of a full Hadamard matrix are promising. Currently, there is no formal support for this specific type of multiplier, though it should not be surprising that orthogonal matrices are effective multipliers, given that matrix-vector multiplication with them preserves vector norms as well as the distances and angles between vectors.

3.3 Approximations with Column (or Row) Subsets

We now turn our attention to the problem of identifying a suitable subset of columns (or rows) of a matrix A ∈ R^{m×n} that may optionally be processed further in some manner in order to obtain an approximation of A. We previously saw a classical result in an earlier section concerning the existence of a k-column subset C of A such that

‖A − CC⁺A‖_2 ≤ √(1 + k(n − k)) ‖A − A_k‖_2    (3.7)

The CC⁺A term represents a CX approximation to A, for X := C⁺A, in which A is projected onto the column space of C by the projection matrix CC⁺. An influential paper by Frieze et al. [19] introduced a strategy of creating sampling probabilities from the Euclidean norms of the rows and columns of a matrix, from which a subset of columns and rows is subsequently sampled. Their key theoretical finding assumes probabilities P_i for the rows A_{(i)}, i = 1, ..., m, and a constant c ≤ 1 such that

P_i ≥ c ‖A_{(i)}‖² / ‖A‖_F²    (3.8)
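Row sampling with probabilities proportional to squared Euclidean norms, as in (3.8), can be sketched as follows (assuming NumPy; the rescaling of sampled rows is the usual device for unbiasedness, and the final projection step anticipates the results discussed next):

```python
import numpy as np

def sample_rows_by_norm(A, r, rng):
    """Sample r rows of A with probability proportional to their squared norms."""
    probs = np.sum(A * A, axis=1) / np.linalg.norm(A, "fro") ** 2   # eq. (3.8) with c = 1
    idx = rng.choice(A.shape[0], size=r, replace=True, p=probs)
    # Rescale so that R^T R is an unbiased estimator of A^T A.
    R = A[idx, :] / np.sqrt(r * probs[idx])[:, None]
    return R, idx

if __name__ == "__main__":
    rng = np.random.default_rng(11)
    A = rng.standard_normal((3000, 50)) @ rng.standard_normal((50, 200))
    R, idx = sample_rows_by_norm(A, r=400, rng=rng)

    # Project A onto the subspace spanned by the sampled rows.
    _, _, Vh = np.linalg.svd(R, full_matrices=False)
    k = 50
    W = Vh[:k, :]                      # top right singular vectors of the sample
    err = np.linalg.norm(A - (A @ W.T) @ W, "fro") / np.linalg.norm(A, "fro")
    print("relative Frobenius error using sampled-row subspace:", err)
```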

Theorem 1 [19]: Let R be a sample of r rows of A chosen from the above distribution and let W be the vector space spanned by R. With probability at least 0.9 there exists an orthonormal set of vectors w^(1), w^(2), ..., w^(k) in W such that

‖A − Σ_{i=1}^{k} A w^(i) (w^(i))^T‖_F² ≤ ‖A − A_k‖_F² + (10k / (cr)) ‖A‖_F²    (3.9)

The authors applied Theorem 1 to provide an algorithm that samples a subset of columns and rows of A to form a rank-k approximation Â_k of A with additive error bound

‖A − Â_k‖_F² ≤ ‖A − A_k‖_F² + ε ‖A‖_F²    (3.10)

The additive error bound is weak in the sense that the term ‖A‖_F² on the right-hand side may be arbitrarily large. While the algorithm has polynomial time complexity in k and 1/ε, the complexity bound does not include the computation of the sampling probabilities for the rows and columns of A, and the sample complexity (number of rows) is O(k⁴). Otherwise, it is interesting to note that the running time is independent of the matrix size.

Subsequently, Deshpande et al. [20] improved upon the above result of [19] by utilizing a volume-sampling technique to generate a multiplicative error bound, which is more refined than its additive counterpart. They showed that there exists a set of k rows of A whose span contains a rank-k matrix Ã_k that is a multiplicative approximation to the best rank-k matrix approximation A_k:

‖A − Ã_k‖_F ≤ √(k + 1) ‖A − A_k‖_F    (3.11)

Moreover, they extended this result into a stronger relative-error approximation in the following theorem.

Theorem 2 [20]: In any m × n matrix A there exist O(k²/ε) rows in whose span are rows that form a rank-k matrix Ã_k, for an error parameter ε, such that

‖A − Ã_k‖_F² ≤ (1 + ε) ‖A − A_k‖_F²    (3.12)

The volume sampling method utilized in this paper relies on volume distributions constructed over the k-subsets of rows of A. The volume of a matrix B containing k rows is

vol(B) = (1/k!) √(det(BB^T))    (3.13)

Thus, a k-row subset is chosen with probability proportional to the square of its volume. The improved error bound of Deshpande over that of [19] can at least in part be attributed to a volume metric that captures information about the matrix as a whole, as opposed to the Euclidean norms associated with individual rows and columns. However, Frieze's algorithm involves only two passes over the data, while Deshpande's requires multiple passes to obtain the relative-error approximation.
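For intuition about the volume distribution (3.13), the following toy sketch (assuming NumPy; brute-force enumeration, so only feasible for very small matrices) computes the squared-volume sampling probability of every k-row subset:

```python
import numpy as np
from itertools import combinations
from math import factorial

def volume_sampling_probs(A, k):
    """Return {subset: probability} with probability proportional to vol(B)^2, eq. (3.13)."""
    weights = {}
    for subset in combinations(range(A.shape[0]), k):
        B = A[list(subset), :]
        vol_sq = np.linalg.det(B @ B.T) / factorial(k) ** 2   # vol(B)^2
        weights[subset] = max(vol_sq, 0.0)                    # guard against tiny negatives
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(12)
    A = rng.standard_normal((8, 5))
    probs = volume_sampling_probs(A, k=2)
    best = max(probs, key=probs.get)
    print("most probable 2-row subset:", best, "with probability", probs[best])
```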

Both the algorithms of Deshpande [20] and of Frieze [19] have sample complexity that is at least quadratic for CX approximation. This complexity bound was subsequently reduced by Rudelson and Vershynin [21] in an approach using the Law of Large Numbers adapted to matrices. The rationale underlying their work is that if a matrix has small numerical rank, then a low rank approximation should be available from a random submatrix. Though they obtain only an additive error approximation, it is done with an algorithm using at most two passes over the data and with O(k log k) sample complexity. Their additive error is given in the spectral norm.

Theorem 3 [21]: Suppose A is an m × n matrix of numerical rank r = ‖A‖_F² / ‖A‖_2², and let ε, δ ∈ (0, 1), c > 0, and d be an integer such that

m ≥ d ≥ c (r / ε⁴) log (r / (ε⁴ δ))    (3.14)

Let a random submatrix Ã of d rows of A be sampled according to their squared Euclidean norms and let U_k be the n × k matrix of the top k right singular vectors of Ã. Then with probability at least 1 − 2 exp(−c/δ) we have that

‖A − A U_k U_k^T‖_2 ≤ ‖A − A_k‖_2 + ε ‖A‖_2    (3.15)

Another point of interest is that the sample size d depends upon the numerical rank and not on the desired rank-k value of the approximation, as it does in [19, 20].

Finally, an algorithm that realizes the CX relative-error existential result of Deshpande et al. [20] was given by Drineas et al. [22]. However, they take a different approach from previous papers concerning the construction of sampling probabilities. Recall that the previous sampling probabilities given in the literature utilized the squared Euclidean norms of the rows (or columns) of a matrix A. Drineas introduced the idea of subspace sampling according to the squared norms of the right singular vectors of A. Their argument is that this is an improvement over the previous column (row) sampling of the matrix due to linear span considerations. Suppose that the i-th column of a matrix A is given by

A^(i) = UΣ(V^T)^(i)    (3.16)

Therefore, (V^T)^(i) is in some sense a measure of the extent to which A^(i) is contained in the span of U. The effect of Σ is eliminated, as compared to the previous probability distribution approach, because it does not affect the span of U. Consequently, the probability distributions p_i for i = 1, ..., n, otherwise known as leverage scores, associated with each column i of the best rank-k approximation A_k are given by

p_i = ‖(V_k^T)^(i)‖_2² / k    (3.17)

The theorem for the LRA algorithm follows.

Theorem 4 [22]: Suppose A is an m × n matrix containing real entries and k ≤ min(m, n) is an integer. Then there exist randomized algorithms that choose a c-column subset C of A, where c = O(k² log(1/δ) / ε²), such that for ε, δ ∈ (0, 1] we have, with probability at least 1 − δ,

‖A − CC⁺A‖_F ≤ (1 + ε) ‖A − A_k‖_F    (3.18)

A similar result holds if at most c = O(k log k · log(1/δ) / ε²) columns are chosen in expectation.
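Subspace (leverage-score) sampling in the spirit of Theorem 4 might be sketched as follows (assuming NumPy; the number of sampled columns c is passed in directly rather than set from the theorem's bound, and exact rather than approximate singular vectors are used):

```python
import numpy as np

def leverage_score_cx(A, k, c, rng):
    """CX approximation with columns sampled by rank-k leverage scores, eq. (3.17)."""
    _, _, Vh = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vh[:k, :] ** 2, axis=0) / k          # leverage scores p_i, summing to 1
    cols = rng.choice(A.shape[1], size=c, replace=True, p=lev)
    C = A[:, cols]                                    # sampled column subset
    X = np.linalg.pinv(C) @ A                         # X = C^+ A
    return C, X, cols

if __name__ == "__main__":
    rng = np.random.default_rng(13)
    A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))
    A += 0.01 * rng.standard_normal(A.shape)
    k = 20
    C, X, cols = leverage_score_cx(A, k, c=80, rng=rng)

    _, s, _ = np.linalg.svd(A, full_matrices=False)
    best_k_err = np.sqrt(np.sum(s[k:] ** 2))          # ||A - A_k||_F
    cx_err = np.linalg.norm(A - C @ X, "fro")
    print("CX error vs best rank-k error:", cx_err, best_k_err)
```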

The complexity bound to build the CX decomposition in this algorithm is dominated by the cost required to derive the right singular vectors of the rank-k approximation. A less expensive alternative is to use approximate leverage scores. These may be derived, according to a relative-error bound approximation in [23], with cost O(mn log k) to obtain the sampling probabilities corresponding to the top k singular vectors. An area of further research concerns the preconditioning of the matrix A by a suitable multiplier matrix such that the leverage scores of the product matrix are approximately uniform. In this case, the sampling algorithm includes a post-processing step to recover the original matrix. This strategy can be justified if the two matrix multiplications can be performed inexpensively, and it implies using orthogonal multipliers, as their inverses are readily available. A randomized algorithm that satisfies Theorem 4 is presented in Section 4.

3.4 Approximations with Column-Row Subset Combinations

We now extend the discussion of the column (row) subset approximation results of the prior section to approximations using column and row subsets simultaneously. Such approximations in the TCS community are commonly referred to as CUR decompositions, while among NLA researchers the terms CGR and matrix skeleton are also used. An investigation concerning the adaptation of the sampling approach of the last section to CUR can be found in Drineas et al. [24]. Their algorithms are devised with the goal of accommodating out-of-core massive data sets, so that only O(cm + nr) RAM is required to obtain a CUR for A ∈ R^{m×n}, C ∈ R^{m×c}, and R ∈ R^{r×n}. Moreover, at most three passes over the matrix A are required. Their linear-time algorithm, as with the column subset algorithms of the prior section, relies on sampling probabilities. These probabilities are proportional to the squared Euclidean norms p_i and q_j of the rows and columns, respectively, of A. Thus, C is computed by sampling c columns of A according to the column probabilities q_j. Likewise, R is constructed from r sampled rows of A according to the row probabilities p_i. The column and row subsets are scaled, and the matrix U is derived from additional processing of C and R. More formally, we have the following sampling probabilities:

p_i = ‖A_{(i)}‖² / ‖A‖_F²,  i = 1, ..., m    (3.19)

q_j = ‖A^{(j)}‖² / ‖A‖_F²,  j = 1, ..., n    (3.20)

The algorithm has cost complexity O(max(m, n)) and requires one pass over A to compute the probabilities and a second pass in which the matrices C and R are simultaneously obtained. Additive error bounds in expectation are given as follows for 1 ≤ k ≤ min(c, r), provided that c ≥ 64k/ε⁴ and r ≥ k/ε²:

E[‖A − Â_k‖_F] ≤ ‖A − A_k‖_F + ε ‖A‖_F    (3.21)

E[‖A − Â_k‖_2] ≤ ‖A − A_k‖_2 + ε ‖A‖_F    (3.22)

Though the algorithm is polynomial in k and 1/ε, it is linear in the input matrix dimensions. It is assumed that the desired rank k is much smaller than min(m, n) and that c and r are sufficiently small to be considered constants.
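The linear-time CUR construction just described might be sketched as follows (assuming NumPy; taking U as a pseudoinverse of the scaled intersection block is one common convention, consistent with the skeleton decomposition of Section 2.4, and the sample sizes are illustrative):

```python
import numpy as np

def simple_cur(A, c, r, rng):
    """CUR sketch: columns/rows sampled by the squared-norm probabilities (3.19)-(3.20)."""
    fro2 = np.linalg.norm(A, "fro") ** 2
    q = np.sum(A * A, axis=0) / fro2                 # column probabilities q_j
    p = np.sum(A * A, axis=1) / fro2                 # row probabilities p_i

    jc = rng.choice(A.shape[1], size=c, replace=True, p=q)
    ir = rng.choice(A.shape[0], size=r, replace=True, p=p)

    C = A[:, jc] / np.sqrt(c * q[jc])[None, :]       # scaled column subset
    R = A[ir, :] / np.sqrt(r * p[ir])[:, None]       # scaled row subset
    W = C[ir, :] / np.sqrt(r * p[ir])[:, None]       # scaled intersection block
    # Pseudoinverse with a modest cutoff so numerically-zero singular values are dropped.
    U = np.linalg.pinv(W, rcond=1e-8)
    return C, U, R

if __name__ == "__main__":
    rng = np.random.default_rng(14)
    A = rng.standard_normal((800, 30)) @ rng.standard_normal((30, 600))
    C, U, R = simple_cur(A, c=150, r=150, rng=rng)
    err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
    print("relative Frobenius error of CUR:", err)
```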

The additive error bounds for CUR were subsequently improved to relative error bounds by Drineas et al. [22] by extending the relative-error bound result for CX decompositions in the same paper. Once again, they use a modified sampling probability strategy based on the singular vectors of the input matrix. The complexity of the algorithm is bounded by the time required to compute the squared Euclidean norms (or approximations thereof) of the singular vectors. A rough sketch of the algorithm is presented here; it is discussed in more detail in Section 4. The matrix C, a column subset of A, is generated by the algorithm of Theorem 4. The left singular vectors U_C of C are then used to compute probabilities q_i for the sampling of r rows from A to form R. The U matrix is the pseudoinverse of the matrix W ∈ R^{r×c} that is the intersection of R with C. Both R and W are scaled similarly to C. The row probabilities are

q_i = ‖(U_C)_{(i)}‖_2² / c,  i = 1, ..., m    (3.23)

The key theorem that enables the relative error bound for CUR follows.

Theorem 5 [22]: Suppose A is an m × n matrix containing real entries and C is an m × c matrix containing c columns of A obtained with the algorithm of Theorem 4. Set r = 3200 c² / ε² for ε ∈ (0, 1], and choose r rows of A (and the corresponding ones in C) as in the algorithm described immediately above. Then with probability at least 0.7 we have that

‖A − CUR‖_F ≤ (1 + ε) ‖A − CC⁺A‖_F    (3.24)

A similar result holds if at most r = O(c log c / ε²) rows are chosen in expectation. Please see [22] for further details. Theorems 4 and 5 can now be combined to obtain the final CUR relative error bound, as given in Section 5.2 of [22], where ε_p = 3ε:

‖A − CUR‖_F ≤ (1 + ε) ‖A − CC⁺A‖_F ≤ (1 + ε)² ‖A − A_k‖_F    (3.25)

(1 + ε)² ‖A − A_k‖_F ≤ (1 + ε_p) ‖A − A_k‖_F    (3.26)

According to [22], sampling according to the probabilities q_i defined above is done so that R may contain rows that capture a similar subspace to that spanned by the c left singular vectors of C. The running time of the algorithm, omitting the CX decomposition algorithm of Theorem 4 used to obtain C, is O(mn). The requirements of O(k log k / ε²) columns and O(c log c / ε²) rows for a rank-k CUR approximation were subsequently lowered in a paper by Boutsidis and Woodruff [25]. Though [25] is conceptually similar to [22] in its overall approach, Boutsidis and Woodruff employ approximation and sparsification results from the literature to obtain an algorithm with running time O(nnz(A)), where nnz(A) is the number of nonzero elements of A. Furthermore, their randomized algorithm requires only c = O(k/ε) columns and r = O(k/ε) rows for a rank-k CUR decomposition. The U matrix is constructed such that rank(U) = k.
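The two-stage procedure behind Theorem 5 (columns by the leverage scores of A, then rows by the leverage scores (3.23) of C, with U built from the intersection block W) might be sketched as follows (assuming NumPy; the sample sizes are far below the theorem's prescriptions, scaling factors are omitted, and U is taken as a rank-k pseudoinverse of W for numerical stability):

```python
import numpy as np

def relative_error_cur(A, k, c, r, rng):
    """CUR with subspace sampling: columns via leverage scores of A, rows via C."""
    # Stage 1: sample c columns of A by its rank-k (right singular vector) leverage scores.
    _, _, Vh = np.linalg.svd(A, full_matrices=False)
    p_cols = np.sum(Vh[:k, :] ** 2, axis=0) / k
    jc = rng.choice(A.shape[1], size=c, replace=True, p=p_cols)
    C = A[:, jc]

    # Stage 2: sample r rows by the leverage scores of C's left singular vectors.
    Uc, _, _ = np.linalg.svd(C, full_matrices=False)
    q_rows = np.sum(Uc ** 2, axis=1) / Uc.shape[1]      # eq. (3.23)
    ir = rng.choice(A.shape[0], size=r, replace=True, p=q_rows)
    R = A[ir, :]

    W = C[ir, :]                                        # intersection of R with C
    # Rank-k pseudoinverse of W, so that rank(U) = k and tiny singular values are ignored.
    Uw, sw, Vwh = np.linalg.svd(W, full_matrices=False)
    U = Vwh[:k, :].T @ np.diag(1.0 / sw[:k]) @ Uw[:, :k].T
    return C, U, R

if __name__ == "__main__":
    rng = np.random.default_rng(15)
    A = rng.standard_normal((600, 25)) @ rng.standard_normal((25, 400))
    A += 0.01 * rng.standard_normal(A.shape)
    k = 25
    C, U, R = relative_error_cur(A, k, c=100, r=100, rng=rng)
    _, s, _ = np.linalg.svd(A, full_matrices=False)
    print("CUR error:", np.linalg.norm(A - C @ U @ R, "fro"),
          " best rank-k error:", np.sqrt(np.sum(s[k:] ** 2)))
```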

A fast randomized algorithm for overdetermined linear least-squares regression

A fast randomized algorithm for overdetermined linear least-squares regression A fast randomized algorithm for overdetermined linear least-squares regression Vladimir Rokhlin and Mark Tygert Technical Report YALEU/DCS/TR-1403 April 28, 2008 Abstract We introduce a randomized algorithm

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

Random Methods for Linear Algebra

Random Methods for Linear Algebra Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform

More information

Randomized Numerical Linear Algebra: Review and Progresses

Randomized Numerical Linear Algebra: Review and Progresses ized ized SVD ized : Review and Progresses Zhihua Department of Computer Science and Engineering Shanghai Jiao Tong University The 12th China Workshop on Machine Learning and Applications Xi an, November

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Randomized algorithms for the approximation of matrices

Randomized algorithms for the approximation of matrices Randomized algorithms for the approximation of matrices Luis Rademacher The Ohio State University Computer Science and Engineering (joint work with Amit Deshpande, Santosh Vempala, Grant Wang) Two topics

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 22 1 / 21 Overview

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit II: Numerical Linear Algebra Lecturer: Dr. David Knezevic Unit II: Numerical Linear Algebra Chapter II.3: QR Factorization, SVD 2 / 66 QR Factorization 3 / 66 QR Factorization

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

A fast randomized algorithm for the approximation of matrices preliminary report

A fast randomized algorithm for the approximation of matrices preliminary report DRAFT A fast randomized algorithm for the approximation of matrices preliminary report Yale Department of Computer Science Technical Report #1380 Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark

More information

MANY scientific computations, signal processing, data analysis and machine learning applications lead to large dimensional

MANY scientific computations, signal processing, data analysis and machine learning applications lead to large dimensional Low rank approximation and decomposition of large matrices using error correcting codes Shashanka Ubaru, Arya Mazumdar, and Yousef Saad 1 arxiv:1512.09156v3 [cs.it] 15 Jun 2017 Abstract Low rank approximation

More information

Subset Selection. Deterministic vs. Randomized. Ilse Ipsen. North Carolina State University. Joint work with: Stan Eisenstat, Yale

Subset Selection. Deterministic vs. Randomized. Ilse Ipsen. North Carolina State University. Joint work with: Stan Eisenstat, Yale Subset Selection Deterministic vs. Randomized Ilse Ipsen North Carolina State University Joint work with: Stan Eisenstat, Yale Mary Beth Broadbent, Martin Brown, Kevin Penner Subset Selection Given: real

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 9 1. qr and complete orthogonal factorization poor man s svd can solve many problems on the svd list using either of these factorizations but they

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

Math 407: Linear Optimization

Math 407: Linear Optimization Math 407: Linear Optimization Lecture 16: The Linear Least Squares Problem II Math Dept, University of Washington February 28, 2018 Lecture 16: The Linear Least Squares Problem II (Math Dept, University

More information

Randomized algorithms for the low-rank approximation of matrices

Randomized algorithms for the low-rank approximation of matrices Randomized algorithms for the low-rank approximation of matrices Yale Dept. of Computer Science Technical Report 1388 Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert

More information

Lecture 5: Randomized methods for low-rank approximation

Lecture 5: Randomized methods for low-rank approximation CBMS Conference on Fast Direct Solvers Dartmouth College June 23 June 27, 2014 Lecture 5: Randomized methods for low-rank approximation Gunnar Martinsson The University of Colorado at Boulder Research

More information

Randomized algorithms for matrix computations and analysis of high dimensional data

Randomized algorithms for matrix computations and analysis of high dimensional data PCMI Summer Session, The Mathematics of Data Midway, Utah July, 2016 Randomized algorithms for matrix computations and analysis of high dimensional data Gunnar Martinsson The University of Colorado at

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Numerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??

Numerical Methods. Elena loli Piccolomini. Civil Engeneering.  piccolom. Metodi Numerici M p. 1/?? Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Notes on Householder QR Factorization

Notes on Householder QR Factorization Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem

More information

SPARSE signal representations have gained popularity in recent

SPARSE signal representations have gained popularity in recent 6958 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Blind Compressed Sensing Sivan Gleichman and Yonina C. Eldar, Senior Member, IEEE Abstract The fundamental principle underlying

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b AM 205: lecture 7 Last time: LU factorization Today s lecture: Cholesky factorization, timing, QR factorization Reminder: assignment 1 due at 5 PM on Friday September 22 LU Factorization LU factorization

More information

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix Definition: Let L : V 1 V 2 be a linear operator. The null space N (L) of L is the subspace of V 1 defined by N (L) = {x

More information

Lecture 12: Randomized Least-squares Approximation in Practice, Cont. 12 Randomized Least-squares Approximation in Practice, Cont.

Lecture 12: Randomized Least-squares Approximation in Practice, Cont. 12 Randomized Least-squares Approximation in Practice, Cont. Stat60/CS94: Randomized Algorithms for Matrices and Data Lecture 1-10/14/013 Lecture 1: Randomized Least-squares Approximation in Practice, Cont. Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning:

More information

Rank minimization via the γ 2 norm

Rank minimization via the γ 2 norm Rank minimization via the γ 2 norm Troy Lee Columbia University Adi Shraibman Weizmann Institute Rank Minimization Problem Consider the following problem min X rank(x) A i, X b i for i = 1,..., k Arises

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Fast Dimension Reduction

Fast Dimension Reduction Fast Dimension Reduction MMDS 2008 Nir Ailon Google Research NY Fast Dimension Reduction Using Rademacher Series on Dual BCH Codes (with Edo Liberty) The Fast Johnson Lindenstrauss Transform (with Bernard

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

SUBSET SELECTION ALGORITHMS: RANDOMIZED VS. DETERMINISTIC

SUBSET SELECTION ALGORITHMS: RANDOMIZED VS. DETERMINISTIC SUBSET SELECTION ALGORITHMS: RANDOMIZED VS. DETERMINISTIC MARY E. BROADBENT, MARTIN BROWN, AND KEVIN PENNER Advisors: Ilse Ipsen 1 and Rizwana Rehman 2 Abstract. Subset selection is a method for selecting

More information

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

Applied Numerical Linear Algebra. Lecture 8

Applied Numerical Linear Algebra. Lecture 8 Applied Numerical Linear Algebra. Lecture 8 1/ 45 Perturbation Theory for the Least Squares Problem When A is not square, we define its condition number with respect to the 2-norm to be k 2 (A) σ max (A)/σ

More information

A Randomized Algorithm for the Approximation of Matrices

A Randomized Algorithm for the Approximation of Matrices A Randomized Algorithm for the Approximation of Matrices Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert Technical Report YALEU/DCS/TR-36 June 29, 2006 Abstract Given an m n matrix A and a positive

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 13 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 208 LECTURE 3 need for pivoting we saw that under proper circumstances, we can write A LU where 0 0 0 u u 2 u n l 2 0 0 0 u 22 u 2n L l 3 l 32, U 0 0 0 l n l

More information

Linear Analysis Lecture 16

Linear Analysis Lecture 16 Linear Analysis Lecture 16 The QR Factorization Recall the Gram-Schmidt orthogonalization process. Let V be an inner product space, and suppose a 1,..., a n V are linearly independent. Define q 1,...,

More information

Orthogonalization and least squares methods

Orthogonalization and least squares methods Chapter 3 Orthogonalization and least squares methods 31 QR-factorization (QR-decomposition) 311 Householder transformation Definition 311 A complex m n-matrix R = [r ij is called an upper (lower) triangular

More information

A randomized algorithm for approximating the SVD of a matrix

A randomized algorithm for approximating the SVD of a matrix A randomized algorithm for approximating the SVD of a matrix Joint work with Per-Gunnar Martinsson (U. of Colorado) and Vladimir Rokhlin (Yale) Mark Tygert Program in Applied Mathematics Yale University

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Subset Selection. Ilse Ipsen. North Carolina State University, USA

Subset Selection. Ilse Ipsen. North Carolina State University, USA Subset Selection Ilse Ipsen North Carolina State University, USA Subset Selection Given: real or complex matrix A integer k Determine permutation matrix P so that AP = ( A 1 }{{} k A 2 ) Important columns

More information

Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc

More information

arxiv: v1 [math.na] 29 Dec 2014

arxiv: v1 [math.na] 29 Dec 2014 A CUR Factorization Algorithm based on the Interpolative Decomposition Sergey Voronin and Per-Gunnar Martinsson arxiv:1412.8447v1 [math.na] 29 Dec 214 December 3, 214 Abstract An algorithm for the efficient

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Lecture 4 Orthonormal vectors and QR factorization

Lecture 4 Orthonormal vectors and QR factorization Orthonormal vectors and QR factorization 4 1 Lecture 4 Orthonormal vectors and QR factorization EE263 Autumn 2004 orthonormal vectors Gram-Schmidt procedure, QR factorization orthogonal decomposition induced

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)

sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank

More information

5.6. PSEUDOINVERSES 101. A H w.

5.6. PSEUDOINVERSES 101. A H w. 5.6. PSEUDOINVERSES 0 Corollary 5.6.4. If A is a matrix such that A H A is invertible, then the least-squares solution to Av = w is v = A H A ) A H w. The matrix A H A ) A H is the left inverse of A and

More information

Lecture 18 Nov 3rd, 2015

Lecture 18 Nov 3rd, 2015 CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 18 Nov 3rd, 2015 Scribe: Jefferson Lee 1 Overview Low-rank approximation, Compression Sensing 2 Last Time We looked at three different

More information

Stability of the Gram-Schmidt process

Stability of the Gram-Schmidt process Stability of the Gram-Schmidt process Orthogonal projection We learned in multivariable calculus (or physics or elementary linear algebra) that if q is a unit vector and v is any vector then the orthogonal

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

The Singular Value Decomposition

The Singular Value Decomposition The Singular Value Decomposition An Important topic in NLA Radu Tiberiu Trîmbiţaş Babeş-Bolyai University February 23, 2009 Radu Tiberiu Trîmbiţaş ( Babeş-Bolyai University)The Singular Value Decomposition

More information

Block Bidiagonal Decomposition and Least Squares Problems

Block Bidiagonal Decomposition and Least Squares Problems Block Bidiagonal Decomposition and Least Squares Problems Åke Björck Department of Mathematics Linköping University Perspectives in Numerical Analysis, Helsinki, May 27 29, 2008 Outline Bidiagonal Decomposition

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Fast Random Projections

Fast Random Projections Fast Random Projections Edo Liberty 1 September 18, 2007 1 Yale University, New Haven CT, supported by AFOSR and NGA (www.edoliberty.com) Advised by Steven Zucker. About This talk will survey a few random

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination

Math 102, Winter Final Exam Review. Chapter 1. Matrices and Gaussian Elimination Math 0, Winter 07 Final Exam Review Chapter. Matrices and Gaussian Elimination { x + x =,. Different forms of a system of linear equations. Example: The x + 4x = 4. [ ] [ ] [ ] vector form (or the column

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Technical Report No September 2009

Technical Report No September 2009 FINDING STRUCTURE WITH RANDOMNESS: STOCHASTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS N. HALKO, P. G. MARTINSSON, AND J. A. TROPP Technical Report No. 2009-05 September 2009 APPLIED

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

ANSWERS (5 points) Let A be a 2 2 matrix such that A =. Compute A. 2

ANSWERS (5 points) Let A be a 2 2 matrix such that A =. Compute A. 2 MATH 7- Final Exam Sample Problems Spring 7 ANSWERS ) ) ). 5 points) Let A be a matrix such that A =. Compute A. ) A = A ) = ) = ). 5 points) State ) the definition of norm, ) the Cauchy-Schwartz inequality

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

Faster Johnson-Lindenstrauss style reductions

Faster Johnson-Lindenstrauss style reductions Faster Johnson-Lindenstrauss style reductions Aditya Menon August 23, 2007 Outline 1 Introduction Dimensionality reduction The Johnson-Lindenstrauss Lemma Speeding up computation 2 The Fast Johnson-Lindenstrauss

More information

Singular Value Decompsition

Singular Value Decompsition Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Orthonormal Transformations and Least Squares

Orthonormal Transformations and Least Squares Orthonormal Transformations and Least Squares Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University of Oslo October 30, 2009 Applications of Qx with Q T Q = I 1. solving

More information

Spectrum-Revealing Matrix Factorizations Theory and Algorithms

Spectrum-Revealing Matrix Factorizations Theory and Algorithms Spectrum-Revealing Matrix Factorizations Theory and Algorithms Ming Gu Department of Mathematics University of California, Berkeley April 5, 2016 Joint work with D. Anderson, J. Deursch, C. Melgaard, J.

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Mathematical Optimisation, Chpt 2: Linear Equations and inequalities

Mathematical Optimisation, Chpt 2: Linear Equations and inequalities Mathematical Optimisation, Chpt 2: Linear Equations and inequalities Peter J.C. Dickinson p.j.c.dickinson@utwente.nl http://dickinson.website version: 12/02/18 Monday 5th February 2018 Peter J.C. Dickinson

More information

Chapter XII: Data Pre and Post Processing

Chapter XII: Data Pre and Post Processing Chapter XII: Data Pre and Post Processing Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 XII.1 4-1 Chapter XII: Data Pre and Post Processing 1. Data

More information

Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving

Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving. 22 Element-wise Sampling of Graphs and Linear Equation Solving Stat260/CS294: Randomized Algorithms for Matrices and Data Lecture 24-12/02/2013 Lecture 24: Element-wise Sampling of Graphs and Linear Equation Solving Lecturer: Michael Mahoney Scribe: Michael Mahoney

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

1 The linear algebra of linear programs (March 15 and 22, 2015)

1 The linear algebra of linear programs (March 15 and 22, 2015) 1 The linear algebra of linear programs (March 15 and 22, 2015) Many optimization problems can be formulated as linear programs. The main features of a linear program are the following: Variables are real

More information

7. Dimension and Structure.

7. Dimension and Structure. 7. Dimension and Structure 7.1. Basis and Dimension Bases for Subspaces Example 2 The standard unit vectors e 1, e 2,, e n are linearly independent, for if we write (2) in component form, then we obtain

More information

Least Squares. Tom Lyche. October 26, Centre of Mathematics for Applications, Department of Informatics, University of Oslo

Least Squares. Tom Lyche. October 26, Centre of Mathematics for Applications, Department of Informatics, University of Oslo Least Squares Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University of Oslo October 26, 2010 Linear system Linear system Ax = b, A C m,n, b C m, x C n. under-determined

More information

AMS 209, Fall 2015 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems

AMS 209, Fall 2015 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems AMS 209, Fall 205 Final Project Type A Numerical Linear Algebra: Gaussian Elimination with Pivoting for Solving Linear Systems. Overview We are interested in solving a well-defined linear system given

More information

Chapter 5 Orthogonality

Chapter 5 Orthogonality Matrix Methods for Computational Modeling and Data Analytics Virginia Tech Spring 08 Chapter 5 Orthogonality Mark Embree embree@vt.edu Ax=b version of February 08 We needonemoretoolfrom basic linear algebra

More information

Linear Systems. Carlo Tomasi

Linear Systems. Carlo Tomasi Linear Systems Carlo Tomasi Section 1 characterizes the existence and multiplicity of the solutions of a linear system in terms of the four fundamental spaces associated with the system s matrix and of

More information

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression

A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression A Fast Algorithm For Computing The A-optimal Sampling Distributions In A Big Data Linear Regression Hanxiang Peng and Fei Tan Indiana University Purdue University Indianapolis Department of Mathematical

More information

1 9/5 Matrices, vectors, and their applications

1 9/5 Matrices, vectors, and their applications 1 9/5 Matrices, vectors, and their applications Algebra: study of objects and operations on them. Linear algebra: object: matrices and vectors. operations: addition, multiplication etc. Algorithms/Geometric

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

Rank revealing factorizations, and low rank approximations

Rank revealing factorizations, and low rank approximations Rank revealing factorizations, and low rank approximations L. Grigori Inria Paris, UPMC January 2018 Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization

More information