sublinear time low-rank approximation of positive semidefinite matrices

Cameron Musco (MIT) and David P. Woodruff (CMU)
overview

Our Contributions:

A near optimal low-rank approximation for any positive semidefinite (PSD) matrix can be computed in sublinear time (i.e. without reading the full matrix).

Concrete: Significantly improves on previous, roughly linear time approaches for general matrices, and bypasses a trivial linear time lower bound for general matrices.

High Level: Demonstrates that PSD structure can be exploited in a much stronger way than previously known for low-rank approximation. Opens the possibility of further advances in algorithms for PSD matrices.
low-rank matrix approximation

Low-rank approximation is one of the most widely used methods for general matrix and data compression.

Closely related to principal component analysis, spectral embedding/clustering, and low-rank matrix completion.

Important Special Case: $A$ is positive semidefinite (PSD), i.e. $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$. Includes graph Laplacians, Gram matrices and kernel matrices, covariance matrices, and Hessians of convex functions.
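For concreteness, here is a small numpy sketch (my illustration, not from the talk) that builds matrices from three of the classes above and checks positive semidefiniteness via their eigenvalues:

```python
# Illustrative sketch: three of the PSD classes listed above, built
# explicitly and checked via their smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 5))

gram = X @ X.T                            # Gram matrix of the rows of X
cov = np.cov(X.T)                         # covariance matrix of the columns of X
kernel = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))  # RBF kernel

for name, A in [("Gram", gram), ("covariance", cov), ("RBF kernel", kernel)]:
    eigs = np.linalg.eigvalsh(A)
    print(name, "min eigenvalue:", eigs.min())  # >= 0, up to rounding error
```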
optimal low-rank approximation

An optimal low-rank approximation can be computed via the singular value decomposition (SVD):

$$A_k = \operatorname*{argmin}_{B:\,\operatorname{rank}(B)=k} \|A - B\|_F, \qquad \text{where } \|A - B\|_F = \sqrt{\sum_{i,j} (A_{ij} - B_{ij})^2}.$$

Unfortunately, computing the SVD takes $O(nd^2)$ time.
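The optimum is easy to compute (if slowly) with a dense SVD. A minimal sketch of $A_k$, with a check that it beats an arbitrary rank-$k$ competitor:

```python
# Minimal sketch: the truncated SVD gives the optimal rank-k approximation,
# so it should never lose to a random matrix of the same rank.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 60, 5
A = rng.standard_normal((n, 10)) @ rng.standard_normal((10, d))   # roughly low-rank input

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]      # truncated SVD = argmin over rank-k B

B = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))     # arbitrary rank-k matrix
print(np.linalg.norm(A - A_k, "fro") <= np.linalg.norm(A - B, "fro"))  # True
```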
input sparsity time low-rank approximation

Recent work on matrix sketching gives state-of-the-art runtimes.

Theorem (Clarkson, Woodruff '13). There is an algorithm which in $O(\operatorname{nnz}(A) + n \cdot \operatorname{poly}(k, 1/\epsilon))$ time outputs $N \in \mathbb{R}^{n \times k}$, $M \in \mathbb{R}^{d \times k}$ satisfying with probability 99/100:

$$\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F.$$

When $k, 1/\epsilon$ are not too large, the runtime is linear in the input size. This is the best known runtime for both general and PSD matrices.
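As a rough illustration of the sketch-and-solve idea, here is a generic Gaussian-sketch version of my own; it is not Clarkson-Woodruff's algorithm, which uses sparse subspace embeddings to achieve the $\operatorname{nnz}(A)$ term:

```python
# Simplified sketch-and-solve low-rank approximation: compress A with a
# random sketch, then solve the small problem exactly.
import numpy as np

def sketched_low_rank(A, k, oversample=10, rng=np.random.default_rng(2)):
    n, d = A.shape
    r = k + oversample
    Omega = rng.standard_normal((d, r))
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for a sketched range of A
    W = Q.T @ A                           # small r x d problem
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    N = Q @ (U[:, :k] * s[:k])            # n x k factor
    M = Vt[:k, :].T                       # d x k factor; N @ M.T approximates A
    return N, M

A = np.random.default_rng(3).standard_normal((200, 150))
N, M = sketched_low_rank(A, k=10)
print(np.linalg.norm(A - N @ M.T, "fro"))
```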
sublinear time low-rank approximation

Theorem (Main Result, Musco, Woodruff '17). There is an algorithm running in $\tilde{O}(nk^2/\epsilon^4)$ time which, given PSD $A$, outputs $N, M \in \mathbb{R}^{n \times k}$ satisfying with probability 99/100:

$$\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F.$$

Compare to CW '13, which takes $O(\operatorname{nnz}(A)) + n \cdot \operatorname{poly}(k, 1/\epsilon)$ time. If $k, 1/\epsilon$ are not too large compared to $\operatorname{nnz}(A)$, our runtime is significantly sublinear in the size of $A$.
lower bound for general matrices

For general matrices, $\Omega(\operatorname{nnz}(A))$ time is required.

Randomly place a single entry which dominates $A$'s Frobenius norm. Finding it with constant probability requires reading at least a constant fraction of the non-zero entries in $A$.

The lower bound holds for any approximation factor $\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F$, and even rules out $o(\operatorname{nnz}(A))$ time for weaker guarantees such as

$$\|A - NM^T\|_F \le \|A - A_k\|_F + \epsilon\|A\|_F.$$
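A toy version of this hard instance (my illustration; the formal lower bound is an information-theoretic argument, not this experiment):

```python
# One hidden entry dominates the Frobenius norm, so any good rank-1
# approximation must essentially locate it.
import numpy as np

rng = np.random.default_rng(11)
n = 100
A = rng.standard_normal((n, n))
i, j = rng.integers(n, size=2)
A[i, j] = 100 * np.linalg.norm(A, "fro")  # hidden heavy entry

U, s, Vt = np.linalg.svd(A)
A_1 = s[0] * np.outer(U[:, 0], Vt[0, :])  # best rank-1 approximation
print(np.abs(A_1).argmax() == np.abs(A).argmax())  # top rank-1 term pinpoints (i, j)
```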
what about for psd matrices?

Observation: For PSD $A$, we have for any entry $a_{ij}$:

$$a_{ij} \le \max(a_{ii}, a_{jj}),$$

since otherwise $(e_i - e_j)^T A (e_i - e_j) < 0$.

So we can find any hidden heavy entry by looking at its corresponding diagonal entries.

Question: How can we exploit additional structure arising from positive semidefiniteness to achieve sublinear runtime?
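A quick numerical check of the observation on a random PSD matrix:

```python
# For PSD A, every off-diagonal entry is bounded by the larger of its two
# diagonal entries, so a "hidden" heavy entry announces itself on the diagonal.
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((30, 30))
A = B.T @ B                               # random PSD matrix

i, j = np.unravel_index(np.abs(A).argmax(), A.shape)
print(A[i, j] <= max(A[i, i], A[j, j]))   # True for every entry, including the largest
```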
every psd matrix is a gram matrix

Very Simple Fact: Every PSD matrix $A \in \mathbb{R}^{n \times n}$ can be written as $B^T B$ for some $B \in \mathbb{R}^{n \times n}$.

$B$ can be any matrix square root of $A$, e.g. if we let $V \Lambda V^T$ be the eigendecomposition of $A$, we can set $B = \Lambda^{1/2} V^T$.

Letting $b_1, \ldots, b_n$ be the columns of $B$, the entries of $A$ contain every pairwise dot product: $a_{ij} = b_i^T b_j$.
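A short sketch verifying this fact numerically, constructing $B = \Lambda^{1/2} V^T$ from the eigendecomposition:

```python
# Build a matrix square root B of PSD A and confirm A = B^T B, a_ij = b_i^T b_j.
import numpy as np

rng = np.random.default_rng(5)
C = rng.standard_normal((20, 20))
A = C.T @ C                               # PSD input

lam, V = np.linalg.eigh(A)                # A = V diag(lam) V^T with lam >= 0
lam = np.clip(lam, 0, None)               # guard against tiny negative rounding
B = np.sqrt(lam)[:, None] * V.T           # B = Lambda^{1/2} V^T

print(np.allclose(B.T @ B, A))            # True: A is the Gram matrix of B's columns
i, j = 3, 7
print(np.isclose(A[i, j], B[:, i] @ B[:, j]))  # a_ij = b_i^T b_j
```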
every psd matrix is a gram matrix

The fact that $A$ is a Gram matrix places a variety of geometric constraints on its entries. The heavy diagonal observation is just one example. By Cauchy-Schwarz:

$$a_{ij} = b_i^T b_j \le \sqrt{(b_i^T b_i)(b_j^T b_j)} = \sqrt{a_{ii} a_{jj}} \le \max(a_{ii}, a_{jj}).$$

Another View: $A$ contains a lot of information about the column span of $B$ in a very compressed form, with every pairwise dot product stored as $a_{ij}$.
factor matrix low-rank approximation

Question: Can we compute a low-rank approximation of $B$ using $o(n^2)$ column dot products, i.e. $o(n^2)$ accesses to $A$?

Why? $B$ has the same (right) singular vectors as $A$, and its singular values are closely related: $\sigma_i(B) = \sqrt{\sigma_i(A)}$.

So the top $k$ singular vectors are the same for the two matrices. An optimal low-rank approximation for $B$ thus gives an optimal low-rank approximation for $A$.

Things will be messier once we introduce approximation, but this simple idea will lead to a sublinear time algorithm for $A$.
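A quick check of the singular value relation $\sigma_i(B) = \sqrt{\sigma_i(A)}$:

```python
# Numerical check that the singular values of A = B^T B are the squares of
# those of B (and hence the two matrices share top singular vectors).
import numpy as np

rng = np.random.default_rng(10)
B = rng.standard_normal((25, 25))
A = B.T @ B

sB = np.linalg.svd(B, compute_uv=False)
sA = np.linalg.svd(A, compute_uv=False)
print(np.allclose(sB, np.sqrt(sA)))       # True
```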
low-rank approximation via adaptive sampling

Theorem (Deshpande, Vempala '06). For any $B \in \mathbb{R}^{n \times n}$, there exists a subset of $\tilde{O}(k^2/\epsilon)$ columns whose span contains $Z \in \mathbb{R}^{n \times k}$ satisfying:

$$\|B - ZZ^T B\|_F \le (1+\epsilon)\|B - B_k\|_F.$$

Adaptive Sampling:
- Initially, start with an empty column subset $S := \{\}$.
- For $t = 1, \ldots, \tilde{O}(k^2/\epsilon)$:
  - Let $P_S$ be the projection onto the columns in $S$.
  - Add $b_i$ to $S$ with probability $\frac{\|b_i - P_S b_i\|^2}{\sum_{i=1}^n \|b_i - P_S b_i\|^2}$.
adaptive sampling

When $S$ is empty, $P_S = 0$, so in the first round each column is sampled with probability

$$\frac{\|b_i - P_S b_i\|^2}{\sum_{i=1}^n \|b_i - P_S b_i\|^2} = \frac{\|b_i\|^2}{\sum_{i=1}^n \|b_i\|^2} = \frac{a_{ii}}{\operatorname{tr}(A)},$$

which can be read directly off the diagonal of $A$. In later rounds, the residual norms $\|b_i - P_S b_i\|^2$ are again dot products of columns of $B$, so they too can be computed from entries of $A$ alone (see the sketch below).
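Here is a sketch of how such a round might be driven purely by accesses to $A$ (my reconstruction using the standard projection identity, not necessarily the paper's exact procedure). With sampled index set $T$, $\|b_i - P_S b_i\|^2 = a_{ii} - a_{T,i}^T (A_{T,T})^+ a_{T,i}$, so each round reads only the diagonal plus the sampled rows of $A$:

```python
# Adaptive sampling for the factor B, driven purely by entries of A = B^T B.
import numpy as np

def adaptive_sample_via_A(A, rounds, rng=np.random.default_rng(6)):
    n = A.shape[0]
    T = []                                    # indices of selected columns of B
    for _ in range(rounds):
        if T:
            pinv = np.linalg.pinv(A[np.ix_(T, T)])
            # ||P_S b_i||^2 = a_{T,i}^T (A_{T,T})^+ a_{T,i} for every i
            proj = np.einsum("ij,jk,ki->i", A[T, :].T, pinv, A[T, :])
            resid = np.clip(np.diag(A) - proj, 0, None)
        else:
            resid = np.diag(A).copy()         # first round: a_ii / tr(A)
        T.append(rng.choice(n, p=resid / resid.sum()))
    return T

C = np.random.default_rng(7).standard_normal((40, 40))
A = C.T @ C
print(adaptive_sample_via_A(A, rounds=5))
```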
sublinear dot product algorithm

Theorem (Factor Matrix Low-Rank Approximation). There is an algorithm using $\tilde{O}(nk^2/\epsilon)$ accesses to $A = B^T B$ which computes $Z \in \mathbb{R}^{n \times k}$ satisfying with probability 99/100:

$$\|B - ZZ^T B\|_F \le (1+\epsilon)\|B - B_k\|_F.$$

How does this translate to low-rank approximation of $A$ itself?
boosting to a psd matrix approximation

Lemma. If $\|B - ZZ^T B\|_F^2 \le \left(1 + \frac{\epsilon^{3/2}}{\sqrt{n}}\right)\|B - B_k\|_F^2$, then for $A = B^T B$:

$$\|A - B^T ZZ^T B\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2.$$

Since $Z$ lies in the span of sampled columns of $B$, the approximation $B^T ZZ^T B$ can be written in the form $ASCS^T A^T$, where $S$ is the column sampling matrix and $C$ is a small coefficient matrix, so it can be computed from $A$ alone.

This gives a low-rank approximation algorithm which accesses just $\tilde{O}\left(\frac{nk^2}{\epsilon^{3/2}/\sqrt{n}}\right) = n^{3/2} \cdot \operatorname{poly}(k, 1/\epsilon)$ entries of $A$.

Our best algorithm accesses just $\tilde{O}(nk/\epsilon^{2.5})$ entries of $A$ and runs in $\tilde{O}(nk^2/\epsilon^4)$ time.
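One concrete instance of the $ASCS^T A^T$ form is the classical Nyström approximation, which takes $C = (S^T A S)^+$; a small demonstration (illustrative only, the paper chooses $C$ differently via the factor-matrix approximation):

```python
# Nystrom approximation: A S C S^T A^T with C = (S^T A S)^+ for a column
# sampling matrix S. Exact here, since the sampled columns span A's range.
import numpy as np

rng = np.random.default_rng(8)
F = rng.standard_normal((100, 15))
A = F @ F.T                                # PSD, rank 15

idx = rng.choice(100, size=30, replace=False)   # S selects these columns
AS = A[:, idx]
core = np.linalg.pinv(A[np.ix_(idx, idx)])      # small core matrix C = (S^T A S)^+
A_approx = AS @ core @ AS.T                     # = A S C S^T A^T, since A is symmetric

print(np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A, "fro"))
```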
limitations of column sampling

Recall that our algorithm accesses the diagonal of $A$ along with $\tilde{O}(k^2 \sqrt{n})$ columns.

If we take fewer columns, we can miss a $\sqrt{n} \times \sqrt{n}$ block which contains a constant fraction of $A$'s Frobenius norm.
column and row sampling

Solution: Sample both rows and columns of $A$.

Instead of adaptive sampling we use ridge leverage scores, which can also be computed using an iterative sampling scheme making $\tilde{O}(nk)$ accesses to $A$ (Musco, Musco '17).

Same intuition: select a diverse set of columns which span a near-optimal low-rank approximation of the matrix.

The sample $AS$ is a projection-cost-preserving sketch for $A$ [Cohen et al. '15, '17]: for any rank-$k$ projection $P$,

$$\|AS - PAS\|_F^2 = (1 \pm \epsilon)\|A - PA\|_F^2.$$
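For reference, a definitional (and deliberately non-sublinear) computation of rank-$k$ ridge leverage scores, following the Cohen, Musco, Musco definition $\tau_i = a_i (A^T A + \lambda I)^{-1} a_i^T$ with $\lambda = \|A - A_k\|_F^2 / k$; the point of Musco, Musco '17 is that good overestimates of these scores can be found with only $\tilde{O}(nk)$ accesses to $A$:

```python
# Rank-k ridge leverage scores, evaluated directly from the definition.
import numpy as np

def ridge_leverage_scores(A, k):
    n = A.shape[0]
    s = np.linalg.svd(A, compute_uv=False)
    lam = (s[k:] ** 2).sum() / k                  # lambda = ||A - A_k||_F^2 / k
    Kinv = np.linalg.inv(A.T @ A + lam * np.eye(n))
    return np.einsum("ij,jk,ik->i", A, Kinv, A)   # tau_i = a_i (A^T A + lam I)^{-1} a_i^T

C = np.random.default_rng(9).standard_normal((60, 60))
A = C.T @ C
tau = ridge_leverage_scores(A, k=5)
print(tau.sum())   # at most 2k for the rank-k ridge scores, so few samples suffice
```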
final algorithm

Recover the low-rank approximation using two-sided sampling and the projection-cost-preserving sketch property.
summary of main ideas

View each entry of $A$ as encoding a large amount of information about its square root $B$; in particular, $a_{ij} = b_i^T b_j$.

Use this view to find a low-rank approximation to $B$ using sublinear accesses to $A$.

Since $B$ has the same singular vectors as $A$ and $\sigma_i(B) = \sqrt{\sigma_i(A)}$, a low-rank approximation of $B$ can be used to find one for $A$, albeit with a $\sqrt{n}$ factor loss in quality.

Obtain near-optimal complexity using ridge leverage scores to sample both rows and columns of $A$.
open questions

What else can be done for PSD matrices? We give applications to ridge regression, but what other linear algebraic problems require a second look?

Are there other natural classes of matrices that admit sublinear time low-rank approximation? Starting points are matrices that break the $\Omega(\operatorname{nnz}(A))$ time lower bound: e.g. binary matrices, diagonally dominant matrices.

What can we do when we have PSD matrices with additional structure, e.g. kernel matrices?
Thanks! Questions?