Matrix Approximations


1 Matrix Approximations. Sanjiv Kumar, Google Research, NY. EECS-6898, Columbia University - Fall 2010.

2 Latent Semantic Indexing (LSI). Given n documents with words from a vocabulary of size m, discover hidden topics and represent documents semantically. Process: form an m x n term-document matrix A with $A_{ij}$ = # of occurrences of term i in the j-th document; the top k left singular vectors represent topics; transform an input query into the k-dim semantic space; match it against the semantically transformed database items. n ~ O(1B), m ~ O(100K): a large sparse/dense matrix.

3 Latent Semantic Indexing (LSI). Given n documents with words from a vocabulary of size m, discover hidden topics and represent documents semantically. Document representation and retrieval: factorize the m x n term-document matrix as $A \approx U_k \Sigma_k V_k^\top$.

4 Latent Semantic Indexing (LSI). Given n documents with words from a vocabulary of size m, discover hidden topics and represent documents semantically. Document representation and retrieval: with $A \approx U_k \Sigma_k V_k^\top$ and $d_j$ the j-th column of A, each document is represented in the semantic space as $\hat{d}_j = \Sigma_k^{-1} U_k^\top d_j$. Given a query q, compute $\hat{q} = \Sigma_k^{-1} U_k^\top q$ and find the nearest neighbors in the database using $\mathrm{sim}(\hat{q}, \hat{d}_j)$.
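
A minimal numpy sketch of this LSI pipeline, assuming a dense term-document matrix and cosine similarity for retrieval; the function names (lsi_fit, lsi_query) and toy sizes are illustrative, not from the slides:

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k LSI on an m x n term-document matrix A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    # semantic representation of every document: d_hat_j = Sigma_k^{-1} U_k^T d_j
    D_hat = np.diag(1.0 / sk) @ Uk.T @ A          # k x n
    return Uk, sk, D_hat

def lsi_query(q, Uk, sk, D_hat):
    """Map a query into the k-dim semantic space and rank documents by cosine similarity."""
    q_hat = np.diag(1.0 / sk) @ Uk.T @ q
    sims = (D_hat.T @ q_hat) / (np.linalg.norm(D_hat, axis=0) * np.linalg.norm(q_hat) + 1e-12)
    return np.argsort(-sims)

A = np.random.rand(500, 200)                      # toy: 500 terms, 200 documents
Uk, sk, D_hat = lsi_fit(A, k=10)
ranking = lsi_query(A[:, 0], Uk, sk, D_hat)       # query = term vector of document 0
```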

5 Principal Component Analysis (PCA). Given n vectors, find the linear reconstruction with minimum mean squared error. Process: form the centered data matrix (i.e., subtract the mean from each vector) of n vectors in d dims; get the top k left singular vectors (equivalent to the eigenvectors of the covariance matrix); project the input data onto those singular vectors. n ~ O(1B), d ~ O(100K): a large dense matrix.

6 Kernel Methods. Commonly used for nonlinear extensions in classification, regression, dimensionality reduction, etc. Kernel matrix: most kernel methods boil down to manipulating the kernel matrix $K_{ij} = k(x_i, x_j)$, a symmetric positive semi-definite (SPSD) matrix of size n x n. One needs its top/bottom eigenvectors (e.g., kernel PCA) or a low-rank approximation (SVM). n ~ O(1B): a very large dense matrix.

7 Spectral Clustering. Graph-based clustering method; makes fewer assumptions on the data distribution than a parametric model, e.g., a GMM. Luxburg [2]

8 Spectral Clustering. Graph-based clustering method; makes fewer assumptions on the data distribution than a parametric model, e.g., a GMM. Key idea: find a (low) k-dim embedding such that neighbors remain close: $\hat{Y} = \arg\min_Y \sum_{i,j} W_{ij}\,\|y_i - y_j\|^2 = \arg\min_Y \mathrm{tr}(Y^\top L Y)$, where $L = D - W$. Solution: the bottom k eigenvectors of L (ignoring the trivial constant one), assuming the graph is connected.

9 Spectral Clustering. Graph-based clustering method; makes fewer assumptions on the data distribution than a parametric model, e.g., a GMM. Procedure: compute the weight matrix W, either dense, $W_{ij} = \exp(-\|x_i - x_j\|^2/\sigma^2)$, or sparse, $W_{ij} = \exp(-\|x_i - x_j\|^2/\sigma^2)$ if $i \sim j$ and 0 otherwise. Compute the normalized Laplacian, $L = I - D^{-1/2} W D^{-1/2}$ (symmetric) or $L = I - D^{-1} W$ (non-symmetric), where $D_{ii} = \sum_j W_{ij}$. Do a k-means clustering in the optimal reduced dims, $D^{-1/2} U_k$ or $U_k$, the bottom k eigenvectors of L (ignoring the trivial one). n ~ O(1B), d ~ O(100K): a large dense or sparse matrix.
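
A minimal sketch of this procedure with numpy/scipy, assuming a dense Gaussian affinity and the symmetric normalized Laplacian; the function name, sigma, and the toy data are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Dense-affinity spectral clustering following the slide's recipe (sketch)."""
    W = np.exp(-cdist(X, X, 'sqeuclidean') / sigma**2)    # dense weight matrix
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt       # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                   # ascending eigenvalues
    Y = eigvecs[:, 1:n_clusters + 1]                       # bottom eigenvectors, skip the trivial one
    _, labels = kmeans2(Y, n_clusters, minit='++')         # k-means in the reduced dims
    return labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
print(spectral_clustering(X, n_clusters=2, sigma=2.0))
```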

10 Two scenarios. Dense matrices: 1. Number of elements $O(n^2)$. 2. Matrix-vector products $O(n^2)$. 3. The spectrum usually falls exponentially for most real-world data, so most energy is contained in the top few eigenvalues and a low-rank approximation gives reasonable results. 4. Commonly arise in kernel methods, regression, manifold learning, second-order optimization (Hessian). Sparse matrices: 1. Number of elements O(n). 2. Very fast matrix-vector products, O(n). 3. The spectrum is usually flatter (energy falls off more slowly). 4. Commonly arise in spectral clustering, manifold learning, semi-supervised learning.

11 Low-rank Approximation. Given an m x n matrix A, find a rank-k approximation with $k < \min(m, n)$: $A_{m\times n} \approx B_{m\times k}\, C_{k\times n}$.

12 Low-rank Approximation. Given an m x n matrix A, find a rank-k approximation with $k < \min(m, n)$: $A_{m\times n} \approx B_{m\times k}\, C_{k\times n}$. Advantages: commonly used to discover structure in data, e.g., Latent Semantic Indexing (LSI) for discovering topics in documents, or recommendation systems for predicting user preferences. Storage reduces to only $O((m+n)k)$ instead of $O(mn)$, and a matrix-vector product requires only $O((m+n)k)$ instead of $O(mn)$.

13 Singular Value Decomposition (SVD). $A_{[m\times n]} = U_{[m\times n]}\, \Sigma_{[n\times n]}\, V^\top_{[n\times n]}$.

14 Singular Value Decomposition (SVD). WLOG suppose m > n: $A_{[m\times n]} = U_{[m\times n]}\, \Sigma_{[n\times n]}\, V^\top_{[n\times n]}$. Left singular vectors: $U^\top U = I$; the columns of U are the eigenvectors of $AA^\top$.

15 Singular Value Decomposition (SVD). WLOG suppose m > n: $A = U\Sigma V^\top$ as above. Left singular vectors: $U^\top U = I$; the columns of U are the eigenvectors of $AA^\top$. Right singular vectors: $V^\top V = I$; the columns of V are the eigenvectors of $A^\top A$.

16 Singular Value Decomposition (SVD). As above, $A = U\Sigma V^\top$ with $U^\top U = I$ (eigenvectors of $AA^\top$) and $V^\top V = I$ (eigenvectors of $A^\top A$). The singular values satisfy $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $\sigma_1 \ge \ldots \ge \sigma_n \ge 0$, and the $\sigma_i^2$ are the eigenvalues of $A^\top A$ or $AA^\top$.

17 Low-rank Approximation. Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)} = \sqrt{\mathrm{tr}(AA^\top)} = \sqrt{\sum_i \sigma_i^2}$. Spectral norm: $\|A\|_2 = \max_{x \neq 0} \|Ax\|_2/\|x\|_2 = \sigma_1$. They satisfy $\|A\|_2 \le \|A\|_F \le \sqrt{n}\,\|A\|_2$.
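
A quick numpy check of these norm identities on a toy matrix (purely illustrative):

```python
import numpy as np

A = np.random.rand(6, 4)
s = np.linalg.svd(A, compute_uv=False)

fro = np.linalg.norm(A, 'fro')          # sqrt of the sum of squared entries = sqrt(tr(A^T A))
spec = np.linalg.norm(A, 2)             # largest singular value

assert np.isclose(fro, np.sqrt((s**2).sum()))
assert np.isclose(spec, s[0])
assert spec <= fro <= np.sqrt(A.shape[1]) * spec   # ||A||_2 <= ||A||_F <= sqrt(n) ||A||_2
```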

18 Low-rank Approximation. With the Frobenius and spectral norms defined above, the best rank-k approximation in either norm is the truncated SVD: $A_k = \arg\min_{D:\ \mathrm{rank}(D)=k} \|A - D\|_{F,2}$, with $A_k = U_{m\times k}\, \Sigma_{k\times k}\, V^\top_{k\times n}$.

19 Low-rank Approximation. $A_k = \arg\min_{D:\ \mathrm{rank}(D)=k} \|A - D\|_{F,2} = U_{m\times k}\, \Sigma_{k\times k}\, V^\top_{k\times n}$. Removing the small entries of $\Sigma$ has a noise-removal interpretation, e.g., PCA. Computational complexity: full SVD $O(mn\min\{m, n\})$; rank-k SVD $O(mnk)$.
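
A short numpy sketch of the Eckart-Young fact above: truncating the SVD at rank k gives the best rank-k approximation, with Frobenius error determined by the discarded singular values (function name and sizes are illustrative):

```python
import numpy as np

def best_rank_k(A, k):
    """Rank-k SVD truncation: the minimizer of ||A - D|| over all rank-k matrices D."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(100, 60)
A5 = best_rank_k(A, 5)
# Frobenius error equals the square root of the sum of the discarded squared singular values
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A5, 'fro'), np.sqrt((s[5:]**2).sum()))
```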

20 Matrix Projection. $A_k = U_k \Sigma_k V_k^\top = U_k U_k^\top A = A V_k V_k^\top$.

21 Matrix Projection. $A_k = U_k \Sigma_k V_k^\top = U_k U_k^\top A = A V_k V_k^\top$: the projection of A onto the space spanned by the columns of $U_k$.

22 Matrix Projection. $A_k = U_k \Sigma_k V_k^\top = U_k U_k^\top A = A V_k V_k^\top$: the projection of A onto the space spanned by the columns of $U_k$. For an approximate decomposition $\tilde{U}_k, \tilde{\Sigma}_k, \tilde{V}_k$, however, the low-rank approximation from the SVD, $\tilde{U}_k \tilde{\Sigma}_k \tilde{V}_k^\top$, and the matrix projection, $\tilde{U}_k \tilde{U}_k^\top A$, give different results!
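
A small numpy illustration: with exact singular vectors the truncated SVD and the projection coincide, while a projection built from an approximate basis behaves differently (the column-subsample basis below is just one convenient stand-in for an approximate decomposition):

```python
import numpy as np

A = np.random.rand(80, 50)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5

A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k SVD truncation
A_proj = U[:, :k] @ U[:, :k].T @ A                # projection onto span of the exact U_k
assert np.allclose(A_svd, A_proj)                 # identical when the singular vectors are exact

# with an approximate basis Q (here from a random column subsample), Q Q^T A is
# still a valid rank-k projection but no longer matches the optimal truncated SVD
Q, _ = np.linalg.qr(A[:, np.random.choice(50, k, replace=False)])
print(np.linalg.norm(A - Q @ Q.T @ A, 'fro'), np.linalg.norm(A - A_svd, 'fro'))
```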

23 Power Method. To compute the largest eigenvalue/eigenvector of a matrix A. Assume A is a symmetric matrix, i.e., it is diagonalizable: $A = U \Lambda U^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.

24 Power Method. To compute the largest eigenvalue/eigenvector of a symmetric (diagonalizable) matrix $A = U \Lambda U^\top$, $\lambda_1 \ge \ldots \ge \lambda_n$. Power iteration: 1. Start with a random vector $q_0 \in \mathbb{R}^n$. 2. For $k = 1, 2, \ldots, K$: compute $z_k = A q_{k-1}$; normalize, $q_k = z_k / \|z_k\|$; and compute the eigenvalue estimate $\lambda^{(k)} = q_k^\top A q_k$. This is just a repeated matrix-vector product.
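
A minimal numpy sketch of power iteration as described above; the helper name, iteration count, and toy matrix are illustrative:

```python
import numpy as np

def power_method(A, n_iters=200, seed=0):
    """Power iteration for the largest eigenvalue/eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(A.shape[0])
    q /= np.linalg.norm(q)
    for _ in range(n_iters):
        z = A @ q                      # repeated matrix-vector product
        q = z / np.linalg.norm(z)
    lam = q @ A @ q                    # Rayleigh-quotient estimate of lambda_1
    return lam, q

M = np.random.rand(50, 50); A = (M + M.T) / 2      # toy symmetric matrix
lam, q = power_method(A)
assert np.isclose(lam, np.linalg.eigvalsh(A)[-1], atol=1e-6)
```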

25 Convergence properties. Suppose $q_0 = a_1 u_1 + a_2 u_2 + \ldots + a_n u_n$; this is always possible since the columns of U make a basis of $\mathbb{R}^n$. Then $A^k q_0 = a_1 \lambda_1^k u_1 + a_2 \lambda_2^k u_2 + \ldots + a_n \lambda_n^k u_n$, since $A u_i = \lambda_i u_i \Rightarrow A^k u_i = \lambda_i^k u_i$.

26 Convergence properties. With $q_0 = a_1 u_1 + \ldots + a_n u_n$ as above, $A^k q_0 = a_1 \lambda_1^k \big[u_1 + \sum_{j \ge 2} \frac{a_j}{a_1} \big(\frac{\lambda_j}{\lambda_1}\big)^k u_j\big]$, so $\mathrm{dist}(\mathrm{span}(q_k), \mathrm{span}(u_1)) = O(|\lambda_2/\lambda_1|^k)$ if $a_1 \neq 0$.

27 Convergence properties. Since $\mathrm{dist}(\mathrm{span}(q_k), \mathrm{span}(u_1)) = O(|\lambda_2/\lambda_1|^k)$ if $a_1 \neq 0$, a bigger eigen-gap leads to faster convergence, and $a_1 \neq 0$ is easy to satisfy by random initialization. The method can be generalized to multiple eigenvectors simultaneously by orthogonal iteration via QR-decomposition!
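
A hedged sketch of orthogonal (subspace) iteration for a symmetric PSD matrix, assuming the top-k eigenvalues are also the largest in magnitude; names, sizes, and the iteration count are illustrative:

```python
import numpy as np

def orthogonal_iteration(A, k, n_iters=200, seed=0):
    """Power method generalized to the top-k eigenvectors via one QR step per iteration."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    for _ in range(n_iters):
        Q, _ = np.linalg.qr(A @ Q)             # multiply by A, then re-orthonormalize
    return np.diag(Q.T @ A @ Q), Q             # Rayleigh-quotient eigenvalue estimates, basis

M = np.random.rand(100, 30); A = M @ M.T       # toy PSD matrix
vals, Q = orthogonal_iteration(A, k=3)
print(np.sort(vals), np.linalg.eigvalsh(A)[-3:])
```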

28 PageRank. Which document or link is more popular, independent of the query? [Wikipedia] Random walk over pages: if one randomly clicks the links, what is the probability that one ends up at page B?

29 PageRank. Goal: the stationary probability distribution over pages. Adjacency matrix $A_{[n\times n]}$: $A_{ij} = 1$ if $i \to j$, 0 otherwise. Transition matrix $T_{[n\times n]}$: $T_{ij} = p(j \mid i)$ if $i \to j$, 0 otherwise. The transition matrix is stochastic, i.e., its rows sum to 1.

30 PageRank. With the row-stochastic transition matrix T defined above, the stationary probability distribution is given by the eigenvector corresponding to the largest eigenvalue (i.e., 1) of $T^\top$: $T^\top u = u$.

31 PageRank. The stationary distribution $T^\top u = u$ is found by power iteration, $u_{k+1} = \hat{T}^\top u_k$, starting from some $u_0$. Damping is used to make sure the graph is connected: $\hat{T} = \alpha T + (1 - \alpha)(1/n)\, \mathbf{1}\mathbf{1}^\top$, where $\mathbf{1}$ is the n-vector of 1s.
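
A small numpy sketch of PageRank by power iteration with damping, following the slide's formulation; the dangling-page handling and the toy 3-page graph are assumptions for illustration:

```python
import numpy as np

def pagerank(A, alpha=0.85, n_iters=100):
    """PageRank by power iteration on the damped transition matrix (sketch).

    A is a binary adjacency matrix with A[i, j] = 1 if page i links to page j.
    """
    n = A.shape[0]
    out_deg = A.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1                       # avoid division by zero for dangling pages
    T = A / out_deg                                 # row-stochastic transition matrix
    T_hat = alpha * T + (1 - alpha) / n             # damping: teleport uniformly with prob 1-alpha
    u = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        u = T_hat.T @ u                             # u_{k+1} = T_hat^T u_k
        u /= u.sum()
    return u

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
print(pagerank(A))
```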

32 Lanczos Method. A technique to solve large sparse symmetric eigenproblems. Useful when only the top or bottom few eigenvalues/vectors are needed. Generates a sequence of tridiagonal matrices whose extremal eigenvalues are progressively better estimates of the desired values.

33 Lanczos Method. A technique to solve large sparse symmetric eigenproblems, useful when only the top or bottom few eigenvalues/vectors are needed; it generates a sequence of tridiagonal matrices whose extremal eigenvalues are progressively better estimates of the desired values. Krylov subspaces: $\mathcal{K}(A, q_1, k) = \mathrm{span}\{q_1, A q_1, \ldots, A^{k-1} q_1\}$.

34 Lanczos Method. Krylov subspaces: $\mathcal{K}(A, q_1, k) = \mathrm{span}\{q_1, A q_1, \ldots, A^{k-1} q_1\}$. Key idea: successively generate a new orthonormal vector $q_{k+1}$ such that $\mathrm{span}\{q_1, q_2, \ldots, q_{k+1}\} = \mathrm{span}\{q_1, A q_1, \ldots, A^k q_1\}$. How to find $\{q_1, q_2, \ldots, q_{k+1}\}$?

35 Lanczos Method. Why tridiagonalization? For any symmetric matrix A: $Q^\top A Q = T$, where Q is an n x n orthogonal matrix and T is tridiagonal, with diagonal entries $\alpha_1, \ldots, \alpha_n$ and off-diagonal entries $\beta_1, \ldots, \beta_{n-1}$.

36 Lanczos Method. Why tridiagonalization? For any symmetric matrix A: $Q^\top A Q = T$ with Q orthogonal and T tridiagonal (diagonal $\alpha_i$, off-diagonal $\beta_i$), so $AQ = QT$. Writing $Q = [q_1\ q_2\ \ldots\ q_n]$ and letting $e_k$ denote the k-th standard basis vector, $q_k = Q e_k$.

37 Lanczos Method. With $Q^\top A Q = T$ ($AQ = QT$, $q_k = Q e_k$), the Krylov matrix satisfies $K(A, q_1, n) = [q_1\ A q_1\ \ldots\ A^{n-1} q_1] = Q\, [e_1\ T e_1\ \ldots\ T^{n-1} e_1]$. The columns of Q form an orthogonal basis for the Krylov matrix!

38 Lanczos Method. Construct partial Q and partial T iteratively from $AQ = QT$. Focusing on the k-th column of $AQ$: $A q_k = \beta_{k-1} q_{k-1} + \alpha_k q_k + \beta_k q_{k+1}$, so $\beta_k q_{k+1} = (A - \alpha_k I) q_k - \beta_{k-1} q_{k-1} \equiv r_k$ and $q_{k+1} = r_k / \beta_k$ if $\beta_k = \|r_k\| \neq 0$.

39 Lanczos Method. From $A q_k = \beta_{k-1} q_{k-1} + \alpha_k q_k + \beta_k q_{k+1}$, premultiplying by $q_k^\top$ and using the orthonormality of the q's gives $\alpha_k = q_k^\top A q_k$. Lanczos iteration: set $r_0 = q_1$, $\beta_0 = 1$, $q_0 = 0$; for $k = 1, 2, \ldots$ while $\beta_{k-1} \neq 0$: $q_k = r_{k-1} / \beta_{k-1}$, $\alpha_k = q_k^\top A q_k$, $r_k = (A - \alpha_k I) q_k - \beta_{k-1} q_{k-1}$, $\beta_k = \|r_k\|$. Each step needs only one matrix-vector product.
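
A compact numpy sketch of the Lanczos iteration above, without the reorthogonalization that a practical implementation would add; names and the toy comparison are illustrative:

```python
import numpy as np

def lanczos(A, k, seed=0):
    """k steps of the Lanczos iteration for a symmetric matrix A (no reorthogonalization)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    r = rng.standard_normal(n)
    beta_prev = np.linalg.norm(r)
    q_prev = np.zeros(n)
    for j in range(k):
        q = r / beta_prev
        Q[:, j] = q
        alpha[j] = q @ (A @ q)                     # alpha_j = q_j^T A q_j
        r = A @ q - alpha[j] * q - beta_prev * q_prev
        beta_prev = np.linalg.norm(r)
        q_prev = q
        if beta_prev == 0:                         # Krylov space exhausted
            break
        if j < k - 1:
            beta[j] = beta_prev
    T = np.diag(alpha) + np.diag(beta[:k-1], 1) + np.diag(beta[:k-1], -1)
    return Q, T

# extremal eigenvalues of the small tridiagonal T approximate those of A
M = np.random.rand(300, 300); A = (M + M.T) / 2
Q, T = lanczos(A, k=30)
print(np.linalg.eigvalsh(T)[-1], np.linalg.eigvalsh(A)[-1])
```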

40 Lanczos Method. When to stop? Ideally when k equals the rank of the Krylov matrix; that is usually too long, so a threshold is used on the residual. After k steps, $A Q_k = Q_k T_k + r_k e_k^\top$. Decompose $T_k = S_k \Theta_k S_k^\top$ with $\Theta_k = \mathrm{diag}(\theta_1, \ldots, \theta_k)$.

41 Lanczos Method. With $A Q_k = Q_k T_k + r_k e_k^\top$ and $T_k = S_k \Theta_k S_k^\top$, $\Theta_k = \mathrm{diag}(\theta_1, \ldots, \theta_k)$: the columns of $Q_k S_k$ are the estimated eigenvectors of A, and the $\theta_i$ are the estimated eigenvalues of A; the eigendecomposition of the small tridiagonal $T_k$ is cheap.

42 Lanczos Method. Estimated eigenvalues $\theta_i$ and eigenvectors $Q_k S_k$ as above. Accuracy: $\min_{\mu \in \lambda(A)} |\theta_i - \mu| \le \beta_k\, |s_{ki}|$, where $s_{ki}$ is the last component of the i-th column of $S_k$. Convergence (Kaniel-Paige theory): $\lambda_1 \ge \theta_1 \ge \lambda_1 - (\lambda_1 - \lambda_n)\,\tan^2(\phi_1)\,/\,\big(c_{k-1}(1 + 2\rho_1)\big)^2$, where $\cos(\phi_1) = |q_1^\top u_1|$, $u_1$ is the top eigenvector of A, $c_{k-1}$ is the Chebyshev polynomial of degree k-1, and $\rho_1 = (\lambda_1 - \lambda_2)/(\lambda_2 - \lambda_n)$.

43 Lanczos Method. By the accuracy and Kaniel-Paige convergence bounds above, Lanczos iterations converge faster than power iteration! Practical implementation: avoid rounding errors (loss of orthogonality among the $q_k$).

44 Arnoldi's Method. A technique to solve large sparse asymmetric eigenproblems. Generates a sequence of Hessenberg matrices whose extremal eigenvalues are progressively better estimates. $Q^\top A Q = H$ with H upper Hessenberg, so $AQ = QH$ and $A q_k = \sum_{i=1}^{k+1} h_{ik} q_i$, where $h_{ik} = q_i^\top A q_k$; the residual is $r_k = A q_k - \sum_{i=1}^{k} h_{ik} q_i$.

45 Arnoldi's Method. With $r_k = A q_k - \sum_{i=1}^{k} h_{ik} q_i$ and $h_{ik} = q_i^\top A q_k$: if $h_{k+1,k} = \|r_k\| \neq 0$, set $q_{k+1} = r_k / h_{k+1,k}$. After k steps, $A Q_k = Q_k H_k + r_k e_k^\top$. For $H_k y = \lambda y$ ($\lambda$ a Ritz value, $Q_k y$ a Ritz vector), the residual satisfies $\|(A - \lambda I) Q_k y\| = |e_k^\top y|\, \|r_k\|$. In practice, applied with multiple restarts!
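
A minimal numpy sketch of the Arnoldi iteration for a nonsymmetric matrix, with Ritz values taken from the square part of H; restarts are omitted, and the names and sizes are illustrative:

```python
import numpy as np

def arnoldi(A, k, seed=0):
    """k steps of the Arnoldi iteration for a (possibly nonsymmetric) matrix A."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))                       # upper Hessenberg
    q = rng.standard_normal(n)
    Q[:, 0] = q / np.linalg.norm(q)
    for j in range(k):
        r = A @ Q[:, j]
        for i in range(j + 1):                     # orthogonalize against previous q_i
            H[i, j] = Q[:, i] @ r
            r -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(r)
        if H[j + 1, j] == 0:                       # invariant subspace found
            return Q[:, :j + 1], H[:j + 1, :j + 1]
        Q[:, j + 1] = r / H[j + 1, j]
    return Q, H

# Ritz values: eigenvalues of the square part H[:k, :k]
A = np.random.rand(200, 200)
Q, H = arnoldi(A, k=40)
ritz = np.linalg.eigvals(H[:40, :40])
print(sorted(ritz.real)[-1], sorted(np.linalg.eigvals(A).real)[-1])
```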

46 Randomized SVD and low-rank approx. Basic problem: given an m x n matrix A and an integer $l = k + p$ (p is oversampling), find an m x l orthonormal matrix Q such that $A \approx Q Q^\top A$. The columns of Q form an orthonormal basis for the (approximate) range of A, and $l \ll m, n$.

47 Randomized SVD and low-rank approx. Given an m x l orthonormal Q with $A \approx Q Q^\top A$, how to do SVD? 1. Form a small (l x n) matrix $B = Q^\top A$ -- $O(mnl)$. 2. Compute the SVD of B: $B = \hat{U} \Sigma V^\top$ -- $O(nl^2)$. 3. Set $U = Q \hat{U}$ -- $O(ml^2)$. Of course, the best Q would be $Q = U_l$, but that is expensive! How to build an approximate Q?

48 Randomized method. Basic problem: given an m x n matrix A and an integer l, find an m x l orthonormal matrix Q such that $A \approx Q Q^\top A$. Procedure: 1. Draw l random vectors $w_1, w_2, \ldots, w_l \in \mathbb{R}^n$ from P(w). 2. Form the projected sample vectors $y_1 = A w_1, \ldots, y_l = A w_l \in \mathbb{R}^m$. 3. Form orthonormal vectors $q_1, q_2, \ldots, q_l \in \mathbb{R}^m$ such that $\mathrm{span}(q_1, \ldots, q_l) = \mathrm{span}(y_1, \ldots, y_l)$ -- Gram-Schmidt or QR.

49 Randomized method. Matrix version of the procedure: 1. Construct a random matrix $\Omega_l$ of size n x l -- $O(nl)$. 2. Form the m x l sample matrix $Y_l = A \Omega_l$ -- $O(mnl)$, the most expensive step; use a randomized FFT for $O(mn \log(l))$. 3. Form an m x l orthonormal matrix $Q_l$ such that $Y_l = Q_l Q_l^\top Y_l$ -- $O(ml^2)$.
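
A short numpy sketch combining the range finder above with the small-SVD step of slide 47, using a plain Gaussian test matrix rather than a randomized FFT; the function name, oversampling p, and toy sizes are illustrative:

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Randomized SVD sketch: range finder with oversampling, then SVD of the small B = Q^T A."""
    m, n = A.shape
    l = k + p                                      # target rank plus oversampling
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, l))            # random test matrix, O(nl)
    Y = A @ Omega                                  # sample matrix, O(mnl)
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for the approximate range of A
    B = Q.T @ A                                    # small l x n matrix
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat                                  # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

A = np.random.rand(2000, 500)
U, s, Vt = randomized_svd(A, k=20)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt, 2), np.linalg.svd(A, compute_uv=False)[20])
```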

50 Randomized method. With $Q_l$ constructed as above (random $\Omega_l$, sample matrix $Y_l = A \Omega_l$, orthonormalization), the error in the approximation satisfies $\|A - Q_l Q_l^\top A\|_2 \le \big[1 + 11 \sqrt{k+p}\, \sqrt{\min(m, n)}\big]\, \sigma_{k+1}$ with probability $1 - 6 p^{-p}$.

51 Matrices with slowly decaying spectrum. Basic idea: incorporate Krylov-space type power iteration. $B = (A A^\top)^q A$ has the same singular vectors as A but faster-decaying singular values ($\sigma_i(B) = \sigma_i(A)^{2q+1}$).

52 Matrices with slowly decaying spectrum. Basic idea: work with $B = (A A^\top)^q A$, which has the same singular vectors as A but faster-decaying singular values. Algorithm: 1. Construct a random matrix $\Omega_l$ of size n x l. 2. Form the m x l sample matrix $Z = (A A^\top)^q A\, \Omega_l$. 3. Form an m x l orthonormal matrix $Q_l$ from Z using partial QR.
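
A hedged numpy sketch of this power-iteration variant, re-orthonormalizing between multiplications for numerical stability (a common stabilization, not spelled out on the slide); names and sizes are illustrative:

```python
import numpy as np

def randomized_range_finder_power(A, l, q=2, seed=0):
    """Range finder for A using q steps of power iteration with intermediate QR factorizations."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], l))
    Z = A @ Omega
    Q, _ = np.linalg.qr(Z)
    for _ in range(q):                             # effectively Z = (A A^T)^q A Omega, stabilized
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    return Q

A = np.random.rand(1000, 400) @ np.random.rand(400, 400)   # toy matrix with slowly decaying spectrum
Q = randomized_range_finder_power(A, l=30, q=2)
print(np.linalg.norm(A - Q @ Q.T @ A, 2))
```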

53 Matrices with slowly decaying spectrum. With $Q_l$ built from $Z = (A A^\top)^q A\, \Omega_l$, the error in the approximation satisfies $\|A - Q_l Q_l^\top A\|_2 \le \big[1 + 11 \sqrt{k+p}\, \sqrt{\min(m, n)}\big]^{1/(2q+1)}\, \sigma_{k+1}$ with probability $1 - 6 p^{-p}$.

54 Randomized method: Example. Eigenfaces: m = 754 face images, each represented by a vector of n pixels.

55 Randomized method. Advantages: 1. Much faster than standard methods: $O(mn \log k)$ instead of $O(mnk)$. 2. Needs to look at the data only a very few times (sometimes a single pass!). 3. Easy parallel implementation. 4. The error in approximation can be made arbitrarily small with very high probability (failure probability even $10^{-20}$) inexpensively. 5. Simple to code.

56 Randomized vs Sparse Methods. Randomized methods build the approximate range of A by multiplying A with fresh random vectors: $y_1 = A w_1, \ldots, y_l = A w_l$. Sparse methods build the approximate range of A by multiplying recursively with the outcome of the previous stage, starting from a random vector: $y_1 = A w_1, \ldots, y_l = A y_{l-1}$ (if A is square). Advantages of the randomized method: 1. Similar computational complexity. 2. Can be parallelized. 3. Provides better approximations even for small spectral gaps. 4. Can learn from very few (sometimes just one) passes over the data.

57 References.
1. Deerwester, S., et al., Improving Information Retrieval with Latent Semantic Indexing, Proceedings of the 51st Annual Meeting of the American Society for Information Science, 1988.
2. Ulrike von Luxburg, A Tutorial on Spectral Clustering, Tech Report TR-149, Max Planck Institute.
3. Golub and Van Loan, Matrix Computations, Johns Hopkins Press.
4. Sergey Brin and Larry Page (1998), "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proceedings of the 7th International Conference on World Wide Web (WWW), Brisbane, Australia.
5. A. Frieze, R. Kannan and S. Vempala, Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations, Proceedings of the Foundations of Computer Science, 1998.
6. P. Drineas, R. Kannan and M. Mahoney, Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication, Tech Report.
7. Halko, Martinsson, and Tropp, Finding Structure with Randomness: Stochastic Algorithms for Constructing Approximate Matrix Decompositions, ACM Report, 2009.
