Dominant feature extraction
Francqui Lecture, 7-5-2010
Paul Van Dooren, Université catholique de Louvain, CESAME, Louvain-la-Neuve, Belgium
Goal of this lecture
Develop basic ideas for large scale dense matrices
Recursive procedures for:
dominant singular subspaces
multipass iteration
subset selection
dominant eigenspace of a positive definite matrix
possible extensions
which are all based on solving cheap subproblems
Show accuracy and complexity results
Dominant singular subspaces
Given A ∈ R^{m×n}, approximate it by a rank-k factorization B C, with B ∈ R^{m×k}, C ∈ R^{k×n}, by solving min ‖A − BC‖_2, k ≪ m, n
This has several applications in image compression, information retrieval and model reduction (POD)
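The minimizer of this problem is the truncated SVD (Eckart-Young). As a point of reference for the recursive algorithms that follow, here is a direct, non-recursive numpy sketch (function and variable names are mine, not from the lecture):

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k factors B (m x k), C (k x n) via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]   # scale columns of U by sigma_i

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
B, C = best_rank_k(A, 3)
# the 2-norm error of the best rank-3 approximation equals sigma_4
err = np.linalg.norm(A - B @ C, 2)
```

The recursive procedures below reach (approximately) the same factorization without ever forming the full SVD.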
Information retrieval
Low memory requirement: O(k(m + n))
Fast queries: Ax ≈ L(Ux) in O(k(m + n)) time
Easy to obtain: O(kmn) flops
Proper Orthogonal Decomposition (POD)
Compute a state trajectory for one "typical" input
Collect the principal directions to project on
Recursivity
We pass once over the data with a window of length k and perform along the way a set of windowed SVDs of dimension m × (k + l)
Step 1: expand by appending l columns (Gram-Schmidt)
Step 2: contract by deleting the l least important columns (SVD)
Expansion (G-S)
Append column a_+ to the current approximation U R V^T to get
[U R V^T  a_+] = [U  a_+] [R 0; 0 1] [V 0; 0 1]^T
Update with Gram-Schmidt to recover a new decomposition Û R̂ V̂^T, using
r̂ = U^T a_+,  â = a_+ − U r̂,  â = û ρ̂  (since a_+ = U r̂ + û ρ̂)
so that Û = [U û], R̂ = [R r̂; 0 ρ̂], V̂ = [V 0; 0 1]
Contraction (SVD)
Now remove the l smallest singular values of this new Û R̂ V̂^T via
Û R̂ V̂^T = (Û G_u)(G_u^T R̂ G_v)(G_v^T V̂^T)
and keep U_+ R_+ V_+^T as best approximation of Û R̂ V̂^T (just delete the l smallest singular values)
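One expand/contract pair can be sketched as follows — a minimal numpy version that tracks only U and R (not V), with l = 1; all names are mine, not the lecture's:

```python
import numpy as np

def expand_contract(U, R, a_plus, k):
    """Expand U R by Gram-Schmidt with the new column a_plus,
    then contract back to rank k via the small (k+1) x (k+1) SVD."""
    r = U.T @ a_plus                       # r_hat = U^T a+
    q = a_plus - U @ r                     # component orthogonal to range(U)
    rho = np.linalg.norm(q)
    U_hat = np.hstack([U, (q / rho)[:, None]])
    R_hat = np.block([[R, r[:, None]],
                      [np.zeros((1, k)), np.array([[rho]])]])
    Gu, sig, Gvt = np.linalg.svd(R_hat)    # cheap: (k+1) x (k+1)
    # rotate, then deflate the smallest singular value sig[k]
    return U_hat @ Gu[:, :k], np.diag(sig[:k]), sig[k]

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
k = 3
U, R = np.linalg.qr(A[:, :k])
for j in range(k, 10):
    U, R, mu = expand_contract(U, R, A[:, j], k)
```

After the sweep the diagonal of R underestimates the true top singular values of A, consistent with the interlacing bounds quoted later.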
Complexity of one pair of steps
The Gram-Schmidt update (expansion) requires 4mk flops per column (essentially for the products r̂ = U^T a_+ and â = a_+ − U r̂)
For G_u^T R̂ G_v = [R_+ 0; 0 μ_i] one requires the left and right singular vectors of R̂, which can be obtained in O(k^2) flops per singular value (using inverse iteration)
Multiplying Û G_u and V̂ G_v requires 4mk flops per deflated column
The overall procedure requires 8mk flops per processed column and hence 8mnk flops for a rank-k approximation of an m × n matrix A
One shows that A = U [R A_12; 0 A_22] V^T, where ‖A_22‖_F is known
Error estimates
Let E := A − Â = U Σ V^T − Û Σ̂ V̂^T and μ := ‖E‖_2
Let μ̂ := max_i μ_i, where μ_i is the neglected singular value at step i
One shows the error bounds
μ̂ ≤ σ_{k+1} ≤ μ ≤ √(n − k) μ̂,
σ̂_i ≤ σ_i ≤ σ̂_i + μ̂^2 / (2 σ̂_i),
tan θ_k ≤ tan θ̂_k := μ̂^2 / (σ̂_k^2 − μ̂^2),  tan φ_k ≤ tan φ̂_k := μ̂ σ̂_1 / (σ̂_k^2 − μ̂^2)
where θ_k, φ_k are the canonical angles of dimension k:
cos θ_k := ‖U(:, 1:k)^T Û‖_2,  cos φ_k := ‖V(:, 1:k)^T V̂‖_2
Examples
The bounds get much better when the gap between σ_k and σ_{k+1} is large
Convergence
How quickly do we track the subspaces? How cos θ_k^{(i)} evolves with the time step i
Example
Find the dominant behavior in an image sequence
Images can have up to 10^6 pixels
Multipass iteration
Low Rank Incremental SVD can be applied in several passes, say to [A A ⋯ A]
After the first block (or "pass") a good approximation of the dominant space Û has already been constructed
Going over to the next block (second "pass") will improve it, etc.
Theorem: Convergence of the multipass method is linear, with approximate convergence ratio ψ/κ^2 < 1, where
ψ measures the orthogonality of the residual columns of A,
κ is the ratio σ_k/σ_{k+1} of A
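The multipass scheme can be sketched by simply warm-starting each incremental pass from the previous one (self-contained numpy sketch; function, test matrix and gap are mine, chosen so that a few passes suffice):

```python
import numpy as np

def incsvd_pass(A, k, U=None, R=None):
    """One pass of windowed incremental SVD over the columns of A,
    optionally warm-started from a previous pass."""
    if U is None:
        U, R = np.linalg.qr(A[:, :k])
        start = k
    else:
        start = 0
    for j in range(start, A.shape[1]):
        r = U.T @ A[:, j]
        q = A[:, j] - U @ r
        rho = np.linalg.norm(q)
        Uh = np.hstack([U, (q / rho)[:, None]])
        Rh = np.block([[R, r[:, None]],
                       [np.zeros((1, k)), np.array([[rho]])]])
        Gu, s, _ = np.linalg.svd(Rh)
        U, R = Uh @ Gu[:, :k], np.diag(s[:k])
    return U, R

# a test matrix with a clear gap sigma_3 >> sigma_4
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((30, 20)))
Q2, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = Q1 @ np.diag(np.r_[10.0, 5.0, 2.0, 0.01 * np.ones(17)]) @ Q2.T
U, R = incsvd_pass(A, 3)          # first pass
for _ in range(3):                # each further pass refines U
    U, R = incsvd_pass(A, 3, U, R)
```

With this gap the cosines of the principal angles between U and the true dominant subspace approach 1 after very few passes, in line with the linear convergence of the theorem.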
Figure: convergence behavior for increasing gap between "signal" and "noise" (x-axis: number of INCSVD steps)
Figure: convergence behavior for increasing orthogonality between "residual vectors" (x-axis: number of INCSVD steps)
Eigenfaces analysis
Ten dominant left singular vectors of the ORL Database of Faces (400 images, 40 subjects, 92 × 112 pixels, giving a 10304 × 400 matrix)
Using the MATLAB SVD function
Using one pass of incremental SVD
Maximal angle: 63, maximum relative error in singular values: 48%
Conclusions Incremental SVD
A useful and economical SVD approximation of A ∈ R^{m×n}
For matrices with columns that are very large or "arrive" with time
Complexity is proportional to mnk and the number of "passes"
Algorithms due to
[1] Manjunath-Chandrasekaran-Wang (95)
[2] Levy-Lindenbaum (00)
[3] Chahlaoui-Gallivan-Van Dooren (01)
[4] Brand (03)
[5] Baker-Gallivan-Van Dooren (09)
Convergence analysis and accuracy in refs [3], [4], [5]
Subset selection
We want a good "approximation" of A ∈ R^{m×n} by a product B P^T, where P ∈ R^{n×k} is a "selection matrix", i.e. a submatrix of the identity I_n
This seems connected to min ‖A − B P^T‖_2, and maybe similar techniques can be used as for incremental SVD
Clearly, if B = AP, we just select a subset of the columns of A
Rather than minimizing ‖A − B P^T‖_2 we maximize vol(B), where
vol(B) = det(B^T B)^{1/2} = ∏_{i=1}^k σ_i(B),  m ≥ k
There are (n choose k) possible choices; the problem is NP-hard and there is no polynomial-time approximation algorithm
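For tiny n and k the max-volume subset can still be found by brute force over all (n choose k) choices, which makes the objective concrete (illustrative code only, not the proposed algorithm):

```python
import numpy as np
from itertools import combinations

def vol(B):
    """vol(B) = sqrt(det(B^T B)) = product of the singular values of B."""
    return np.sqrt(np.linalg.det(B.T @ B))

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
k = 2
# exhaustive search over all C(5, 2) = 10 column subsets
best = max(combinations(range(A.shape[1]), k),
           key=lambda c: vol(A[:, list(c)]))
```

The exponential cost of this search is exactly what the recursive heuristic on the next slides avoids.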
Heuristics
Gu-Eisenstat show that the Strong Rank Revealing QR factorization (SRRQR) solves the following simpler problem: B is sub-optimal if there is no swapping of a single column of A (yielding B̂) that has a larger volume (constrained minimum)
Here, we propose a simpler recursive "updating" algorithm that has complexity O(mnk) rather than O(mn^2) for Gu-Eisenstat
The idea is again based on a sliding window of size k + 1 (or k + l)
Sweep through the columns of A while maintaining a "best" subset B
Append a column of A to B, yielding B_+
Contract B_+ to B̂ by deleting the "weakest" column of B_+
Deleting the weakest column
Let B = A(:, 1:k) to start with and let B = QR, where R is k × k
Append the next column a_+ of A to form B_+ and update its decomposition using Gram-Schmidt:
B_+ := [QR  a_+] = [Q  a_+] [R 0; 0 1] = [Q  q̂] [R r̂; 0 ρ̂] = Q_+ R_+
with r̂ = Q^T a_+, â = a_+ − Q r̂, â = q̂ ρ̂ (since a_+ = Q r̂ + q̂ ρ̂)
Contract B_+ to B̂ by deleting the "weakest" column of R_+
This can be done in O(mk^2) using Gu-Eisenstat's SRRQR method, but an even simpler heuristic uses only O((m + k)k) flops
Golub-Klema-Stewart heuristic
Let R_+ v = σ_{k+1} u be the singular vector pair corresponding to the smallest singular value σ_{k+1} of R_+, and let v_i be the components of v
Let R_i be the submatrix obtained by deleting column i from R_+; then
σ_{k+1}^2/σ_1^2 + (1 − σ_{k+1}^2/σ_1^2) v_i^2 ≤ vol^2(R_i) / ∏_{j=1}^k σ_j^2 ≤ σ_{k+1}^2/σ_k^2 + (1 − σ_{k+1}^2/σ_k^2) v_i^2
Maximizing |v_i| thus maximizes a lower bound on vol^2(R_i)
In practice this is almost always optimal, and guaranteed to be so if
σ_{k+1}^2/σ_1^2 + (1 − σ_{k+1}^2/σ_1^2) v_i^2 ≥ σ_{k+1}^2/σ_k^2 + (1 − σ_{k+1}^2/σ_k^2) v_j^2  for all j ≠ i
GKS method
Start with B = A(:, 1:k) = QR, where R is k × k
For j = k + 1 : n
  append column a_+ := A(:, j) to get B_+
  update its QR decomposition to B_+ = Q_+ R_+
  contract B_+ to yield a new B̂ using the GKS heuristic
  update its QR decomposition to B̂ = Q̂ R̂
One can verify the optimality by performing a second pass
Notice that GKS is optimal when σ_{k+1} = 0, since then vol(R_i) = |v_i| ∏_{j=1}^k σ_j
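The sweep can be sketched compactly as below. For clarity this version re-factors the kept columns with a full QR at each step (O(mk^2) per step) instead of the cheaper O((m + k)k) updating described above; all names are mine:

```python
import numpy as np

def gks_select(A, k):
    """Sweep through the columns of A keeping a 'best' subset of k columns,
    deleting at each step the column flagged by the GKS heuristic."""
    idx = list(range(k))
    Q, R = np.linalg.qr(A[:, :k])
    for j in range(k, A.shape[1]):
        r = Q.T @ A[:, j]
        q = A[:, j] - Q @ r
        rho = np.linalg.norm(q)
        Rp = np.block([[R, r[:, None]],
                       [np.zeros((1, k)), np.array([[rho]])]])
        _, s, Vt = np.linalg.svd(Rp)
        v = Vt[-1]                        # right singular vector of sigma_{k+1}
        i = int(np.argmax(np.abs(v)))     # weakest column: largest |v_i|
        cand = idx + [j]
        del cand[i]
        idx = cand
        Q, R = np.linalg.qr(A[:, idx])    # re-factor the kept columns
    return idx

rng = np.random.default_rng(3)
A = 0.01 * rng.standard_normal((6, 5))
A[0, 0] = A[1, 1] = 10.0                  # columns 0 and 1 clearly dominate
sel = gks_select(A, 2)
```

On this example the heuristic recovers the two dominant columns, as the near-optimality discussion above predicts.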
Dominant eigenspace of PSD matrix
Typical applications:
Kernel matrices (machine learning)
Spectral methods (image analysis)
Correlation matrices (statistics and signal processing)
Principal Component Analysis / Karhunen-Loève
We use K_N to denote the full N × N positive definite matrix
Sweeping through K
Suppose a rank-m approximation of the dominant eigenspace of K_n ∈ R^{n×n}, n ≫ m, is known:
K_n ≈ A_n := U_n M_m U_n^T, with M_m ∈ R^{m×m} an SPD matrix and U_n ∈ R^{n×m} with U_n^T U_n = I_m
Obtain the (n + 1) × m eigenspace U_{n+1} of the (n + 1) × (n + 1) kernel matrix K_{n+1}:
K_{n+1} = [K_n a; a^T b] ≈ Ũ_{n+1,m+2} M̃_{m+2} Ũ_{n+1,m+2}^T
Downdate M̃_{m+2} to get back to rank m
Downsize Ũ_{n+1} to get back to size n
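One full update step (expand with a new row/column of K, then downdate back to rank m) can be sketched via the small (m+2) × (m+2) eigenproblem; the cheap Cholesky-based computation comes later. A numpy sketch with names of my own choosing:

```python
import numpy as np

def psd_update(U, M, a, b):
    """Append row/column [a; b] to U M U^T and restore a rank-m
    factorization by truncating the (m+2) x (m+2) middle eigenproblem."""
    n, m = U.shape
    r = U.T @ a
    q = a - U @ r
    rho = np.linalg.norm(q)
    u = q / rho
    Ut = np.zeros((n + 1, m + 2))        # expanded basis [U u 0; 0 0 1]
    Ut[:n, :m] = U
    Ut[:n, m] = u
    Ut[n, m + 1] = 1.0
    Mt = np.zeros((m + 2, m + 2))        # bordered middle matrix
    Mt[:m, :m] = M
    Mt[:m, m + 1] = Mt[m + 1, :m] = r
    Mt[m, m + 1] = Mt[m + 1, m] = rho
    Mt[m + 1, m + 1] = b
    lam, Qe = np.linalg.eigh(Mt)
    keep = np.argsort(lam)[::-1][:m]     # m largest eigenvalues
    return Ut @ Qe[:, keep], np.diag(lam[keep])

rng = np.random.default_rng(4)
W = rng.standard_normal((6, 6))
K = W @ W.T + np.eye(6)                  # generic SPD matrix
m = 3
lam, Q = np.linalg.eigh(K[:5, :5])
U, M = Q[:, -m:], np.diag(lam[-m:])      # rank-3 eigenspace of K_5
a, b = K[:5, 5], K[5, 5]
U6, M6 = psd_update(U, M, a, b)
```

By construction the result is the best rank-m eigenvalue truncation of the bordered matrix [A_n a; a^T b].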
Sweeping through K We show only the columns and rows of U and K that are involved Window size n = 5, rank k = 4
Steps 1-16 (figures omitted): start with the leading n × n subproblem, then repeatedly expand, downdate and downsize
Downdating K to fixed rank m
Suppose a rank-m approximation of the dominant eigenspace of K_n ∈ R^{n×n}, n ≫ m, is known:
K_n ≈ A_n := U_n M_m U_n^T, with M_m ∈ R^{m×m} an SPD matrix and U_n ∈ R^{n×m} with U_n^T U_n = I_m
Obtain the (n + 1) × m eigenspace U_{n+1} of the (n + 1) × (n + 1) kernel matrix
K_{n+1} = [K_n a; a^T b] ≈ Ũ_{n+1,m+2} M̃_{m+2} Ũ_{n+1,m+2}^T
Downdate M̃_{m+2} to delete the "smallest" two eigenvalues
Updating: Proposed algorithm
Since a = U_n U_n^T a + (I_n − U_n U_n^T) a = U_n r + ρ u, with q = (I_n − U_n U_n^T) a, ρ = ‖q‖_2, u = q/ρ, we can write
A_{n+1} = [A_n a; a^T b] = [U_n u 0; 0 0 1] [M_m 0 r; 0 0 ρ; r^T ρ b] [U_n u 0; 0 0 1]^T = Ũ_{n+1} M̃_{m+2} Ũ_{n+1}^T
Updating: Proposed algorithm
Let M_m = Q_m Λ_m Q_m^T, where Λ_m = diag(μ_1, μ_2, …, μ_m), μ_1 ≥ ⋯ ≥ μ_m > 0, Q_m^T Q_m = I_m
Let M̃_{m+2} = Q̃_{m+2} Λ̃_{m+2} Q̃_{m+2}^T, where Λ̃_{m+2} = diag(λ̃_1, λ̃_2, …, λ̃_{m+1}, λ̃_{m+2}), Q̃_{m+2}^T Q̃_{m+2} = I_{m+2}
By the interlacing property, we have
λ̃_1 ≥ μ_1 ≥ λ̃_2 ≥ μ_2 ≥ ⋯ ≥ μ_m ≥ λ̃_{m+1} ≥ 0 ≥ λ̃_{m+2}
Updating: Proposed algorithm
We want a simple orthogonal transformation H such that
H [v_{m+1} v_{m+2}] = [0; I_2],  H M̃_{m+2} H^T = [M̂_m 0 0; 0 λ̃_{m+1} 0; 0 0 λ̃_{m+2}]
with M̂_m ∈ R^{m×m}
Therefore
A_{n+1} = Ũ_{n+1} H^T diag(M̂_m, λ̃_{m+1}, λ̃_{m+2}) H Ũ_{n+1}^T
and the new updated decomposition is given by
Â_{n+1} = U_{n+1} M̂_m U_{n+1}^T, with U_{n+1} given by the first m columns of Ũ_{n+1} H^T
Cholesky factorization
It is not needed to compute the whole spectral decomposition of the matrix M̃_{m+2}
To compute H only the eigenvectors v_{m+1}, v_{m+2} corresponding to λ̃_{m+1}, λ̃_{m+2} are needed
To compute these vectors cheaply, one needs to maintain (and update) the Cholesky factorization M_m = L_m L_m^T
The eigenvectors are then obtained via inverse iteration, and H can then be computed as a product of Householder or Givens transformations
Updating Cholesky
M̃_{m+2} = [M_m 0 r; 0^T 0 ρ; r^T ρ b] = [L_m L_m^T 0 r; 0^T 0 ρ; r^T ρ b]
         = [L_m 0; X I_2] [I_m 0; 0 S_c] [L_m^T X^T; 0 I_2] = L̃_{m+2} D̃_{m+2} L̃_{m+2}^T,
where X = [0 t]^T, t = L_m^{-1} r and S_c = [0 ρ; ρ b − t^T t]
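The bordered factorization above can be checked numerically; here is a verification sketch in numpy (the matrices M, r, ρ, b are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4
G = rng.standard_normal((m, m))
M = G @ G.T + m * np.eye(m)              # SPD middle matrix M_m
L = np.linalg.cholesky(M)                # M = L L^T
r = rng.standard_normal(m)
rho, b = 0.7, 2.5

# bordered matrix M~ = [[M, 0, r], [0, 0, rho], [r^T, rho, b]]
Mt = np.zeros((m + 2, m + 2))
Mt[:m, :m] = M
Mt[:m, m + 1] = Mt[m + 1, :m] = r
Mt[m, m + 1] = Mt[m + 1, m] = rho
Mt[m + 1, m + 1] = b

t = np.linalg.solve(L, r)                # t = L^{-1} r
Sc = np.array([[0.0, rho], [rho, b - t @ t]])
Lt = np.eye(m + 2)
Lt[:m, :m] = L
Lt[m + 1, :m] = t                        # the block X = [0 t]^T
Dt = np.eye(m + 2)
Dt[m:, m:] = Sc
residual = np.abs(Lt @ Dt @ Lt.T - Mt).max()
```

The residual is at rounding level, confirming that only the triangular solve t = L_m^{-1} r and the 2 × 2 block S_c are needed to update the factorization.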
step,,
step 2,,
step 3,,
step 4,,
step 5,,
step 6,,
step 7,,
step 8,,
step 9,,
step 9,,
step 0,,
step,,
step 2,,
step 3,,
step 4,,
step 5,,
step 6,,
step 7,,
Downsizing algorithm
Let H_v be orthogonal such that H_v v = υ e_1, υ = ‖v‖_2, where U_{n+1} = [v^T; V]
Then
U_{n+1} H_v^T = [υ 0 ⋯ 0; V H_v^T]
To retrieve the orthonormality of V H_v^T, it is sufficient to divide its first column by √(1 − υ^2) and therefore to multiply the first column and row of M̂_m by the same quantity
If the matrix M̂_m is factored as L_m L_m^T, this reduces to multiplying the first entry of L_m by √(1 − υ^2)
Any row of U_{n+1} can be chosen to be removed this way
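A numpy sketch of the downsizing step, removing the first row with a single Householder reflector (names mine; the generic case v ≠ υ e_1 is assumed):

```python
import numpy as np

def downsize(U, M):
    """Remove the first row of U ((n+1) x m) and restore orthonormality,
    compensating in the middle matrix M so that U M U^T is preserved
    on the remaining rows/columns."""
    v = U[0]
    ups = np.linalg.norm(v)
    w = v.copy()
    w[0] -= ups                          # Householder: H v = ups * e_1
    H = np.eye(len(v)) - 2.0 * np.outer(w, w) / (w @ w)
    UH = U @ H                           # first row becomes [ups, 0, ..., 0]
    V = UH[1:].copy()                    # drop the first row
    s = np.sqrt(1.0 - ups**2)
    V[:, 0] /= s                         # restore unit norm of first column
    Mh = H @ M @ H                       # H is symmetric orthogonal
    Mh[0, :] *= s                        # compensate the rescaling of V
    Mh[:, 0] *= s
    return V, Mh

rng = np.random.default_rng(6)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))
M = np.diag([5.0, 3.0, 1.0])
A = Q @ M @ Q.T
V, Mh = downsize(Q, M)
```

On the remaining rows the product is reproduced exactly: V Mh V^T equals A with its first row and column deleted.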
Accuracy bounds
When there is no downsizing, one shows that
‖K_n − A_n‖_F^2 ≤ η_n := Σ_{i=m+1}^{ñ} (λ_i^{(ñ)})^2 + Σ_{i=ñ+1}^{n} ((δ_i^{(+)})^2 + (δ_i^{(−)})^2),
‖K_n − A_n‖_2 ≤ ζ_n := λ_{m+1}^{(ñ)} + Σ_{i=ñ+1}^{n} max{δ_i^{(+)}, δ_i^{(−)}},
where A_n := U_n M_m U_n^T, δ_i^{(+)} = λ̃_{m+1}^{(i)}, δ_i^{(−)} = λ̃_{m+2}^{(i)}, and
‖K_ñ − A_ñ‖_F^2 = Σ_{i=m+1}^{ñ} (λ_i^{(ñ)})^2,  ‖K_ñ − A_ñ‖_2 = λ_{m+1}^{(ñ)}
Accuracy bounds
In relation to the original spectrum of K_N one obtains the approximate bounds
Σ_{i=m+1}^{N} λ_i^2 ≤ ‖K_N − A_N‖_F^2 ≲ (N − m) λ_{m+1}^2,  λ_{m+1} ≤ ‖K_N − A_N‖_2 ≲ c λ_{m+1}
When downsizing the matrix as well, there are no guaranteed bounds
Example 1 (no downsizing)
The matrix considered in this example is a kernel matrix constructed from the Abalone benchmark data set (http://archive.ics.uci.edu/ml/support/abalone), with radial basis kernel function
k(x, y) = exp(−‖x − y‖_2^2 / 100)
This data set has 4177 training instances
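The kernel matrix itself is cheap to form. A vectorized sketch, with the bandwidth 100 taken from the formula above (the data here is a random stand-in for the Abalone instances, not the actual data set):

```python
import numpy as np

def rbf_kernel(X, sigma2=100.0):
    """K[i, j] = exp(-||x_i - x_j||_2^2 / sigma2) for the rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma2)

rng = np.random.default_rng(7)
X = rng.standard_normal((50, 8))     # random stand-in for the Abalone data
K = rbf_kernel(X)
```

The resulting matrix is symmetric positive semidefinite with unit diagonal, which is exactly the setting of the updating algorithm above.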
Figure: Distribution of the largest 100 eigenvalues of the Abalone matrix in logarithmic scale
Table: Largest 9 eigenvalues of the Abalone matrix K_N (second column), of A_N obtained with updating with ñ = 500, m = 9 (third column) and with ñ = 500, m = 20 (fourth column), respectively
(the tabulated values, corrupted in this extraction, agree in their leading digits across the three columns)
Figure: Plot of the sequences δ_n^{(+)} (blue line), δ_n^{(−)} (green line), η_n (red solid line), λ_{m+1} (cyan solid line) and ‖K_N − A_N‖_F (magenta solid line)
Table: Angles between the eigenvectors corresponding to the largest 9 eigenvalues of K_N, computed by the function eigs of MATLAB, and those computed by the proposed algorithm for ñ = 500, m = 9 (second column) and ñ = 500, m = 20 (third column)
(tabulated angles, corrupted in this extraction, range from order 1e-8 to order 1e-4)
Example 2 (downsizing)
The following matrix has rank 3:
F(i, j) = Σ_{k=1}^{3} exp(−((i − μ_k)^2 + (j − μ_k)^2) / (2σ_k)),  i, j = 1, …, 100,
with μ = [4 8 76], σ = [10 20 5]
Let F = QΛQ^T be its spectral decomposition, let R ∈ R^{100×100} be a matrix of random numbers generated by the matlab function randn, and define Δ = (R + R^T)/2
For this example, the considered SPD matrix is K_N = F + εΔ, ε = 1.0e−5
Example 2
Figure: Graph of the size of the entries of the matrix K_N
Example 2
Figure: Distribution of the eigenvalues of the matrix K_N in logarithmic scale
Example 2
Figure: Plot of the three dominant eigenvectors of K_N
Example 2
Figure: Plot of the sequences δ_n^{(+)} (blue dash-dotted line), δ_k^{(−)} (green dotted line), η_n (red solid line), λ_{m+1} (cyan solid line) and ‖K_N − A_N‖_F (black solid line)
Example 2
Table: Largest three eigenvalues of the matrix K_N (first column); largest three eigenvalues computed with the downdating procedure with minimal norm, with m = 3 and ñ = 30, 40, 50 (second, third and fourth column), respectively; largest three eigenvalues of K_N computed with the former downdating procedure (fifth column)
(tabulated values corrupted in this extraction)
Conclusions
A fast algorithm to compute incrementally the dominant eigenspace of a positive definite matrix
Improvement on: Hoegaerts, L., De Lathauwer, L., Goethals, I., Suykens, J.A.K., Vandewalle, J., & De Moor, B., Efficiently updating and tracking the dominant kernel principal components, Neural Networks, 20, 220-229, 2007
The overall complexity of the incremental updating technique to compute an N × m basis matrix U_N for the dominant eigenspace of K_N is reduced from (m + 4)N^2 m + O(Nm^3) to 6N^2 m + O(Nm^2)
When using both incremental updating and downsizing to compute the dominant eigenspace of K̂_ñ (an ñ × ñ principal submatrix of K_N), the complexity is reduced from (2m + 4)Nñm + O(Nm^3) to 6Nñm + O(Nm^2)
This is in both cases essentially a reduction by a factor m
References
Gu, Eisenstat, An efficient algorithm for computing a strong rank revealing QR factorization, SIAM J. Sci. Comput., 1996
Chahlaoui, Gallivan, Van Dooren, An incremental method for computing dominant singular subspaces, SIMAX, 2001
Hoegaerts, De Lathauwer, Goethals, Suykens, Vandewalle, De Moor, Efficiently updating and tracking the dominant kernel principal components, Neural Networks, 2007
Mastronardi, Tyrtyshnikov, Van Dooren, A fast algorithm for updating and downsizing the dominant kernel principal components, SIMAX, 2010
Baker, Gallivan, Van Dooren, Low-rank incremental methods for computing dominant singular subspaces, submitted, 2010
Ipsen, Van Dooren, Polynomial Time Subset Selection Via Updating, in preparation, 2010