Dominant feature extraction

1 Dominant feature extraction Francqui Lecture Paul Van Dooren Université catholique de Louvain CESAME, Louvain-la-Neuve, Belgium

2 Goal of this lecture Develop basic ideas for large scale dense matrices: recursive procedures for the dominant singular subspace, multipass iteration, subset selection, the dominant eigenspace of a positive definite matrix, and possible extensions, which are all based on solving cheap subproblems. Show accuracy and complexity results.

3 Dominant singular subspaces Given $A \in \mathbb{R}^{m\times n}$, approximate it by a rank-$k$ factorization $B_{m\times k}\,C_{k\times n}$ by solving $\min \|A - BC\|_2$, with $k \ll m, n$. This has several applications in image compression, information retrieval and model reduction (POD).
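
For reference, the best rank-$k$ approximation in the 2-norm is given by the truncated SVD; a minimal MATLAB sketch (the data matrix and the rank are placeholders, and this dense computation is exactly what the recursive procedure below avoids):

    % Best rank-k approximation A ~ B*C via a dense SVD (O(m n^2) work)
    A = randn(200,50);                 % placeholder data matrix
    k = 10;
    [U,S,V] = svd(A,'econ');
    B = U(:,1:k)*S(1:k,1:k);           % m x k factor
    C = V(:,1:k)';                     % k x n factor
    err = norm(A - B*C);               % equals sigma_{k+1}(A)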

4 Information retrieval Low memory requirement: $O(k(m+n))$ Fast queries: $Ax \approx L(Ux)$ in $O(k(m+n))$ time Easy to obtain: $O(kmn)$ flops
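
Continuing the sketch above, the point of the factored form is that a query costs $O(k(m+n))$ instead of $O(mn)$ flops:

    x = randn(size(A,2),1);            % a query vector
    y_full = A*x;                      % O(mn) flops
    y_fact = B*(C*x);                  % O(k(m+n)) flops; the parentheses matter
    norm(y_full - y_fact)/norm(y_full) % small whenever A is close to rank k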

5 Proper Orthogonal Decomposition (POD) Compute a state trajectory for one "typical" input Collect the principal directions to project on

6 Recursivity We pass once over the data with a window of length $k$ and perform along the way a set of windowed SVDs of dimension $m \times (k+l)$ Step 1: expand by appending $l$ columns (Gram-Schmidt) Step 2: contract by deleting the $l$ least important columns (SVD)

7 Expansion (G-S) Append column $a_+$ to the current approximation $URV^T$ to get
$$\begin{bmatrix} URV^T & a_+ \end{bmatrix} = \begin{bmatrix} U & a_+ \end{bmatrix}\begin{bmatrix} R & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} V^T & 0 \\ 0 & 1 \end{bmatrix}.$$
Update with Gram-Schmidt to recover a new decomposition $\hat U \hat R \hat V^T$, using $\hat r = U^T a_+$, $\hat a = a_+ - U\hat r$, $\hat a = \hat u\hat\rho$ (since $a_+ = U\hat r + \hat u\hat\rho$).

8 Contraction (SVD) Now remove the $l$ smallest singular values of this new $\hat U\hat R\hat V^T$ via
$$\hat U\hat R\hat V^T = (\hat U G_u)(G_u^T\hat R G_v)(G_v^T\hat V^T),$$
where $G_u^T\hat R G_v$ is diagonal, and keep $U_+R_+V_+^T$ as best approximation of $\hat U\hat R\hat V^T$ (just delete the $l$ smallest singular values).

9 Complexity of one pair of steps The Gram-Schmidt update (expansion) requires $4mk$ flops per column (essentially for the products $\hat r = U^T a_+$ and $\hat a = a_+ - U\hat r$). For $G_u^T\hat R G_v = \begin{bmatrix} R_+ & 0\\ 0 & \mu_i\end{bmatrix}$ one requires the left and right singular vectors of $\hat R$, which can be obtained in $O(k^2)$ flops per singular value (using inverse iteration). Multiplying $\hat U G_u$ and $\hat V G_v$ requires $4mk$ flops per deflated column. The overall procedure requires $8mk$ flops per processed column and hence $8mnk$ flops for a rank-$k$ approximation of an $m\times n$ matrix $A$. One shows that $A = U\begin{bmatrix} R & A_{12}\\ 0 & A_{22}\end{bmatrix}V^T$ where $\|A_{22}\|_F$ is known.
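
A minimal MATLAB sketch of this one-pass procedure with window width $k+1$ (i.e. $l = 1$); the function name incsvd is a placeholder, and the contraction uses MATLAB's svd of the small $(k+1)\times(k+1)$ middle factor instead of the $O(k^2)$-per-value inverse iteration mentioned above:

    function [U,R,V] = incsvd(A,k)
    % One pass of low-rank incremental SVD: returns A ~ U*R*V' with U of size m x k.
    n = size(A,2);
    [U,R] = qr(A(:,1:k),0);                % initialize with the first k columns
    V = eye(k);
    for j = k+1:n
        a   = A(:,j);
        r   = U'*a;                        % expansion: Gram-Schmidt coefficients
        ah  = a - U*r;
        rho = norm(ah);
        Ub  = [U, ah/max(rho,eps)];        % appended orthonormal direction
        Rb  = [R, r; zeros(1,k), rho];     % (k+1) x (k+1) middle factor
        Vb  = blkdiag(V,1);
        [Gu,S,Gv] = svd(Rb);               % contraction: small SVD of the window
        U = Ub*Gu(:,1:k);                  % keep the k dominant directions
        R = S(1:k,1:k);
        V = Vb*Gv(:,1:k);
    end
    end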

10 Error estimates Let $E := A - \hat A = U\Sigma V^T - \hat U\hat\Sigma\hat V^T$ and $\mu := \|E\|_2$. Let $\hat\mu := \max_i \mu_i$, where $\mu_i$ is the neglected singular value at step $i$. One shows that the error norm satisfies
$$\hat\mu \le \sigma_{k+1} \le \mu \le \sqrt{n-k}\,\hat\mu \le c\,\hat\mu, \qquad \hat\sigma_i \le \sigma_i \le \hat\sigma_i + \hat\mu^2/2\hat\sigma_i,$$
$$\tan\theta_k \le \tan\hat\theta_k := \hat\mu^2/(\hat\sigma_k^2 - \hat\mu^2), \qquad \tan\phi_k \le \tan\hat\phi_k := \hat\mu\hat\sigma_1/(\hat\sigma_k^2 - \hat\mu^2),$$
where $\theta_k$, $\phi_k$ are the canonical angles of dimension $k$ between the exact and the computed left and right dominant singular subspaces ($U$ vs. $\hat U$ and $V$ vs. $\hat V$, respectively).

11 Examples The bounds get much better when the gap between $\sigma_k$ and $\sigma_{k+1}$ is large

12 Convergence How quickly do we track the subspaces? How $\cos\theta_k^{(i)}$ evolves with the time step $i$

13 Example Find the dominant behavior in an image sequence Images can have up to $10^6$ pixels

14 Multipass iteration Low-rank incremental SVD can be applied in several passes, say to $\begin{bmatrix} A & A & \cdots & A\end{bmatrix}$ (the matrix $A$ repeated). After the first block (or "pass") a good approximation of the dominant space $\hat U$ has already been constructed. Going over to the next block (second "pass") will improve it, etc. Theorem: Convergence of the multipass method is linear, with approximate ratio of convergence $\psi/\kappa^2 < 1$, where $\psi$ measures the orthogonality of the residual columns of $A$ and $\kappa$ is the ratio $\sigma_k/\sigma_{k+1}$ of $A$.
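
A possible multipass driver is sketched below; it repeats the expand/contract step of the incsvd sketch above over the columns of A, tracking only the left factor U and the middle factor R (a simplification: the right factor of the concatenated matrix [A A ... A] is not stored, and the theorem's convergence ratio is not computed):

    A = randn(500,200);  k = 10;  passes = 3;
    [U,R] = qr(A(:,1:k),0);                    % initial guess from the first k columns
    for p = 1:passes
        for j = 1:size(A,2)
            a   = A(:,j);  r = U'*a;  ah = a - U*r;  rho = norm(ah);
            Ub  = [U, ah/max(rho,eps)];        % expand (Gram-Schmidt)
            Rb  = [R, r; zeros(1,k), rho];
            [Gu,S,~] = svd(Rb);                % contract (small SVD)
            U = Ub*Gu(:,1:k);  R = S(1:k,1:k);
        end
        fprintf('pass %d: estimated sigma_k = %.3e\n', p, R(k,k));
    end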

15 Convergence behavior for increasing gap between "signal" and "noise" (figure: error versus number of INCSVD steps)

16 Convergence behavior for increasing orthogonality between residual vectors (figure: error versus number of INCSVD steps)

17 Eigenfaces analysis Ten dominant left singular vectors of the ORL Database of Faces (400 images, 40 subjects, $92\times 112$ pixels per image, i.e. a $10304\times 400$ matrix). Using the MATLAB SVD function versus using one pass of incremental SVD. Maximal angle: 63, maximum relative error in singular values: 48%

18 Conclusions Incremental SVD A useful and economical SVD approximation of $A_{m\times n}$ For matrices with columns that are very large or "arrive" with time Complexity is proportional to $mnk$ and the number of "passes" Algorithms due to [1] Manjunath-Chandrasekaran-Wang (95), [2] Levy-Lindenbaum (00), [3] Chahlaoui-Gallivan-Van Dooren (01), [4] Brand (03), [5] Baker-Gallivan-Van Dooren (09) Convergence analysis and accuracy in refs [3], [4], [5]

19 Subset selection We want a good "approximation" of $A_{m\times n}$ by a product $B_{m\times k}P^T$ where $P_{n\times k}$ is a "selection matrix", i.e. a submatrix of the identity $I_n$. This seems connected to $\min\|A - BP^T\|_2$ and maybe similar techniques can be used as for incremental SVD. Clearly, if $B = AP$, we just select a subset of the columns of $A$. Rather than minimizing $\|A - BP^T\|_2$ we maximize $\mathrm{vol}(B)$ where
$$\mathrm{vol}(B) = \det(B^TB)^{1/2} = \prod_{i=1}^{k}\sigma_i(B), \qquad k \le m.$$
There are $\binom{n}{k}$ possible choices, the problem is NP-hard, and there is no polynomial time approximation algorithm.
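
For concreteness, the volume of a candidate column block B is just the product of its singular values; a two-line MATLAB check (B is a placeholder):

    B = randn(100,5);                  % placeholder: k selected columns of A
    vol  = prod(svd(B));               % vol(B) = prod_i sigma_i(B)
    vol2 = sqrt(det(B'*B));            % same value (up to rounding) when B has full column rank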

20 Heuristics Gu-Eisenstat show that the Strong Rank Revealing QR factorization (SRRQR) solves the following simpler problem: $B$ is sub-optimal if there is no swapping of a single column of $A$ (yielding $\hat B$) that has a larger volume (constrained minimum). Here, we propose a simpler recursive "updating" algorithm that has complexity $O(mnk)$ rather than $O(mn^2)$ for Gu-Eisenstat. The idea is again based on a sliding window of size $k+1$ (or $k+l$): sweep through the columns of $A$ while maintaining a "best" subset $B$; append a column of $A$ to $B$, yielding $B_+$; contract $B_+$ to $\hat B$ by deleting the "weakest" column of $B_+$.

21 Deleting the weakest column Let $B = A(:,1\!:\!k)$ to start with and let $B = QR$ where $R$ is $k\times k$. Append the next column $a_+$ of $A$ to form $B_+$ and update its decomposition using Gram-Schmidt:
$$B_+ := \begin{bmatrix} QR & a_+\end{bmatrix} = \begin{bmatrix} Q & a_+\end{bmatrix}\begin{bmatrix} R & 0\\ 0 & 1\end{bmatrix} = \begin{bmatrix} Q & \hat q\end{bmatrix}\begin{bmatrix} R & \hat r\\ 0 & \hat\rho\end{bmatrix} = Q_+R_+$$
with $\hat r = Q^Ta_+$, $\hat a = a_+ - Q\hat r$, $\hat a = \hat q\hat\rho$ (since $a_+ = Q\hat r + \hat q\hat\rho$). Contract $B_+$ to $\hat B$ by deleting the "weakest" column of $R_+$. This can be done in $O(mk^2)$ using Gu-Eisenstat's SRRQR method, but an even simpler heuristic uses only $O((m+k)k)$ flops.

22 Golub-Klema-Stewart heuristic Let $R_+v = \sigma_{k+1}u$ be the singular vector pair corresponding to the smallest singular value $\sigma_{k+1}$ of $R_+$ and let $v_i$ be the components of $v$. Let $R_i$ be the submatrix obtained by deleting column $i$ from $R_+$; then
$$\frac{\sigma_{k+1}^2}{\sigma_1^2} + \Big(1-\frac{\sigma_{k+1}^2}{\sigma_1^2}\Big)v_i^2 \;\le\; \frac{\mathrm{vol}^2(R_i)}{\prod_{j=1}^{k}\sigma_j^2} \;\le\; \frac{\sigma_{k+1}^2}{\sigma_k^2} + \Big(1-\frac{\sigma_{k+1}^2}{\sigma_k^2}\Big)v_i^2.$$
Maximizing $|v_i|$ thus maximizes a lower bound on $\mathrm{vol}^2(R_i)$. In practice this is almost always optimal, and guaranteed to be so if
$$\frac{\sigma_{k+1}^2}{\sigma_k^2} + \Big(1-\frac{\sigma_{k+1}^2}{\sigma_k^2}\Big)v_j^2 \;\le\; \frac{\sigma_{k+1}^2}{\sigma_1^2} + \Big(1-\frac{\sigma_{k+1}^2}{\sigma_1^2}\Big)v_i^2 \qquad \forall\, j\ne i.$$

23 GKS method Start with $B = A(:,1\!:\!k) = QR$ where $R$ is $k\times k$. For $j = k+1:n$: append column $a_+ := A(:,j)$ to get $B_+$; update its QR decomposition to $B_+ = Q_+R_+$; contract $B_+$ to yield a new $\hat B$ using the GKS heuristic; update its QR decomposition to $\hat B = \hat Q\hat R$. One can verify the optimality by performing a second pass. Notice that GKS is optimal when $\sigma_{k+1} = 0$ since then $\mathrm{vol}(R_i) = |v_i|\prod_{j=1}^{k}\sigma_j$.
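
A minimal MATLAB sketch of this sweep (the name gks_select is a placeholder; for clarity the QR factorization of the window is recomputed at every step and the weakest column is found with a small SVD, rather than with the O((m+k)k) updating described above):

    function idx = gks_select(A,k)
    % Sweep through the columns of A, keeping a subset idx of k columns
    % with (heuristically) large volume.
    n   = size(A,2);
    idx = 1:k;                          % start with the leading k columns
    for j = k+1:n
        cand    = [idx, j];             % append the next column
        [~,Rp]  = qr(A(:,cand),0);      % (k+1) x (k+1) triangular factor
        [~,~,W] = svd(Rp);
        v       = W(:,end);             % right singular vector of sigma_{k+1}
        [~,iout] = max(abs(v));         % GKS: delete the column with largest |v_i|
        cand(iout) = [];
        idx = cand;
    end
    end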

24 Dominant eigenspace of PSD matrix Typical applications: Kernel Matrices (Machine Learning), Spectral Methods (Image Analysis), Correlation Matrices (Statistics and Signal Processing), Principal Component Analysis, Karhunen-Loève. We use $K_N$ to denote the full $N\times N$ positive definite matrix.

25-28 Sweeping through K Suppose a rank-$m$ approximation of the dominant eigenspace of $K_n \in \mathbb{R}^{n\times n}$, $n \gg m$, is known: $K_n \approx A_n := U_nM_mU_n^T$, with $M_m \in \mathbb{R}^{m\times m}$ an SPD matrix and $U_n \in \mathbb{R}^{n\times m}$ with $U_n^TU_n = I_m$. Obtain the $(n+1)\times m$ eigenspace $U_{n+1}$ of the $(n+1)\times(n+1)$ kernel matrix $K_{n+1}$:
$$K_{n+1} = \begin{bmatrix} K_n & a\\ a^T & b\end{bmatrix} \approx \tilde U_{n+1,m+2}\,\tilde M_{m+2}\,\tilde U_{n+1,m+2}^T.$$
Downdate $\tilde M_{m+2}$ to get back to rank $m$. Downsize $\tilde U_{n+1}$ to get back to size $n$.

29 Sweeping through K We show only the columns and rows of $U$ and $K$ that are involved. Window size $n = 5$, rank $k = 4$.

30-45 Animation of the sweep (figure frames only): step 1 starts with the leading $n\times n$ subproblem, and each subsequent group of steps expands, then downdates and downsizes, as the window slides through $K$.

46 Downdating K to fixed rank m Suppose a rank-$m$ approximation of the dominant eigenspace of $K_n \in \mathbb{R}^{n\times n}$, $n \gg m$, is known: $K_n \approx A_n := U_nM_mU_n^T$, with $M_m \in \mathbb{R}^{m\times m}$ an SPD matrix and $U_n \in \mathbb{R}^{n\times m}$ with $U_n^TU_n = I_m$. Obtain the $(n+1)\times m$ eigenspace $U_{n+1}$ of the $(n+1)\times(n+1)$ kernel matrix $K_{n+1}$:
$$K_{n+1} = \begin{bmatrix} K_n & a\\ a^T & b\end{bmatrix} \approx \tilde U_{n+1,m+2}\,\tilde M_{m+2}\,\tilde U_{n+1,m+2}^T.$$
Downdate $\tilde M_{m+2}$ to delete the "smallest" two eigenvalues.

47 Updating: Proposed algorithm Since $a = U_nU_n^Ta + (I_n - U_nU_n^T)a = U_nr + \rho u$, with $q = (I_n - U_nU_n^T)a$, $\rho = \|q\|_2$, $u = q/\rho$, we can write
$$A_{n+1} = \begin{bmatrix} A_n & a\\ a^T & b\end{bmatrix} = \begin{bmatrix} U_n & u & 0\\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} M_m & 0 & r\\ 0 & 0 & \rho\\ r^T & \rho & b\end{bmatrix}\begin{bmatrix} U_n & u & 0\\ 0 & 0 & 1\end{bmatrix}^T = \tilde U\,\tilde M_{m+2}\,\tilde U^T.$$

48-50 Updating: Proposed algorithm Let $M_m = Q_m\Lambda_mQ_m^T$ where $\Lambda_m = \mathrm{diag}(\mu_1,\mu_2,\dots,\mu_m)$, $\mu_1 \ge \cdots \ge \mu_m > 0$, $Q_m^TQ_m = I_m$. Let $\tilde M_{m+2} = \tilde Q_{m+2}\tilde\Lambda_{m+2}\tilde Q_{m+2}^T$ where $\tilde\Lambda_{m+2} = \mathrm{diag}(\tilde\lambda_1,\tilde\lambda_2,\dots,\tilde\lambda_{m+1},\tilde\lambda_{m+2})$, $\tilde Q_{m+2}^T\tilde Q_{m+2} = I_{m+2}$. By the interlacing property, we have
$$\tilde\lambda_1 \ge \mu_1 \ge \tilde\lambda_2 \ge \mu_2 \ge \cdots \ge \mu_m \ge \tilde\lambda_{m+1} \ge 0 \ge \tilde\lambda_{m+2}.$$

51 Updating: Proposed algorithm We want a simple orthogonal transformation $H$ such that
$$H\begin{bmatrix} \tilde v_{m+1} & \tilde v_{m+2}\end{bmatrix} = \begin{bmatrix} 0\\ I_2\end{bmatrix}, \qquad H\tilde M_{m+2}H^T = \begin{bmatrix} \hat M_m & & \\ & \tilde\lambda_{m+1} & \\ & & \tilde\lambda_{m+2}\end{bmatrix},$$
with $\hat M_m \in \mathbb{R}^{m\times m}$. Therefore
$$A_{n+1} = \begin{bmatrix} U_n & u & 0\\ 0 & 0 & 1\end{bmatrix}H^T\begin{bmatrix} \hat M_m & & \\ & \tilde\lambda_{m+1} & \\ & & \tilde\lambda_{m+2}\end{bmatrix}H\begin{bmatrix} U_n & u & 0\\ 0 & 0 & 1\end{bmatrix}^T$$
and the new updated decomposition is given by $\hat A_{n+1} = U_{n+1}\hat M_mU_{n+1}^T$, with $U_{n+1}$ given by the first $m$ columns of $\begin{bmatrix} U_n & u & 0\\ 0 & 0 & 1\end{bmatrix}H^T$.
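
A minimal MATLAB sketch of one updating step (rank kept at m); for simplicity the deflation is done with a full eigendecomposition of the small (m+2)-by-(m+2) matrix rather than with the Cholesky/inverse-iteration scheme of the next slides, and the test matrix is an arbitrary SPD example:

    % Placeholder setup: current rank-m approximation K_n ~ U*M*U'.
    n = 200;  m = 5;
    K = gallery('lehmer',n+1);              % any SPD test matrix
    [U,M] = eigs(K(1:n,1:n),m);             % U is n x m, M is m x m (diagonal here)
    a = K(1:n,n+1);  b = K(n+1,n+1);        % new column and diagonal entry

    % Update: expand to an (m+2)-dimensional middle factor, then deflate.
    r   = U'*a;
    q   = a - U*r;  rho = norm(q);  u = q/rho;
    Ut  = [U, u, zeros(n,1); zeros(1,m), 0, 1];               % (n+1) x (m+2)
    Mt  = [M, zeros(m,1), r; zeros(1,m), 0, rho; r', rho, b]; % (m+2) x (m+2)
    [Q,L] = eig(Mt);                        % small symmetric eigenproblem
    [~,ord] = sort(diag(L),'descend');
    keep = ord(1:m);                        % drop the two smallest eigenvalues
    U1 = Ut*Q(:,keep);                      % new (n+1) x m eigenspace basis
    M1 = L(keep,keep);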

52-56 Cholesky factorization It is not needed to compute the whole spectral decomposition of the matrix $\tilde M_{m+2}$. To compute $H$, only the eigenvectors $\tilde v_{m+1}$, $\tilde v_{m+2}$ corresponding to $\tilde\lambda_{m+1}$, $\tilde\lambda_{m+2}$ are needed. To compute these vectors cheaply, one needs to maintain (and update) the Cholesky factorization of $M_m = L_mL_m^T$. The eigenvectors are then obtained via inverse iteration, and $H$ can then be computed as a product of Householder or Givens transformations.

57 Updating Cholesky
$$\tilde M_{m+2} = \begin{bmatrix} M_m & 0 & r\\ 0^T & 0 & \rho\\ r^T & \rho & b\end{bmatrix} = \begin{bmatrix} L_mL_m^T & 0 & r\\ 0^T & 0 & \rho\\ r^T & \rho & b\end{bmatrix} = \begin{bmatrix} L_m & 0 & 0\\ 0^T & 1 & 0\\ t^T & 0 & 1\end{bmatrix}\begin{bmatrix} I_m & 0 & 0\\ 0^T & 0 & \rho\\ 0^T & \rho & b - t^Tt\end{bmatrix}\begin{bmatrix} L_m^T & 0 & t\\ 0^T & 1 & 0\\ 0^T & 0 & 1\end{bmatrix} = \tilde L_{m+2}\tilde D_{m+2}\tilde L_{m+2}^T,$$
where $t = L_m^{-1}r$ and $S_c = \begin{bmatrix} 0 & \rho\\ \rho & b - t^Tt\end{bmatrix}$ is the trailing $2\times 2$ block of $\tilde D_{m+2}$.
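
A small MATLAB check of this block factorization, continuing the variables of the previous sketch (M, r, rho, b, Mt); chol gives L_m, and the factors are assembled explicitly only to verify the identity:

    Lm = chol(M,'lower');                   % Cholesky factor of M_m
    t  = Lm\r;                              % forward substitution, O(m^2) flops
    Sc = [0, rho; rho, b - t'*t];           % 2 x 2 Schur complement block
    Lt = [Lm, zeros(m,2); [zeros(1,m); t'], eye(2)];
    Dt = blkdiag(eye(m), Sc);
    norm(Lt*Dt*Lt' - Mt)                    % should be at rounding level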

58 step,,

59 step 2,,

60 step 3,,

61 step 4,,

62 step 5,,

63 step 6,,

64 step 7,,

65 step 8,,

66 step 9,,

67 step 9,,

68 step 0,,

69 step,,

70 step 2,,

71 step 3,,

72 step 4,,

73 step 5,,

74 step 6,,

75 step 7,,

76-80 Downsizing algorithm Partition $U_{n+1} = \begin{bmatrix} v^T\\ V\end{bmatrix}$ and let $H_v$ be orthogonal such that $H_vv = \upsilon\,e_{1,m}$, $\upsilon = \|v\|_2$. Then
$$U_{n+1}H_v = \begin{bmatrix} \upsilon\,e_{1,m}^T\\ VH_v\end{bmatrix}.$$
To retrieve the orthonormality of $VH_v$, it is sufficient to divide its first column by $\sqrt{1-\upsilon^2}$ and therefore multiply the first column and row of $M_m$ by the same quantity. If the matrix $M_m$ is factored as $L_mL_m^T$, this reduces to multiplying the first entry of $L_m$ by $\sqrt{1-\upsilon^2}$. Any row of $U_{n+1}$ can be chosen to be removed this way.
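
A minimal MATLAB sketch of removing the first row, continuing with U1 and M1 from the updating sketch (the Householder reflector H_v is formed explicitly for clarity; it assumes the first row is not already a multiple of e_1 and that upsilon < 1):

    v  = U1(1,:)';                          % row to be removed, as a column vector
    up = norm(v);
    w  = v - up*eye(m,1);                   % Householder vector mapping v -> up*e_1
    Hv = eye(m) - 2*(w*w')/(w'*w);
    Uh = U1*Hv;                             % its first row becomes [up, 0, ..., 0]
    Vh = Uh(2:end,:);                       % downsize: drop the first row
    d  = sqrt(1 - up^2);
    Vh(:,1) = Vh(:,1)/d;                    % restore orthonormal columns
    Mh = Hv'*M1*Hv;                         % transform the middle factor
    Mh(1,:) = Mh(1,:)*d;  Mh(:,1) = Mh(:,1)*d;   % compensate the rescaling
    % The trailing n rows/columns of K_{n+1} are now approximated by Vh*Mh*Vh'.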

81 Accuracy bounds When there is no downsizing, with $A_n := U_nM_mU_n^T$ and $\delta_i^{(+)} := \tilde\lambda_{m+1}^{(i)}$, $\delta_i^{(-)} := \tilde\lambda_{m+2}^{(i)}$ the two eigenvalues neglected at step $i$, one has
$$\|K_{\tilde n} - A_{\tilde n}\|_F^2 = \sum_{i=m+1}^{\tilde n}\big(\lambda_i^{(\tilde n)}\big)^2, \qquad \|K_{\tilde n} - A_{\tilde n}\|_2 = \lambda_{m+1}^{(\tilde n)},$$
$$\|K_n - A_n\|_F^2 \le \eta_n := \sum_{i=m+1}^{\tilde n}\big(\lambda_i^{(\tilde n)}\big)^2 + \sum_{i=\tilde n+1}^{n}\big(\delta_i^{(+)}\big)^2 + \sum_{i=\tilde n+1}^{n}\big(\delta_i^{(-)}\big)^2,$$
$$\|K_n - A_n\|_2 \le \zeta_n := \lambda_{m+1}^{(\tilde n)} + \sum_{i=\tilde n+1}^{n}\max\big\{\delta_i^{(+)}, |\delta_i^{(-)}|\big\}.$$

82 Accuracy bounds In relation to the original spectrum of $K_N$ one obtains the approximate bounds
$$\sum_{i=m+1}^{N}\lambda_i^2 \;\le\; \|K_N - A_N\|_F^2 \;\lesssim\; (N-m)\,\lambda_{m+1}^2 \qquad\text{and}\qquad \lambda_{m+1} \;\le\; \|K_N - A_N\|_2 \;\lesssim\; c\,\lambda_{m+1}.$$
When downsizing the matrix as well, there are no guaranteed bounds.

83-84 Example (no downsizing) The matrix considered in this example is a kernel matrix constructed from the Abalone benchmark data set with radial basis kernel function
$$k(x,y) = \exp\Big(-\frac{\|x-y\|_2^2}{100}\Big).$$
This data set has 4177 training instances.
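
For reference, a small sketch of building such a radial basis kernel matrix from a data matrix X (rows are instances); the data here is a random placeholder rather than the Abalone set, and the bandwidth 100 follows the formula above:

    X  = randn(500,8);                      % placeholder data: 500 instances, 8 features
    G  = X*X';
    D2 = diag(G) + diag(G)' - 2*G;          % squared pairwise distances (implicit expansion)
    K  = exp(-D2/100);                      % K(i,j) = k(x_i,x_j); symmetric positive definite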

85 Figure: Distribution of the largest 100 eigenvalues of the Abalone matrix in logarithmic scale

86 Table: Largest 9 eigenvalues of the Abalone matrix $K_N$ (second column), of $A_N$ obtained with updating with $\tilde n = 500$, $m = 9$ (third column) and with $\tilde n = 500$, $m = 20$ (fourth column), respectively.

87 Figure: Plot of the sequences $\delta_n^{(+)}$ (blue line), $\delta_n^{(-)}$ (green line), $\eta_n$ (red solid line), $\lambda_{m+1}$ (cyan solid line) and $\|K_N - A_N\|_F$ (magenta solid line)

88 Table: Angles between the eigenvectors corresponding to the largest 9 eigenvalues of $K_N$, computed by the MATLAB function eigs, and those computed by the proposed algorithm for $\tilde n = 500$, $m = 9$ (second column) and $\tilde n = 500$, $m = 20$ (third column).

89 Example 2 (downsizing) The following matrix has rank 3:
$$F(i,j) = \sum_{k=1}^{3}\exp\Big(-\frac{(i-\mu_k)^2 + (j-\mu_k)^2}{2\sigma_k}\Big), \qquad i,j = 1,\dots,100,$$
for given parameter vectors $\mu = [\mu_1\ \mu_2\ \mu_3]$ and $\sigma = [\sigma_1\ \sigma_2\ \sigma_3]$. Let $F = Q\Lambda Q^T$ be its spectral decomposition, let $R \in \mathbb{R}^{100\times 100}$ be a matrix of random numbers generated by the MATLAB function randn, and define $\tilde E = (R + R^T)/2$. For this example, the considered SPD matrix is $K_N = F + \varepsilon\tilde E$, with $\varepsilon = 1.0\cdot 10^{-5}$.

90 Example 2 Figure: Graph of the size of the entries of the matrix $K_N$

91 Example 2 Figure: Distribution of the eigenvalues of the matrix $K_N$ in logarithmic scale

92 Example 2 Figure: Plot of the three dominant eigenvectors of $K_N$

93 Example 2 Figure: Plot of the sequences $\delta_n^{(+)}$ (blue dash-dotted line), $\delta_n^{(-)}$ (green dotted line), $\eta_n$ (red solid line), $\lambda_{m+1}$ (cyan solid line) and $\|K_N - A_N\|_F$ (black solid line)

94 Example 2 Table: Largest three eigenvalues of the matrix $K_N$ (first column); largest three eigenvalues computed with the downdating procedure with minimal norm, with $m = 3$ and $\tilde n = 30, 40, 50$ (second, third and fourth column), respectively; largest three eigenvalues of $K_N$ computed with the former downdating procedure (fifth column).

95-99 Conclusions A fast algorithm to compute incrementally the dominant eigenspace of a positive definite matrix. Improvement on Hoegaerts, L., De Lathauwer, L., Goethals, I., Suykens, J.A.K., Vandewalle, J., & De Moor, B., Efficiently updating and tracking the dominant kernel principal components, Neural Networks, 20, 2007. The overall complexity of the incremental updating technique to compute an $N\times m$ basis matrix $U_N$ for the dominant eigenspace of $K_N$ is reduced from $(m+4)N^2m + O(Nm^3)$ to $6N^2m + O(Nm^2)$. When using both incremental updating and downsizing to compute the dominant eigenspace of $\hat K_{\tilde n}$ (an $\tilde n\times\tilde n$ principal submatrix of $K_N$), the complexity is reduced from $(2m+4)N\tilde nm + O(Nm^3)$ to $6N\tilde nm + O(Nm^2)$. This is in both cases essentially a reduction by a factor $m$.

100 References
Gu, Eisenstat, An efficient algorithm for computing a strong rank revealing QR factorization, SIAM SCISC, 1996.
Chahlaoui, Gallivan, Van Dooren, An incremental method for computing dominant singular subspaces, SIMAX, 2001.
Hoegaerts, De Lathauwer, Goethals, Suykens, Vandewalle, De Moor, Efficiently updating and tracking the dominant kernel principal components, Neural Networks, 2007.
Mastronardi, Tyrtyshnikov, Van Dooren, A fast algorithm for updating and downsizing the dominant kernel principal components, SIMAX, 2010.
Baker, Gallivan, Van Dooren, Low-rank incremental methods for computing dominant singular subspaces, submitted, 2010.
Ipsen, Van Dooren, Polynomial Time Subset Selection Via Updating, in preparation, 2010.
