On a Parallel Implementation of the One-Sided Block Jacobi SVD Algorithm

Size: px

Start display at page:

Download "On a Parallel Implementation of the One-Sided Block Jacobi SVD Algorithm"

Timothy Small
5 years ago
Views:

1 Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons On a Parallel Implementaton of the One-Sded Block Jacob SVD Algorthm Gabrel Okša 1, Martn Bečka, 1 Marán Vateršc 2 1 Insttute of Mathematcs Slovak Academy of Scences Bratslava, Slovaka 2 Insttute of Scentfc Computng Unversty of Salzburg Salzburg, Austra Harrachov 2007, Czech Republc, August 19-25, 2007 Gabrel Okša: Jacob SVD 1/16

2 Outlne Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons 1 formulaton 2 One-Sded Block-Jacob Algorthm 3 Acceleratng 4 Parallelzaton 5 Conclusons Gabrel Okša: Jacob SVD 2/16

3 Our task Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Compute n parallel the Sngular Value Decomposton (SVD) of a complex matrx A of the sze m n, m n: ( ) Σ A = U V H., 0 where U(m m) and V (n n) are orthogonal and Σ = dag(σ ) wth σ 1 σ 2 σ n. Numercally stable way of computaton: one- or two-sded block-jacob methods; large degree of parallelsm. Target archtecture: dstrbuted memory machnes (parallel supercomputers and clusters) wth Message Passng Interface (MPI). Gabrel Okša: Jacob SVD 3/16

4 Man structure of Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons A R m n, m n, ts block-column parttonng A = [A 1, A 2,..., A r ], n = (r 1)n 0 + n r, n r n 0. The OSBJA can be wrtten as an teratve process: A (0) = A, V (0) = I n, A (k+1) = A (k) U (k), V (k+1) = V (k) U (k), k 0, (1) where the n n OG matrx U (k) (block rotaton): I U (k) U (k) U (k) = I. (2) U (k) U (k) A (k) U (k) n (1) mutually orthogonalzes the columns between A (k) and A (k). I Gabrel Okša: Jacob SVD 4/16

5 Pvot Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons The orthogonal matrx Û (k) = ( U (k) U (k) U (k) U (k) of order n + n s called the pvot submatrx of U (k) at step k. At each step k of, the pvot par (, ) s chosen accordng to a gven pvot F : {0, 1,...} P r = {(l, m) : 1 l < m r}. Well-known strateges: cyclc row, cyclc column. ) Gabrel Okša: Jacob SVD 5/16

6 3 parts of one (seral) step Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Part 1: The pvot par (, ) s gven, compute the symmetrc, postve sem-defnte cross-product matrx ( (k)t A A (k) A (k)t A (k) ) Â (k) = [A (k) A (k) ] T [A (k) A (k) ] = A (k)t A (k) A (k)t Complexty: m (n + n )(n + n 1)/2 flops. When A (k). dagonal blocks of Â(k) are dagonal, ths reduces to m (n n + n + n ) flops. Part 2: Â(k) s dagonalzed by ts egenvalue decomposton: Û (k)t Â (k) Û (k) (k) = ˆΛ, Complexty: on average 8(n + n ) 3 flops (Har 2005). Gabrel Okša: Jacob SVD 6/16

7 3 parts of one (seral) step (cont.) Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Part 3: Û(k) defnes, after ts parttonng, U (k) n (2), whch s used for updates of A (k) and V (k) n (1). Complexty: 2m(n + n ) 2 flops. Overall complexty: One step of requres: N flop (k) m(n n + n + n ) + 8(n + n ) 3 + 2m(n + n ) 2 flops. = 64n (9n n 0)n (f m = n = n 0 r) If we can choose the block wdth small enough, all computatons of Part 2 can be performed n cache snce Â(k) s of order only n + n. The most complex computaton s n Part 3 (matrx products). How to accelerate t? Gabrel Okša: Jacob SVD 7/16

8 Matrx preprocessng Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Step 1: QR factorzaton wth column pvotng followed by the LQ factorzaton of the R-factor: AP = Q 1 R and R = LQ2 T. The OSBJA s then appled to the L-factor. Advantages: 1 The off-dagonal norm of the L-factor s concentrated near the man dagonal and the Jacob method usng specal orderngs can be much faster and accurate (Har 2005). 2 Rght sngular vectors can be computed a posteror: LV = UΣ, where UΣ s the fnal stage of the Jacob method appled to L (Drmač 1999). V s not updated durng teratons! Then the SVD of A: A = Q 1 LQ2 T PT = (Q 1 U)Σ(PQ 2 V ) T (postprocessng). Complexty: 2mn n 3 flops + work n the Jacob teratons. Gabrel Okša: Jacob SVD 8/16

9 Step 2: Specal ntalzaton of the recurson of three matrces. We have: L = [L 1, L 2,..., L r ]. Compute: 1. for = 1 : r 2. Â (0) = L T L ; 3. Â (0) = Q (0) 4. (A (0) = L Q (0) 5. end; Γ (0) Q (0)T ; (spectral decomposton) ); (not performed, ust the connecton) Intalze three block matrces wth blocks of small orders n, n : B (0) = [B (0) 1, B(0) 2,..., B(0) r ] = [L 1, L 2,..., L r ], Q (0) = dag(q (0) 11, Q(0) 22,..., Q(0) rr ), Γ (0) = dag(γ (0) 1, Γ(0) 2,..., Γ(0) r ) (stored as vector). Complexty: for n = n 0 r about (n/4 + 8n 0 )n 0 n flops (assumng 4 sweeps n step 3).

10 Recurson k k + 1 Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons 1. Compute the cross-product matrx: ( ) Q (k) T ( (k)t B B (k) Â (k) = = ( Γ (k) Ã (k)t Q (k) Ã (k) Γ (k) ) B (k)t B (k), where Ã(k) B (k)t B (k)t Q (k)t B (k) B (k) ) ( (B (k)t Q (k) Q (k) B (k) )Q (k). 2. Compute the egendecomposton: Â(k) = Û(k) ˆΛ(k) Û (k)t. Copy the egenvalues nto approprate postons of Γ (k+1). 3. Compute the CS decomposton: ( ) ( V (k) (k) C Û (k) = V (k) ˆV (k) ˆT (k) Ŵ (k)t. S (k) S (k) C (k) ) ( W (k) W (k) ) T Gabrel Okša: Jacob SVD 10/16 )

11 Recurson k k + 1 (cont.) Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons 4. Compute new matrces B and Q: (B (k+1), B (k+1) ) = (B (k) Q (k+1) (Q (k) V (k) ), B (k) (Q (k) V (k) = W (k)t, Q (k+1) = W (k)t. ) ˆT (k), Man dea: the large dmenson m s totally elmnated from recurson. Matrx multplcatons are of the form XY, where X s of order n n or n n, and Y s square of order n or n. Fnal update of B and B : specal structure of ˆT (k) (rotatons of columns of length n). Overall complexty (Har 2005): N flop (k) 82n (3n n 0)n. Gabrel Okša: Jacob SVD 11/16

12 Stoppng crteron Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Let: Â (k) = A (k)t A (k) = Ã (k) 11 Ã (k) 1r D k = (dag(â(k) )) 1/2 = dag( A (k) e 1,..., A (k) e n ) = (Γ (k) ) 1/2.. Ã (k)t 1r Ã (k) rr Â (k) S = D 1 k Â (k) D 1 k (scaled matrx). Two measures of convergence:, 2 α k off(â(k) S ) = 2 Â(k) S I F, r r ω k off(â(k) ) = Ã(k) t 2 F. =1 t=+1 Computaton of α k : n 3 /2 flops. Gabrel Okša: Jacob SVD 12/16

13 Stoppng crteron (cont.) Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Stoppng crteron: α k n 2 ɛ. At the end of block sweep t + 1: (t+1)n 1 ω(t+1)n 2 = ω2 tn k=tn Ã(k) (k),(k) 2 F, N = r(r 1)/2. Easy update of ω 2, but a severe cancellaton may occur. Assumng the quadratc convergence of ω for t t 0, montor: ν tn (t+1)n 1 Ã(k) (k),(k) 2 F = ω tn + O(ωtN 2 ), t t 0. k=tn When ν tn n γ ((t+1)n) 1 ɛ wth γ ((t+1)n) 1 = max[γ ((t+1)n) ], compute α (t+1)n. Gabrel Okša: Jacob SVD 13/16

14 Parallel mplementaton Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Data layout: p processors, two block columns per processor: r = 2p. Preprocessng: The QRFCP and the LQF va the ScaLAPACK s routne PDGEQPF and PDGELQF, respectvely. Each processor computes ts local cross-product matrx and two spectral decompostons of ts dagonal blocks Â(0) ll. Recurson: 1 Choose some parallel block orderng (e.g., round-robn). 2 Each processor stores two block columns and matrx blocks B, B, Q, Q, and two vectors γ, γ. In the worst case, ths amount of data must be transferred between processors at the begnnng of each parallel teraton step (= r/2 subtasks) accordng to the orderng. 3 All computatons are local (LAPACK), no global communcaton. Gabrel Okša: Jacob SVD 14/16

15 Parallel mplementaton (cont.) Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Stoppng crteron: Its mplementaton requres the global communcaton. Update of ω 2 and ν: Local computaton of the squared Frobenus norm of each nullfed matrx block n each processor, then the global sum of local squares n PE 1, and, fnally, the broadcast of an updated value to all processors: routnes MPI_ALLREDUCE, MPI_ALLGATHER and MPI_BCAST from the ScaLAPACK. Computaton of α: Scalng of the columns and rows of B by the values stored n vector γ, the square of Frobenus norm, the broadcast of new α: routnes MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLGATHER, MPI_BCAST. Gabrel Okša: Jacob SVD 15/16

16 Conclusons Jacob SVD Gabrel Okša formulaton One-Sded Block-Jacob Algorthm Acceleratng Parallelzaton Conclusons Matrx preprocessng: Concentraton of the Frobenus norm towards man dagonal leads to a substantal reducton of the number Jacob sweeps needed for the convergence. Workng wth matrx blocks: Enables to accelerate the computaton even n the seral case usng a fast cache memory. Three-matrx recurson: Elmnates the large dmenson m, all updates can be made n a fast cache memory. Parallelzaton: Addtonal speed-up possble, especally for large n (number of columns). Gabrel Okša: Jacob SVD 16/16

Parallel One-Sided Block Jacobi SVD Algorithm: I. Analysis and Design

Parallel One-Sided Block Jacobi SVD Algorithm: I. Analysis and Design Parallel One-Sded Block Jacob SVD Algorthm: I. Analyss and Desgn Gabrel Okša a Marán Vateršc a Mathematcal Insttute, Department of Informatcs, Slovak Academy of Scences, Bratslava, Slovak Republc Techncal