Outline. Recursive QR factorization. Hybrid Recursive QR factorization. The linear least squares problem. SMP algorithms and implementations



1 Outline
- Recursive QR factorization
- Hybrid recursive QR factorization
- The linear least squares problem
- SMP algorithms and implementations
- Conclusions

2 Recursion
Automatic variable blocking (e.g., QR factorization)
Figure legend: factorization completed; update completed; fits low level in memory hierarchy; fits high level in memory hierarchy.
1. Partition
2. Factor left hand side
3. Update right hand side
4. Factor right hand side

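3 Recursive QR factorization
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$$
1. Divide matrix in two parts (left & right).
2. Factorize left hand side by a recursive call:
   $$\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} = Q_1 \begin{pmatrix} R_{11} \\ 0 \end{pmatrix}$$
   Stopping criterion: if A is a single column, apply a Householder transformation.
3. Update right hand side and factorize by a recursive call:
   $$\begin{pmatrix} R_{12} \\ \tilde{A}_{22} \end{pmatrix} = Q_1^T \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix}, \qquad \tilde{A}_{22} = Q_2 R_{22}$$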

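4 Aggregating Householder transformations: $Q = I - Y T Y^T$
Two elementary transformations:
  Given $Q_1 = I - \tau_1 u_1 u_1^T$ and $Q_2 = I - \tau_2 u_2 u_2^T$,
  then $Q = Q_1 Q_2 = I - Y T Y^T$ with $Y = (u_1, u_2)$ and $T = \begin{pmatrix} \tau_1 & -\tau_1 \tau_2 u_1^T u_2 \\ 0 & \tau_2 \end{pmatrix}$.
One block and one elementary transformation (column by column using Level 2 operations):
  Given $Q_1 = I - Y_1 T_1 Y_1^T$ and $Q_2 = I - \tau_2 u_2 u_2^T$,
  then $Q = Q_1 Q_2 = I - Y T Y^T$ with $Y = (Y_1, u_2)$ and $T = \begin{pmatrix} T_1 & -\tau_2 T_1 Y_1^T u_2 \\ 0 & \tau_2 \end{pmatrix}$.
Two block transformations (recursively, block by block using Level 3 operations):
  Given $Q_1 = I - Y_1 T_1 Y_1^T$ and $Q_2 = I - Y_2 T_2 Y_2^T$,
  then $Q = Q_1 Q_2 = I - Y T Y^T$ with $Y = (Y_1, Y_2)$ and $T = \begin{pmatrix} T_1 & -T_1 Y_1^T Y_2 T_2 \\ 0 & T_2 \end{pmatrix}$.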
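The off-diagonal block of the aggregated triangular factor follows from multiplying out the two block reflectors. The short derivation below is a sketch in the usual compact WY notation; the subscripted names $Y_1, T_1, Y_2, T_2$ are assumed to denote the factors of the two transformations being merged.

```latex
\begin{aligned}
Q_1 Q_2 &= (I - Y_1 T_1 Y_1^T)(I - Y_2 T_2 Y_2^T) \\
        &= I - Y_1 T_1 Y_1^T - Y_2 T_2 Y_2^T + Y_1 \,(T_1 Y_1^T Y_2 T_2)\, Y_2^T \\
        &= I - \begin{pmatrix} Y_1 & Y_2 \end{pmatrix}
               \begin{pmatrix} T_1 & -T_1 Y_1^T Y_2 T_2 \\ 0 & T_2 \end{pmatrix}
               \begin{pmatrix} Y_1^T \\ Y_2^T \end{pmatrix}.
\end{aligned}
```

Setting $Y_2 = u_2$ and $T_2 = \tau_2$ gives the block-plus-elementary case, and additionally setting $Y_1 = u_1$, $T_1 = \tau_1$ gives the case of two elementary transformations.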

5 RGEQR3 - Recursive algorithm for QR factorization
[Y, R, T] = RGEQR3( A(1:m, 1:n) )                 ! In practice, Y and R overwrite A
if (n = 1):
    Compute Householder transformation Q = I - tau u u^T, such that Q^T A = (x, 0)^T
    return (u, x, tau)
else
    let n1 = floor(n/2) and j1 = n1 + 1
    [Y1, R1, T1] = RGEQR3( A(1:m, 1:n1) )                 ! Recursively factor left hand side
    A(1:m, j1:n) = (I - Y1 T1^T Y1^T) A(1:m, j1:n)        ! Update remaining part of A
    [Y2, R2, T2] = RGEQR3( A(j1:m, j1:n) )                ! Recursively factor remaining part of A
    T3 = -T1 (Y1^T Y2) T2
    Let R3 = A(1:n1, j1:n)
    return [Y, R, T], where Y = (Y1, Y2), R = (R1 R3; 0 R2) and T = (T1 T3; 0 T2)
end
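The recursion can be tried out on small matrices with a dependency-free Python sketch. This is an illustration only, not the authors' Fortran routine: it returns new lists instead of overwriting A, assumes m >= n, and the helper names (`matmul`, `transpose`, `householder`, `reconstruct`) are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def householder(x):
    """Return (u, beta, tau) with (I - tau*u*u^T) x = (beta, 0, ..., 0)^T, u[0] = 1."""
    alpha, norm = x[0], math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0, 0.0
    beta = -math.copysign(norm, alpha)
    return [1.0] + [v / (alpha - beta) for v in x[1:]], beta, (beta - alpha) / beta

def rgeqr3(A):
    """Recursive QR of an m x n matrix (m >= n): A = (I - Y T Y^T) [R; 0]."""
    n = len(A[0])
    if n == 1:  # stopping criterion: a single column
        u, beta, tau = householder([row[0] for row in A])
        return [[v] for v in u], [[beta]], [[tau]]
    n1 = n // 2
    left, right = [row[:n1] for row in A], [row[n1:] for row in A]
    Y1, R1, T1 = rgeqr3(left)                       # factor left hand side
    # right <- (I - Y1 T1^T Y1^T) right, i.e. apply Q1^T
    W = matmul(Y1, matmul(transpose(T1), matmul(transpose(Y1), right)))
    right = [[a - w for a, w in zip(ra, rw)] for ra, rw in zip(right, W)]
    R3 = right[:n1]
    Y2low, R2, T2 = rgeqr3(right[n1:])              # factor remaining part
    Y2 = [[0.0] * (n - n1) for _ in range(n1)] + Y2low   # pad with n1 zero rows
    T3 = [[-v for v in row]
          for row in matmul(matmul(T1, matmul(transpose(Y1), Y2)), T2)]
    Y = [a + b for a, b in zip(Y1, Y2)]
    R = [a + b for a, b in zip(R1, R3)] + [[0.0] * n1 + row for row in R2]
    T = [a + b for a, b in zip(T1, T3)] + [[0.0] * n1 + row for row in T2]
    return Y, R, T

def reconstruct(Y, R, T):
    """Form (I - Y T Y^T) [R; 0] so the factorization can be checked."""
    m, n = len(Y), len(R)
    Rfull = R + [[0.0] * n for _ in range(m - n)]
    C = matmul(Y, matmul(T, matmul(transpose(Y), Rfull)))
    return [[r - c for r, c in zip(rr, rc)] for rr, rc in zip(Rfull, C)]
```

Calling `reconstruct(*rgeqr3(A))` on a small tall matrix should reproduce A up to rounding, which is a quick way to check the signs in the T3 formula.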

6 Some Performance Issues
Overhead from recursion becomes significant as n decreases.
  Cure: Prune the recursion tree - stop recursion at, e.g., n = 4.
Increasing FLOP count for the $Q = I - Y T Y^T$ computations prohibits efficient use of the pure recursive algorithm for large n.
  Cure: Hybrid recursive algorithm.
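The effect of pruning on the size of the recursion tree can be illustrated with a small counting function. This is a sketch under the assumption that the recursion splits n columns into n//2 and n - n//2 and that a leaf of width at most `cutoff` (a hypothetical parameter name) is handled by a non-recursive kernel.

```python
def recursive_calls(n, cutoff=1):
    """Count the calls made when factoring n columns, stopping at width <= cutoff."""
    if n <= cutoff:
        return 1                      # leaf: handled without further recursion
    n1 = n // 2
    return 1 + recursive_calls(n1, cutoff) + recursive_calls(n - n1, cutoff)

full = recursive_calls(64)            # recursion down to single columns
pruned = recursive_calls(64, cutoff=4)
print(full, pruned)
```

For 64 columns the count drops from 127 calls to 31 when the recursion stops at width 4.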

7 Hybrid recursive algorithm RGEQRF
[Y, R, T] = RGEQRF( A(1:m, 1:n) )
do j = 1, n, nb                                   ! nb is the block size
    jb = min(n-j+1, nb)
    ! Factor panel using recursive routine
    [Y, R, T] = RGEQR3( A(j:m, j:j+jb-1) )
    if (j+jb <= n)
        ! Update remaining part of A
        A(j:m, j+jb:n) = (I - Y T^T Y^T) A(j:m, j+jb:n)
    end
end
Relation to LAPACK DGEQRF:
- Level-2 factorization of block panels replaced by recursive level-3 computations
- Level-2 computations of T replaced by recursive level-3 computations in connection with factorization
Increased performance for block panels and T computations
- Optimal block size is larger
- Improved performance of level-3 updates
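The outer panel loop can be sketched in plain Python with the panel factorization passed in as a function. Note the swap: the default panel routine here is the column-by-column (Level 2) construction of Y and T that the slide attributes to DGEQRF, not the recursive RGEQR3; a hybrid variant would pass a recursive routine returning the same (Y, R, T) triple. All function names are hypothetical, indices are 0-based, and m >= n is assumed.

```python
import math

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def householder(x):
    """Return (u, beta, tau) with (I - tau*u*u^T) x = (beta, 0, ..., 0)^T, u[0] = 1."""
    alpha, norm = x[0], math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [1.0] + [0.0] * (len(x) - 1), 0.0, 0.0
    beta = -math.copysign(norm, alpha)
    return [1.0] + [v / (alpha - beta) for v in x[1:]], beta, (beta - alpha) / beta

def factor_panel_level2(P):
    """Column-by-column Householder QR of a rows x jb panel; returns Y, R, T."""
    rows, jb = len(P), len(P[0])
    P = [row[:] for row in P]
    Y = [[0.0] * jb for _ in range(rows)]
    T = [[0.0] * jb for _ in range(jb)]
    for k in range(jb):
        u, beta, tau = householder([P[i][k] for i in range(k, rows)])
        u = [0.0] * k + u
        for c in range(k, jb):        # apply I - tau*u*u^T to the panel columns
            s = tau * sum(u[i] * P[i][c] for i in range(k, rows))
            for i in range(k, rows):
                P[i][c] -= s * u[i]
        # new column of T: -tau * T[:k,:k] * (Y[:, :k]^T u)
        z = [sum(Y[i][p] * u[i] for i in range(rows)) for p in range(k)]
        for p in range(k):
            T[p][k] = -tau * sum(T[p][q] * z[q] for q in range(p, k))
        T[k][k] = tau
        for i in range(rows):
            Y[i][k] = u[i]
    R = [[P[i][c] if c >= i else 0.0 for c in range(jb)] for i in range(jb)]
    return Y, R, T

def blocked_qr(A, nb, factor_panel=factor_panel_level2):
    """Return the n x n factor R of A, computed panel by panel with block size nb."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    for j in range(0, n, nb):
        jb = min(n - j, nb)
        Y, R, T = factor_panel([row[j:j + jb] for row in A[j:]])
        for i in range(j, m):         # store R and zero the part below it
            A[i][j:j + jb] = R[i - j] if i - j < jb else [0.0] * jb
        if j + jb < n:                # update remaining part of A with Q^T
            trail = [row[j + jb:] for row in A[j:]]
            W = matmul(Y, matmul(transpose(T), matmul(transpose(Y), trail)))
            for i, (rt, rw) in enumerate(zip(trail, W)):
                A[j + i][j + jb:] = [a - w for a, w in zip(rt, rw)]
    return A[:n]
```

Because Q has orthonormal columns, R^T R should equal A^T A up to rounding, and the result should not depend on the block size `nb`.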

8 QR: Performance results - 60 MHz Power, m = n

9 QR: Performance results - 33 MHz PPC604e, m > n

10 RGELS - Linear least squares routine
Solve $\min_X \|AX - B\|_F$, where A is m x n, X is n x nrhs, and B is m x nrhs.
X = RGELS( A(1:m, 1:n), B(1:m, 1:nrhs) )
do j = 1, n, nb                                   ! nb is the block size
    jb = min(n-j+1, nb)
    ! Factor panel using recursive routine
    [Y, R, T] = RGEQR3( A(j:m, j:j+jb-1) )
    ! Update remaining part of A & the complete B (rows j:m)
    A(j:m, j+jb:n) = (I - Y T^T Y^T) A(j:m, j+jb:n)
    B = (I - Y T^T Y^T) B
end
! Solve triangular system
X = R^{-1} B
Relation to LAPACK DGELS:
- Level-2 factorization of block panels replaced by recursive level-3 computations
- Level-2 computations of T replaced by recursive level-3 computation in connection with factorization (reuse of T)
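The overall structure (transform A and B together, then solve a triangular system) can be sketched in plain Python. To stay short, this illustration uses panels of width one, i.e. a single Householder transformation per step, instead of the recursive panel routine; it assumes m >= n and full column rank, and the function name `least_squares` is hypothetical.

```python
import math

def least_squares(A, B):
    """Solve min ||A X - B||_F for an m x n matrix A and an m x nrhs matrix B."""
    m, n, nrhs = len(A), len(A[0]), len(B[0])
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    for j in range(n):
        x = [A[i][j] for i in range(j, m)]
        norm = math.sqrt(sum(v * v for v in x))
        if norm == 0.0:
            raise ValueError("matrix is rank deficient")
        beta = -math.copysign(norm, x[0])
        u = [1.0] + [v / (x[0] - beta) for v in x[1:]]
        tau = (beta - x[0]) / beta
        # apply I - tau*u*u^T to the remaining columns of A and to all of B
        for M, first_col in ((A, j), (B, 0)):
            for c in range(first_col, len(M[0])):
                s = tau * sum(u[i - j] * M[i][c] for i in range(j, m))
                for i in range(j, m):
                    M[i][c] -= s * u[i - j]
    # back substitution with the upper triangle of A (now R) and the top n rows of B
    X = [[0.0] * nrhs for _ in range(n)]
    for k in range(nrhs):
        for i in reversed(range(n)):
            s = B[i][k] - sum(A[i][c] * X[c][k] for c in range(i + 1, n))
            X[i][k] = s / A[i][i]
    return X

# Fit y = c0 + c1*x to four points, for two right-hand sides at once.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
B = [[1.0, 1.0], [3.0, 2.0], [5.0, 2.0], [7.0, 4.0]]
X = least_squares(A, B)
```

The first right-hand side lies exactly on the line y = 1 + 2x; the second is inconsistent and yields the least squares fit with intercept and slope both 0.9.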

11 RGELS - Additional Cases
Solve $\min_X \|A^T X - B\|_F$, where A is m x n, X is m x nrhs, and B is n x nrhs:
- LAPACK DGELS computes the LQ factorization of A and solves the remaining triangular system
- Each Householder transformation is computed on a row of A, i.e., elements are stored with stride LDA
- RGELS explicitly transposes A and solves $\min_X \|AX - B\|_F$ for the transposed A
Underdetermined systems:
- LAPACK DGELS computes the minimum norm solution by computing A = QR or A = LQ
- Work on RGELS in progress
- Gains made for overdetermined systems will generalize to the underdetermined case
- Additional gains can be made by removing redundant computations in updates

12 RGELS: NRHS, $\min_X \|AX - B\|_F$

13 RGELS: transposed, M 50, $\min_X \|A^T X - B\|_F$

14 SMP-parallel algorithms for matrix factorizations
Task sequence: F_1, U_1, F_2, U_2, F_3, U_3, ..., U_{r-1}, F_r
1. Factor first panel F_1
2. Update U_1 & factor F_2
3. Update U_2 & factor F_3
etc.
Factor F_i can start when U_{i-1} is completed for that panel.
Update U_i can start when F_i is completed and U_{i-1} is completed for that panel.
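The two start conditions define a task graph that can be explored with a small list scheduler. The sketch below is a model, not the SMP implementation: tasks take unit time, a task is a tuple ("F", i, i) or ("U", i, p) meaning "apply transformation i to panel p", factor tasks are given priority, and the names `deps` and `schedule` are hypothetical.

```python
def deps(task):
    """Tasks that must be completed before `task` may start."""
    kind, i, p = task
    if kind == "F":                       # F_i needs U_{i-1} applied to panel i
        return [("U", i - 1, i)] if i > 0 else []
    d = [("F", i, i)]                     # U_i on panel p needs F_i ...
    if i > 0:
        d.append(("U", i - 1, p))         # ... and U_{i-1} on the same panel
    return d

def schedule(r, nproc):
    """Greedy schedule of r panels on nproc processors; returns one task list per step."""
    pending = [("F", i, i) for i in range(r)]
    pending += [("U", i, p) for i in range(r) for p in range(i + 1, r)]
    done, steps = set(), []
    while pending:
        ready = [t for t in pending if all(d in done for d in deps(t))]
        ready.sort(key=lambda t: (t[0] != "F", t[1], t[2]))   # factor tasks first
        batch = ready[:nproc]
        steps.append(batch)
        done.update(batch)
        pending = [t for t in pending if t not in done]
    return steps

for step, batch in enumerate(schedule(4, 2)):
    print(step, batch)
```

With four panels and two processors, the factorization of the second panel runs in the same step as the last update from the first panel, which is the overlap the start conditions are meant to allow.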

15 Parallel RGEQRF
HRQR(m, n, A, work, ...)
if me = 0 then                                            ! Factor first panel
    call RGEQR3( A(1:m, 1:firstjb) ) -> (Y, R, Tnext)
end if
do while (There is still work enough for me)
    call GETJOB(j, first, last, jb, T, Tnext, Y, nexty, R, dofact, ...)   ! Get a new panel
    A(j:m, first:last) = (I - Y T^T Y^T) A(j:m, first:last)              ! Update
    if dofact then                                                        ! Possibly factor
        call RGEQR3( A(first:m, first:first+jb-1) ) -> (nexty, R, Tnext)
    end if
end while                                                                 ! Repeat

16 GETJOB - the Pool-of-tasks implementation
do while I have not yet found a new task
    Enter Critical section
    If I did factor in my last task: update global variables
    If remaining problem is too small for the current # processors then
        Update global variables and terminate
    else
        Find the next matrix block to update
        Test if it is OK to start working on this block, i.e. test that:
        - no one is writing on any column in this block
        - the block I will read is computed
        - if I will factor: is it safe to overwrite one of the T matrices?
        If (it is OK to start working on this block) then
            Update global variables
        else
            Update variables to show that no block is reserved
        endif
    endif
    Leave Critical section
enddo
Annotations: Inform about completed job. Exit if problem is too small. Dependency problems? Find next task. Inform about any changes to the pool.
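A pool of tasks guarded by a critical section can be sketched with Python threads. This is a simplified model under stated assumptions: the numerical work is replaced by a placeholder, the slide's three safety tests are collapsed into one dependency check, workers stop when the pool is empty rather than when the remaining problem is too small, and all names (`run_pool`, `get_job`, `deps`, `STOP`) are hypothetical.

```python
import threading
import time

STOP = object()                           # sentinel: no work left in the pool

def deps(task):
    """Tasks that must be completed before `task` may start."""
    kind, i, p = task
    if kind == "F":
        return [("U", i - 1, i)] if i > 0 else []
    return [("F", i, i)] + ([("U", i - 1, p)] if i > 0 else [])

def run_pool(r, nthreads):
    """Process all factor/update tasks for r panels; returns the completion order."""
    pending = {("F", i, i) for i in range(r)}
    pending |= {("U", i, p) for i in range(r) for p in range(i + 1, r)}
    done, log = set(), []
    lock = threading.Lock()

    def get_job(finished):
        with lock:                        # critical section
            if finished is not None:      # inform about completed job
                done.add(finished)
                log.append(finished)
            if not pending:               # nothing left: terminate
                return STOP
            for task in sorted(pending, key=lambda t: (t[0] != "F", t[1], t[2])):
                if all(d in done for d in deps(task)):   # dependency problems?
                    pending.remove(task)  # reserve the task
                    return task
            return None                   # no task is ready yet

    def worker():
        finished = None
        while True:
            job = get_job(finished)
            finished = None
            if job is STOP:
                return
            if job is None:
                time.sleep(0.001)         # wait for another thread to finish
                continue
            finished = job                # placeholder for the numerical work

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_pool(5, 3)
```

The completion order varies between runs, but every task should appear exactly once and never before the tasks it depends on.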

17 Parallel performance - 4 processor PPC604e

18 Conclusions
Recursion efficiently provides automatic variable blocking for an arbitrary number of levels in a memory hierarchy
- QR factorization
- Linear least squares problem
Hybrid implementations outperform LAPACK algorithms
- QR factorization: by around 0% for large square problems
- QR factorization: up to a factor .9 for tall-thin matrices
- RGELS, AX - B case: up to a factor of .
- RGELS, A^T X - B case: up to a factor of 5
Speedups up to 3.97 on 4 processors for parallel QR
Serial and parallel QR will be part of the IBM ESSL 3.

19 References
E. Elmroth and F. Gustavson. A New Much Faster and Simpler Algorithm for LAPACK DGELS. Report UMINF, September 2000. Submitted to BIT.
E. Elmroth and F. Gustavson. High-Performance Library Software for QR Factorization. In P. Bjørstad et al. (eds), Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. Lecture Notes in Computer Science. Springer-Verlag, June 2000. (To appear).
E. Elmroth and F. Gustavson. Applying Recursion to Serial and Parallel QR Factorization Leads to Better Performance. IBM J. Research & Development, Vol. 44, No. 4, 2000.
E. Elmroth and F. Gustavson. New Serial and Parallel Recursive QR Factorization Algorithms for SMP Systems. In B. Kågström et al. (eds), Applied Parallel Computing. Large Scale Scientific and Industrial Problems. Lecture Notes in Computer Science, No. 1541, 1998, pp. 120-128. Springer-Verlag.
