Towards high performance IRKA on hybrid CPU-GPU systems

Jens Saak
in collaboration with Georg Pauer (OVGU/MPI Magdeburg), Kapil Ahuja, Ruchir Garg (IIT Indore), Hartwig Anzt, Jack Dongarra (ICL Uni Tennessee Knoxville), Edmond Chow (Georgia Tech)
17th GAMM Workshop on Applied and Numerical Linear Algebra, Cologne, Germany, September 7-8, 2017

Introduction: Projection based Model Order Reduction (MOR)

Linear time-invariant (LTI) system:
\[ E \dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t), \]
with $x(t) \in \mathbb{R}^n$, $u \in \mathbb{R}^p$, $y \in \mathbb{R}^q$.

Projection: $E_r = W^T E V$, $A_r = W^T A V$, $B_r = W^T B$, $C_r = C V$.

Reduced order model (ROM):
\[ E_r \dot{x}_r(t) = A_r x_r(t) + B_r u(t), \qquad \tilde{y}(t) = C_r x_r(t), \]
with $x_r(t) \in \mathbb{R}^r$, $\tilde{y} \in \mathbb{R}^q$.

Jens Saak, saak@mpi-magdeburg.mpg.de, Towards hybrid CPU-GPU IRKA
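
The projection formulas on this slide can be illustrated with a minimal NumPy sketch. The helper name `project`, the toy matrices, and the choice W = V are assumptions made only for this illustration; they are not taken from the talk.

```python
import numpy as np

def project(E, A, B, C, V, W):
    """Petrov-Galerkin projection: returns (E_r, A_r, B_r, C_r)."""
    return W.T @ E @ V, W.T @ A @ V, W.T @ B, C @ V

# Hypothetical toy data: n = 6 states, p = q = 1, reduced order r = 2.
rng = np.random.default_rng(0)
n, r = 6, 2
E = np.eye(n)
A = -np.diag(np.arange(1.0, n + 1))
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal basis of a trial space
W = V                                             # one-sided choice, for illustration only

Er, Ar, Br, Cr = project(E, A, B, C, V, W)
print(Er.shape, Ar.shape, Br.shape, Cr.shape)
```

In practice V and W would come from a method such as IRKA rather than from random vectors.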

Introduction: Description of Today's Focus Problem

MOR of LTI systems often employs the rational Krylov subspace method (RKSM) of some flavor:
  Moment Matching (V, W span rational Krylov subspaces),
  IRKA (iterative rational Krylov algorithm),
  Balanced Truncation (computing the Gramians via ADI or RKSM).

IRKA in a nutshell, e.g. [Gugercin/Antoulas/Beattie 08]:
1. Guess initial ROM or initial shifts,
2. Compute
   $\operatorname{span} V = \operatorname{span}\{(\sigma_1 E - A)^{-1} B, \dots, (\sigma_r E - A)^{-1} B\}$,
   $\operatorname{span} W = \operatorname{span}\{(\sigma_1 E^T - A^T)^{-1} C^T, \dots, (\sigma_r E^T - A^T)^{-1} C^T\}$.
3. Update ROM, set $\sigma = -\Lambda(A_r, E_r)$, repeat until $\sigma$ has converged.
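
The three IRKA steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation discussed in the talk: it assumes a symmetric SISO system (A symmetric negative definite, E symmetric positive definite, C = B^T), so that W = V and all shifts stay real, and it uses dense direct solves where the talk targets preconditioned iterative solvers. The function name `irka_symmetric` and the initial shifts are hypothetical.

```python
import numpy as np

def irka_symmetric(A, E, b, r, tol=1e-8, maxit=100):
    """IRKA sketch for a symmetric SISO system (assumption: A = A^T < 0,
    E = E^T > 0, C = b^T), so W = V and the shifts remain real."""
    sigma = np.logspace(0, 1, r)              # step 1: arbitrary initial shifts
    for _ in range(maxit):
        used = sigma                          # shifts used to build the current V
        # step 2: one shifted solve per shift, columns (sigma_k E - A)^{-1} b
        V = np.column_stack([np.linalg.solve(s * E - A, b) for s in used])
        V, _ = np.linalg.qr(V)                # orthonormal basis of span V
        # step 3: update the ROM and mirror its poles to obtain new shifts
        Ar, Er = V.T @ A @ V, V.T @ E @ V
        sigma = np.sort(-np.linalg.eigvals(np.linalg.solve(Er, Ar)).real)
        if np.linalg.norm(sigma - used) <= tol * np.linalg.norm(used):
            break
    return V, used

# Hypothetical toy problem.
n = 30
A = -np.diag(np.arange(1.0, n + 1))
E = np.eye(n)
b = np.ones(n)
V, shifts = irka_symmetric(A, E, b, r=3)
print(shifts)
```

Each pass through the loop performs r shifted solves, which is why the sequence of shifted linear systems dominates the cost.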

Introduction: Description of Today's Focus Problem

Dominant computation in all these methods (Moment Matching, IRKA, Balanced Truncation): Sequences of Shifted Linear Systems (SLS)
\[ P_j (A + \sigma_j E) X_j = P_j Y_j, \quad \text{for } j \in I \subset \mathbb{N}, \]
with real data $A \in \mathbb{R}^{n \times n}$, $E \in \mathbb{R}^{n \times n}$, $Y_j \in \mathbb{R}^{n \times m}$, $m \ll n$, but possibly complex shifts $\sigma_j \in \mathbb{C}$.

Preconditioner $P_j \in \mathbb{R}^{n \times n}$ usually needs update for every $j$.

Introduction: Strategies for the Shifted Linear Systems

Sequences of Shifted Linear Systems (SLS):
\[ P_j (A + \sigma_j E) X_j = P_j Y_j, \quad \text{for } j \in I \subset \mathbb{N}. \]

We aim at using iterative solvers on hybrid CPU-GPU systems.

Efficiency improving strategies:
1. recycle / realign preconditioner $P_j$ (extending [Anzt/Chow/S./Dongarra 16])
2. recycle solvers / subspaces (following [Ahuja/de Sturler/Gugercin/Chang 12])
3. solve shifted linear systems in parallel

Agenda

1. Introduction
     Projection based Model Order Reduction (MOR)
     Description of Today's Focus Problem
     Strategies for the Shifted Linear Systems
2. Reusing and Realigning Preconditioners
     Prior Work and Our Approach
     The Real aka. Symmetric Case
     The Non-Symmetric aka. Complex Case
     Structure Exploiting Preconditioner
     Experiments
3. Recycling Solvers and Subspaces
     Background
     Preliminary Results
4. Parallel Solution of the Shifted Systems

Reusing and Realigning Preconditioners: Prior Work and Our Approach

1. reuse preconditioner for small matrix changes, i.e. similar shifts
2. update incomplete factorization preconditioner [Bellavia/De Simone/di Serafino/Morini 11, Benzi/Bertaccini 03, Bertaccini 04, Calgaro/Chehab/Saad 10, Duintjer Tebbens/Tůma 07]

Main Issues:
  reuse reduces the efficiency of the preconditioner
  updates have limited potential for parallel execution

Goal: highly parallel preconditioner update and application

Reusing and Realigning Preconditioners: The Real aka. Symmetric Case [Anzt/Chow/S.]

Preconditioned iterative solution of Sequences of Shifted Linear Systems (SLS)
\[ P_j (A + \sigma_j E) X_j = P_j Y_j, \quad \text{for } j \in I \subset \mathbb{N}, \]
for $A$, $E$ symmetric and $\sigma_j$ real.

Idea: Minimize $\|A - U^T U\|$ via fixed point approach [Chow/et al. 15]

Initialization: If $|\sigma_j - \sigma_{j-1}|$ is small enough, then $\|P_j - P_{j-1}\|$ is small, so $P_{j-1}$ is a good initial guess for $P_j$.

Use [Chow/Patel 15, Chow/Anzt/Dongarra 15] for (asynchronous) iterative computation of incomplete factorization preconditioners.
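
The fixed point idea and the warm start can be sketched as follows. This is a dense, sequential toy version of Jacobi-type sweeps for an incomplete Cholesky factor L = U^T on the sparsity pattern of tril(A); the function name `ic_sweeps`, the test matrix, and the shift 0.1 with E = I are assumptions for illustration. The cited fine-grained parallel algorithms update all entries concurrently (possibly asynchronously) on the GPU, which this sketch only mimics by reading from the previous iterate.

```python
import numpy as np

def ic_sweeps(A, L0, sweeps):
    """Fixed-point sweeps for an incomplete Cholesky factor L (A ~ L L^T)
    on the pattern of tril(A). Every entry is computed from the previous
    iterate only, so the updates within one sweep are independent.
    Assumption: L0 is zero outside that pattern."""
    rows, cols = np.nonzero(np.tril(A))
    L = L0.copy()
    for _ in range(sweeps):
        L_old = L.copy()
        for i, j in zip(rows, cols):
            s = A[i, j] - L_old[i, :j] @ L_old[j, :j]
            L[i, j] = np.sqrt(s) if i == j else s / L_old[j, j]
    return L

# Hypothetical toy problem: tridiagonal SPD matrix, shifted by 0.1 * I.
n = 6
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L_prev = ic_sweeps(A, np.tril(A), sweeps=2 * n)      # factor for the previous shift

A_shift = A + 0.1 * np.eye(n)
L_warm = ic_sweeps(A_shift, L_prev, sweeps=1)        # warm start from the old factor
L_cold = ic_sweeps(A_shift, np.tril(A_shift), sweeps=1)

res = lambda L: np.linalg.norm(A_shift - L @ L.T)
print(res(L_warm), res(L_cold))
```

For a nearby shift, a single sweep started from the previous factor leaves a much smaller residual than a single sweep from the standard initial guess, which is the effect the initialization argument relies on.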

Reusing and Realigning Preconditioners: The Real aka. Symmetric Case

Time [ms] and speedup over the conventional IC factorization, by number of sweeps:

  n        IC      Time [ms], sweeps        Speedup, sweeps
                   1      3      5          1        3       5
  371      4.88    0.03   0.06   0.08       167.70   87.14   57.48
  1 357    9.01    0.03   0.06   0.08       309.

Test example: Oberwolfach Collection Steel Cooling Benchmark [Benner/S. 05], here using the FEniCS reimplementation by Maximilian Behr

Reusing and Realigning Preconditioners: The Non-Symmetric aka. Complex Case

Consider $Ax = b$ for $A \in \mathbb{C}^{n \times n}$, $x, b \in \mathbb{C}^n$.
With $A_r, A_i \in \mathbb{R}^{n \times n}$ and $x_r, x_i, b_r, b_i \in \mathbb{R}^n$ write
\[ A = A_r + \imath A_i, \quad x = x_r + \imath x_i, \quad b = b_r + \imath b_i. \]

Idea, e.g. [Day/Heroux 01, Benzi/Bertaccini 08]: Rewrite to real linear system
\[ \begin{bmatrix} A_r & -A_i \\ A_i & A_r \end{bmatrix} \begin{bmatrix} x_r \\ x_i \end{bmatrix} = \begin{bmatrix} b_r \\ b_i \end{bmatrix} \]
and employ block preconditioner.
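
A small NumPy check of the real equivalent formulation may help. The helper name `real_equivalent` and the random test data are assumptions for this sketch; a dense direct solve stands in for the preconditioned iterative solver.

```python
import numpy as np

def real_equivalent(A, b):
    """Real 2n x 2n block matrix and stacked right-hand side for A x = b."""
    Ar, Ai = A.real, A.imag
    K = np.block([[Ar, -Ai], [Ai, Ar]])
    return K, np.concatenate([b.real, b.imag])

# Hypothetical complex test problem.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) + 5 * np.eye(n)
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

K, rhs = real_equivalent(A, b)
z = np.linalg.solve(K, rhs)          # real arithmetic only
x = z[:n] + 1j * z[n:]               # reassemble x = x_r + i x_i
print(np.linalg.norm(A @ x - b))
```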

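Reusing and Realigning Preconditioners: Structure Exploiting Preconditioner

In our special case
\[ A_{j,r} = A + \operatorname{Re}(\sigma_j) E, \qquad A_{j,i} = \operatorname{Im}(\sigma_j) E. \]
Let
\[ J_j = \begin{bmatrix} 0 & -\operatorname{Im}(\sigma_j) \\ \operatorname{Im}(\sigma_j) & 0 \end{bmatrix}. \]
Then
\[ A_j \tilde{V}_j = W_j
   \;\Leftrightarrow\;
   \begin{bmatrix} A_{j,r} & -A_{j,i} \\ A_{j,i} & A_{j,r} \end{bmatrix} \begin{bmatrix} V_{j,r} \\ V_{j,i} \end{bmatrix} = \begin{bmatrix} W_{j,r} \\ W_{j,i} \end{bmatrix}
   \;\Leftrightarrow\;
   (I_2 \otimes A_{j,r} + J_j \otimes E) \tilde{V}_j = W_j. \]

Idea: $P := (I_2 \otimes A_{j,r})^{-1} = I_2 \otimes A_{j,r}^{-1}$

With this preconditioner the counter diagonal blocks are:
\[ \pm \operatorname{Im}(\sigma_j) (A + \operatorname{Re}(\sigma_j) E)^{-1} E
   = \pm \frac{\operatorname{Im}(\sigma_j)}{\operatorname{Re}(\sigma_j)} (A + \operatorname{Re}(\sigma_j) E)^{-1} (A + \operatorname{Re}(\sigma_j) E - A)
   = \pm \frac{\operatorname{Im}(\sigma_j)}{\operatorname{Re}(\sigma_j)} \left( I_n - (A + \operatorname{Re}(\sigma_j) E)^{-1} A \right). \]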
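
The Kronecker structure and the effect of the preconditioner can be verified numerically. The matrices and the shift below are hypothetical toy data, and the explicit inverse is used only to make the check short; an actual preconditioner would apply an incomplete factorization of A_{j,r} instead.

```python
import numpy as np

# Hypothetical toy data (not from the talk).
rng = np.random.default_rng(2)
n = 4
A = np.diag(np.arange(1.0, n + 1)) + 0.1 * rng.standard_normal((n, n))
E = np.eye(n) + 0.1 * np.diag(rng.random(n))
sigma = 2.0 + 3.0j

Ajr = A + sigma.real * E
Aji = sigma.imag * E
J = np.array([[0.0, -sigma.imag], [sigma.imag, 0.0]])

# Block form and Kronecker form of the real equivalent system matrix.
K_block = np.block([[Ajr, -Aji], [Aji, Ajr]])
K_kron = np.kron(np.eye(2), Ajr) + np.kron(J, E)

# Preconditioner P = I_2 (x) Ajr^{-1}; explicit inverse for illustration only.
P = np.kron(np.eye(2), np.linalg.inv(Ajr))
PK = P @ K_kron

# Predicted counter diagonal block: Im/Re * (I_n - Ajr^{-1} A).
off = sigma.imag / sigma.real * (np.eye(n) - np.linalg.solve(Ajr, A))
print(np.linalg.norm(PK[n:, :n] - off))
```

The diagonal blocks of the preconditioned matrix are identities, so only the counter diagonal blocks determine how far it is from the identity.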

Reusing and Realigning Preconditioners: FEniCS Convect. Diffusion on Unit Square (model data by Maximilian Behr)

n = 9801

[Figure: P3-GMRES iterations for SLS (tol = 10^{-8}, maxiter = 250); curves: no prec, block ILU(0), block ILU(1), block ILU(2), block ILU(3), block \, full ILU(0)]

Recycling Solvers and Subspaces

Recall IRKA needs:
  $\operatorname{span} V = \operatorname{span}\{(\sigma_1 E - A)^{-1} B, \dots, (\sigma_k E - A)^{-1} B\}$,
  $\operatorname{span} W = \operatorname{span}\{(\sigma_1 E^T - A^T)^{-1} C^T, \dots, (\sigma_k E^T - A^T)^{-1} C^T\}$.

Idea [Ahuja/de Sturler/Gugercin/Chang 12]:
  Need to solve with dual operators: use BiCG.
  MIMO systems (i.e., p, q > 1): repeatedly solve with different right hand sides, so update the subspace, i.e. recycle.
  Upon convergence $\sigma_j^{(i)} \approx \sigma_j^{(i+1)}$: employ deflation, i.e., recycle.
  Result: RBiCG for IRKA.

Current focus with Kapil Ahuja, Ruchir Garg and Georg Pauer: CUDA accelerated version
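
To make the "dual operators" point concrete, here is a minimal sketch of plain, unpreconditioned BiCG in real arithmetic, which solves a system with M and a system with M^T in the same iteration. It is not the recycling variant (RBiCG) named on the slide: no subspace is recycled and nothing is deflated. The function name `bicg` and the toy matrix, which stands in for a shifted operator such as sigma E - A, are assumptions.

```python
import numpy as np

def bicg(M, b, b_dual, tol=1e-10, maxit=200):
    """Plain BiCG: solves M x = b and M^T xd = b_dual simultaneously,
    starting from zero initial guesses."""
    x, xd = np.zeros_like(b), np.zeros_like(b_dual)
    r, rd = b.copy(), b_dual.copy()            # primal and dual residuals
    p, pd = r.copy(), rd.copy()                # primal and dual search directions
    rho = rd @ r
    for _ in range(maxit):
        q, qd = M @ p, M.T @ pd                # one product with M, one with M^T
        alpha = rho / (pd @ q)
        x, xd = x + alpha * p, xd + alpha * pd
        r, rd = r - alpha * q, rd - alpha * qd
        if max(np.linalg.norm(r) / np.linalg.norm(b),
               np.linalg.norm(rd) / np.linalg.norm(b_dual)) <= tol:
            break
        rho_new = rd @ r
        beta = rho_new / rho
        p, pd = r + beta * p, rd + beta * pd
        rho = rho_new
    return x, xd

# Hypothetical nonsymmetric toy matrix playing the role of sigma E - A.
n = 8
M = 5.0 * np.eye(n) + np.eye(n, k=1) - 0.5 * np.eye(n, k=-1)
b = np.ones(n)                                 # stands in for a column of B
c = np.linspace(1.0, 2.0, n)                   # stands in for a column of C^T
x, xd = bicg(M, b, c)
print(np.linalg.norm(M @ x - b), np.linalg.norm(M.T @ xd - c))
```

One BiCG run therefore delivers a column of V and a column of W for the same shift, and the short recurrences keep the memory footprint constant per iteration.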

Recycling Solvers and Subspaces: Preliminary Results

MATLAB results [Ahuja/de Sturler/Gugercin/Chang 12]

[Table: columns n, σ, r, drop tol, IRKA steps; iteration count (BiCG, RBiCG, Ratio); time (s) (BiCG, RBiCG, Ratio)]

Test example: Oberwolfach Collection Steel Cooling Benchmark [Benner/S. 05], available from: 20Profiles%20% %29

Next: Parallel Solution of the Shifted Systems

Observation: Iterative solver needs sparse matrix vector product $(A + \sigma_j E) x_j = y_j$, which is memory bandwidth bounded.

Idea: Define $X = [x_1, \dots, x_r]$, $Y = [y_1, \dots, y_r]$ and $\Sigma = \operatorname{diag}(\sigma)$. Then
\[ Y = A X + E X \Sigma, \]
and
  the sparse matrix vector product becomes a sparse-dense matrix matrix product,
  2r sparse matrix sweeps become 2 sparse matrix sweeps.

Requires short recurrence method (e.g. BiCG) to limit memory usage.
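
The batching identity can be checked with a short NumPy sketch. Dense random matrices stand in for the sparse A and E, and the shifts are arbitrary; both are assumptions for illustration. On a GPU the two products A X and E X would be sparse-dense matrix products.

```python
import numpy as np

# Hypothetical toy data; dense stand-ins for the sparse matrices A and E.
rng = np.random.default_rng(3)
n, r = 50, 4
A = rng.standard_normal((n, n))
E = rng.standard_normal((n, n))
X = rng.standard_normal((n, r))
sigma = np.array([0.5, 1.0, 2.0, 4.0])

# Column by column: r passes over A and r passes over E.
Y_cols = np.column_stack([(A + s * E) @ X[:, j] for j, s in enumerate(sigma)])

# Batched form Y = A X + E X Sigma: one pass over A and one pass over E.
# Scaling column j of E X by sigma_j is the same as multiplying by diag(sigma).
Y_batched = A @ X + (E @ X) * sigma

print(np.linalg.norm(Y_cols - Y_batched))
```

Each matrix entry is read once per batch instead of once per shift, which is what relieves the memory bandwidth bound.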

Conclusions & Outlook

Conclusions:
  analysis of the problem structure gives rise to 3 promising implementation candidates
  Preconditioner realignment and Subspace recycling can even be combined

Outlook:
  proper HPC implementation pending
  comparison of the strategies (Master Thesis Georg Pauer)

Thank you for your attention!

Updating incomplete factorization preconditioners for model order reduction

DOI 10.1007/s11075-016-0110-2 ORIGINAL PAPER Updating incomplete factorization preconditioners for model order reduction Hartwig Anzt 1 Edmond Chow 2 Jens Saak 3 Jack Dongarra 1,4,5 Received: 18 September