Accelerating Model Reduction of Large Linear Systems with Graphics Processors


P. Benner (1), P. Ezzatti (2), D. Kressner (3), E.S. Quintana-Ortí (4), Alfredo Remón (4)

(1) Max Planck Institute for Dynamics of Complex Technical Systems (Magdeburg, Germany).
(2) Centro de Cálculo-Inst. de la Computación, Univ. de la República (Montevideo, Uruguay).
(3) Seminar für Angewandte Mathematik, ETHZ (Zürich, Switzerland).
(4) Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (Castellón, Spain).

ModRed 10 - December 2010
remon@uji.es A. Remón Model Reduction of Large Linear Systems with GPUs

Why GPUs?

[Figure 1-1: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU. Extracted from: CUDA C Programming Guide 3.1, NVIDIA Corporation.]

GPUs for general purpose programming

- GPUs were not developed for general programming.
- Difficult and slow code generation.
- CUDA: CUDA (Compute Unified Device Architecture) appeared in 2006. Created by NVIDIA, it is a hardware-software platform that facilitates the use of GPUs in general purpose programming.
  - Software: compilers (C, Fortran), libraries (CUFFT, CUBLAS, ...).
  - Hardware: efficient thread management and memory access.
- Gap between single and double precision performance.
- Fermi: the Fermi architecture appeared in 2010. Double precision computations are only two times slower than single precision computations.
- For a list of scientific CUDA applications visit

Outline

- Model reduction
- Model reduction via the BT method
- Matrix sign function method
- Numerical results

Model Reduction: Purpose

Given

$$E\dot{x}(t) = Ax(t) + Bu(t), \quad t > 0, \quad x(0) = x^0,$$
$$y(t) = Cx(t) + Du(t), \quad t \ge 0,$$

find a reduced model

$$E_r\dot{x}_r(t) = A_r x_r(t) + B_r u(t), \quad t > 0, \quad x_r(0) = x_r^0,$$
$$y_r(t) = C_r x_r(t) + D_r u(t), \quad t \ge 0,$$

of order $r \ll n$ and output error

$$y - y_r = Gu - G_r u = (G - G_r)u$$

such that $\|y - y_r\|$ and $\|G - G_r\|$ are small!
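The error $G - G_r$ can be probed numerically. The sketch below assumes that $G$ denotes the usual transfer function $G(s) = C(sE - A)^{-1}B + D$ (the slide does not define it), and it samples the spectral norm of $G(j\omega) - G_r(j\omega)$ at a few frequencies. The function names `transfer_function` and `sampled_error` are hypothetical, chosen for illustration only.

```python
import numpy as np

def transfer_function(E, A, B, C, D, s):
    """Evaluate G(s) = C (sE - A)^{-1} B + D at one complex frequency s.

    Assumption: this is the transfer function the slide calls G.
    """
    return C @ np.linalg.solve(s * E - A, B) + D

def sampled_error(full, reduced, omegas):
    """Largest spectral norm of G(jw) - G_r(jw) over the sample frequencies.

    `full` and `reduced` are tuples (E, A, B, C, D). Sampling only gives
    a lower estimate of the true worst-case error.
    """
    return max(
        np.linalg.norm(
            transfer_function(*full, 1j * w) - transfer_function(*reduced, 1j * w), 2
        )
        for w in omegas
    )
```

A dense solve per frequency is only reasonable for small or moderate orders; it is meant as a check on a reduced model, not as part of the reduction itself.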


Model Reduction: Example

Optimal cooling of steel profiles
- Arises in a manufacturing method for steel profiles.
- Objective: reduce the temperature as fast as possible.
- Method: spraying of cooling fluids on the surface.
- Goal: material properties (durability, porosity) have to satisfy quality standards.
- Problem dimensions: n = 5,177, m = 7, and p = 6.
- Math. model: STEEL I from the Oberwolfach benchmark collection.
- Oberwolfach benchmark collection:
- Model details: [Tröltzsch/Unger 1999/2001], [Penzl 1999] and [Saak 2003].


Balanced Truncation (BT) method

Procedure composed of three steps:

1. Solve the coupled generalized Lyapunov matrix equations

$$A W_c E^T + E W_c A^T + B B^T = 0,$$
$$A^T \hat{W}_o E + E^T \hat{W}_o A + C^T C = 0,$$

with $W_o = E^T \hat{W}_o E$, for $S$, $R$ such that $W_c = S^T S$, $W_o = R^T R$.

2. Compute

$$S R^T = U \Sigma V^T = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix},$$

with $\Sigma_1 \in \mathbb{R}^{r \times r}$, $\Sigma_2 \in \mathbb{R}^{(n-r) \times (n-r)}$.

Balanced Truncation method (Cont.)

3. In the last stage

$$T_l = \Sigma_1^{-1/2} V_1^T R \quad \text{and} \quad T_r = S^T U_1 \Sigma_1^{-1/2},$$

and

$$(A_r, B_r, C_r, D_r, E_r) = (T_l A T_r,\ T_l B,\ C T_r,\ D,\ T_l E T_r).$$

The state-space dimension $r$ of the reduced-order model can be chosen adaptively, as this method provides a realization $G_r$ satisfying

$$\|G - G_r\|_\infty \le 2 \sum_{j=r+1}^{n} \sigma_j,$$

where $\sigma_j$ are the diagonal entries of $\Sigma$.

The most expensive computation is the solution of the generalized Lyapunov equations.
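The three BT steps can be sketched for a small dense system. This is a minimal illustration under stated assumptions, not the implementation used in the talk: it takes $E = I$ and $D = 0$, solves the Lyapunov equations by Kronecker vectorisation (practical only for small $n$) instead of the sign function method, and assumes both Gramians are positive definite so that a Cholesky factorisation exists. The names `lyap` and `balanced_truncation` are hypothetical.

```python
import numpy as np

def lyap(A, Q):
    """Solve A X + X A^T + Q = 0 by Kronecker vectorisation (small n only)."""
    n = A.shape[0]
    I = np.eye(n)
    K = np.kron(A, I) + np.kron(I, A)
    X = np.linalg.solve(K, -Q.reshape(-1)).reshape(n, n)
    return (X + X.T) / 2  # symmetrise against rounding

def balanced_truncation(A, B, C, r):
    """Square-root BT for E = I; returns (A_r, B_r, C_r, sigma)."""
    # Step 1: Gramians and their factors, W_c = S^T S and W_o = R^T R.
    Wc = lyap(A, B @ B.T)
    Wo = lyap(A.T, C.T @ C)
    S = np.linalg.cholesky(Wc).T
    R = np.linalg.cholesky(Wo).T
    # Step 2: SVD of S R^T; sigma holds the diagonal of Sigma.
    U, sigma, Vt = np.linalg.svd(S @ R.T)
    # Step 3: projection matrices T_l and T_r.
    scale = sigma[:r] ** -0.5
    Tl = (scale[:, None] * Vt[:r]) @ R   # Sigma_1^{-1/2} V_1^T R
    Tr = S.T @ (U[:, :r] * scale)        # S^T U_1 Sigma_1^{-1/2}
    return Tl @ A @ Tr, Tl @ B, C @ Tr, sigma
```

With these factors, `2 * sigma[r:].sum()` evaluates the right-hand side of the error bound, which gives a way to pick `r` for a target accuracy.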


Matrix sign function method

Remarks
- It is an efficient tool to solve stable Lyapunov equations.
- There are different schemes to compute the matrix sign function, such as the Newton iteration.

The Newton iteration for the matrix sign function:

$$A_0 = A, \qquad A_{k+1} = \tfrac{1}{2}\left(A_k + A_k^{-1}\right), \quad k = 0, 1, 2, \ldots$$

Main features:
- Simple.
- Efficient in parallel implementations.
- Asymptotically quadratic convergence.
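The Newton iteration above can be written in a few lines. The sketch below is unscaled and dense; the stopping test on the relative change between iterates and the name `matrix_sign` are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

def matrix_sign(A, tol=1e-12, max_iter=100):
    """Newton iteration A_{k+1} = (A_k + A_k^{-1}) / 2 for sign(A).

    Assumes A has no eigenvalues on the imaginary axis.
    Returns the approximate sign matrix and the iteration count.
    """
    Ak = np.array(A, dtype=float)
    for k in range(1, max_iter + 1):
        A_next = 0.5 * (Ak + np.linalg.inv(Ak))
        # Stop when the relative change between iterates is negligible.
        if np.linalg.norm(A_next - Ak, 1) <= tol * np.linalg.norm(A_next, 1):
            return A_next, k
        Ak = A_next
    raise RuntimeError("Newton iteration did not converge")
```

For a stable matrix (all eigenvalues in the open left half-plane) the iterates converge to $-I$, which is the case relevant to stable Lyapunov equations.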

Matrix sign function method

Low-rank factors version of the algorithm
- On convergence after $j$ iterations, $W_c \approx S^T S$ and $W_o \approx R^T R$.
- Convergence can be accelerated using a scaling factor, in our case $c_k = \sqrt{\|A_k\|_F / \|E A_k^{-1} E\|_F}$.
- Even if $A$ is sparse, the iterates $\{A_k\}_{k=1,2,\ldots}$ are in general dense matrices.
- Requires $O(n^3)$ floating-point operations per iteration.
- The most computationally expensive step is the matrix inversion.

Matrix sign function method

Low-rank factors version of the algorithm

Algorithm 1 CGCLNC
1: $A_0 = A$, $\hat{S}_0 = B^T$, $\hat{R}_0 = C$.
2: $k = 0$.
3: repeat
4:   $A_{k+1} = \frac{1}{2}\left(A_k / c_k + c_k (E A_k^{-1}) E\right)$.
5:   Compute the rank-revealing QR (RRQR) decomposition
     $\frac{1}{\sqrt{2 c_k}} \begin{bmatrix} \hat{S}_k \\ c_k \hat{S}_k (E A_k^{-1})^T \end{bmatrix} = Q_s \begin{bmatrix} U_s \\ 0 \end{bmatrix} \Pi_s$.
6:   $\hat{S}_{k+1} \leftarrow U_s \Pi_s$.
7:   Compute the rank-revealing QR (RRQR) decomposition
     $\frac{1}{\sqrt{2 c_k}} \begin{bmatrix} \hat{R}_k \\ c_k (\hat{R}_k A_k^{-1}) E \end{bmatrix} = Q_r \begin{bmatrix} U_r \\ 0 \end{bmatrix} \Pi_r$.
8:   $\hat{R}_{k+1} \leftarrow U_r \Pi_r$.
9:   $k = k + 1$.
10: until $\|A_k + E\|_1 < \tau \|A_k\|_1$.
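A dense NumPy sketch of Algorithm 1 follows, under several assumptions that should be read as such. An SVD-based compression is swapped in for the RRQR, because NumPy alone offers no pivoted QR; it keeps the same property, namely that the compressed factor $F$ satisfies $F^T F \approx M^T M$. The final rescaling (by $1/\sqrt{2}$, and by $E^{-T}$ for the controllability factor) comes from the standard sign-function derivation and is not shown on the slide. The names `compress` and `sign_lyapunov_factors` are hypothetical.

```python
import numpy as np

def compress(M, rtol=1e-12):
    """Full-row-rank F with F.T @ F ~= M.T @ M (SVD used in place of RRQR)."""
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    keep = s > rtol * s[0]
    return s[keep][:, None] * Vt[keep]

def sign_lyapunov_factors(A, E, B, C, tau=1e-10, max_iter=50):
    """Low-rank factors S, R with W_c ~= S^T S and W_o ~= R^T R."""
    Ak = A.astype(float)
    S = B.T.astype(float)   # S_0 = B^T
    R = C.astype(float)     # R_0 = C
    for _ in range(max_iter):
        Ainv = np.linalg.inv(Ak)
        EAinv = E @ Ainv                  # E A_k^{-1}
        EAinvE = EAinv @ E                # (E A_k^{-1}) E
        c = np.sqrt(np.linalg.norm(Ak, "fro") / np.linalg.norm(EAinvE, "fro"))
        # Factor updates use A_k, so they come before A_k is overwritten.
        S = compress(np.vstack([S, c * S @ EAinv.T]) / np.sqrt(2 * c))
        R = compress(np.vstack([R, c * (R @ Ainv) @ E]) / np.sqrt(2 * c))
        Ak = 0.5 * (Ak / c + c * EAinvE)
        # For a stable pencil the iterates A_k approach -E.
        if np.linalg.norm(Ak + E, 1) < tau * np.linalg.norm(Ak, 1):
            break
    # Assumed final rescaling: W_c = E^{-1} (S^T S) E^{-T} / 2, W_o = R^T R / 2.
    S = np.linalg.solve(E, S.T).T / np.sqrt(2)
    R = R / np.sqrt(2)
    return S, R
```

The explicit inverse mirrors the slides, which identify the matrix inversion as the dominant cost of each iteration.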

Hybrid implementation

Computations performed at iteration $j$ (CPU / GPU):

1. $P A_j = LU$ (*)
2. $E A_j^{-1}$; $\hat{R}_j A_j^{-1}$
3. $(E A_j^{-1}) E$
4. Compute the rank-revealing QR (RRQR) decomposition
   $\frac{1}{\sqrt{2 c_j}} \begin{bmatrix} \hat{S}_j \\ c_j \hat{S}_j (E A_j^{-1})^T \end{bmatrix} = Q_s \begin{bmatrix} U_s \\ 0 \end{bmatrix} \Pi_s$
5. $\hat{S}_{j+1} \leftarrow U_s \Pi_s$
6. Compute the rank-revealing QR (RRQR) decomposition
   $\frac{1}{\sqrt{2 c_j}} \begin{bmatrix} \hat{R}_j \\ c_j (\hat{R}_j A_j^{-1}) E \end{bmatrix} = Q_r \begin{bmatrix} U_r \\ 0 \end{bmatrix} \Pi_r$
7. $\hat{R}_{j+1} \leftarrow U_r \Pi_r$
8. $A_{j+1} = \frac{1}{2}\left(A_j / c_j + c_j (E A_j^{-1}) E\right)$

(*) CPU and GPU cooperate during this operation.
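Steps 1 to 3 share a single factorisation of $A_j$. The CPU-only sketch below illustrates that reuse: it stacks $E$ and $\hat{R}_j$ as right-hand sides and solves once with $A_j^T$, so both $E A_j^{-1}$ and $\hat{R}_j A_j^{-1}$ come out of one LU-based solve. It says nothing about how the talk splits the work between CPU and GPU, and the name `iteration_kernels` is hypothetical.

```python
import numpy as np

def iteration_kernels(Ak, E, R):
    """Steps 1-3 of one iteration with one LU-based solve.

    X = [E; R] A_k^{-1} is obtained from A_k^T X^T = [E; R]^T, so
    np.linalg.solve factorises A_k^T once for all right-hand sides.
    """
    n = E.shape[0]
    stacked = np.vstack([E, R])                 # right-hand sides [E; R]
    X = np.linalg.solve(Ak.T, stacked.T).T      # [E; R] A_k^{-1}
    EAinv, RAinv = X[:n], X[n:]
    return EAinv, RAinv, EAinv @ E              # E A^{-1}, R A^{-1}, (E A^{-1}) E
```

Solving against stacked right-hand sides avoids forming $A_j^{-1}$ explicitly; whether that is preferable on a GPU depends on the kernels available, which this sketch does not address.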


Model reduction
Hardware and software

Hardware
- Platform consisting of two Intel Xeon QuadCore E5410 processors at 2.33 GHz, connected to an NVIDIA Tesla C1060 via a PCI-e bus.

Software
- LAPACK (CPU): all the computations are performed on the CPU using LAPACK and BLAS kernels (MKL v.10.2).
- Hybrid (CPU+GPU): computations are executed on the most convenient architecture, minimizing the communications (MKL v.10.2 + CUBLAS v.2.1).

Model reduction
Problem definition: Optimal cooling of steel profiles

Model STEEL I from the Oberwolfach benchmark collection
- Arises in a manufacturing method for steel profiles.
- The objective is to design a control that yields moderate temperature gradients when the rail is cooled down.
- The model corresponds to a 2-D heat equation.
- Dimensions of the problem: n = 5,177, m = 7, p = 6.
- Math. model: [Tröltzsch/Unger 1999/2001], [Penzl 1999] and [Saak 2003].
- Oberwolfach benchmark collection:

Model reduction
Problem definition: Convective Thermal Flow Problems

Model FLOW METER from the Oberwolfach benchmark collection
- It is a 2-D model of an anemometer-like structure.
- It mainly consists of a tube and a small heat source.
- The model is given by a spatially semi-discretized instationary convection-diffusion equation.
- The reference temperature is set to 300 K, and Dirichlet boundary conditions as well as initial conditions are set to 0 with respect to the reference.
- Dimensions of the problem: n = 9,669, m = 1, p = 5.
- Math. model: [Harper 1997], [Ernst 2001] and [Mossmann 2004].
- Oberwolfach benchmark collection:

Model reduction
Results for benchmark STEEL I

[Table: #Iter. k; Time (s) for Hybrid Impl. and LAPACK; Conv. criterion $\|A_k + E\|_F$. The numerical entries are not recoverable from the source.]

Total: time is reduced by 45%.
Problem dimensions: n = 5,177, m = 7, p = 6.

Model reduction
Results for benchmark STEEL I (hybrid implementation)

[Table: #Iter. k; Time (s) for $P A_k = LU$; $E A_k^{-1}$, $\hat{R}_k A_k^{-1}$; $(E A_k^{-1}) E$; $\hat{S}_k (E A_k^{-1})^T$, $(\hat{R}_k A_k^{-1}) E$, Compress; Iteration; Accumulated time (s). The numerical entries are not recoverable from the source.]

Problem dimensions: n = 5,177, m = 7, p = 6.

Model reduction
Results for benchmark FLOW METER

[Table: #Iter. k; Time (s) for Hybrid Impl. and LAPACK; Conv. criterion $\|A_k + E\|_F$. The numerical entries are not recoverable from the source.]

Total: time is reduced by 46%.
Problem dimensions: n = 9,669, m = 1, p = 5.

Model reduction
Results for benchmark FLOW METER (hybrid implementation)

[Table: #Iter. k; Time (s) for $P A_k = LU$; $E A_k^{-1}$, $\hat{R}_k A_k^{-1}$; $(E A_k^{-1}) E$; $\hat{S}_k (E A_k^{-1})^T$, $(\hat{R}_k A_k^{-1}) E$, Compress; Iteration; Accumulated time (s). The numerical entries are not recoverable from the source.]

Problem dimensions: n = 9,669, m = 1, p = 5.

Thanks... Any questions?
