Accelerating Model Reduction of Large Linear Systems with Graphics Processors


P. Benner (1), P. Ezzatti (2), D. Kressner (3), E.S. Quintana-Ortí (4), Alfredo Remón (4)

(1) Max-Planck-Institute for Dynamics of Complex Technical Systems (Magdeburg, Germany)
(2) Centro de Cálculo-Inst. de la Computación, Univ. de la República (Montevideo, Uruguay)
(3) Seminar für Angewandte Mathematik, ETHZ (Zürich, Switzerland)
(4) Depto. de Ingeniería y Ciencia de Computadores, Universidad Jaume I (Castellón, Spain)

ModRed 10 - December 2010
Contact: remon@uji.es

Why GPUs?

[Figure 1-1, "Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU", extracted from the CUDA C Programming Guide 3.1, NVIDIA Corporation: GPUs deliver substantially higher peak floating-point throughput and memory bandwidth than contemporary CPUs.]

GPUs for general-purpose programming

GPUs were not developed for general-purpose programming: code generation used to be difficult and slow.

CUDA
- In 2006, NVIDIA introduced CUDA (Compute Unified Device Architecture), a hardware-software platform that facilitates the use of GPUs for general-purpose programming.
- Software: compilers (C, Fortran) and libraries (CUFFT, CUBLAS, ...).
- Hardware: efficient thread management and memory access.
- Drawback: a large gap between single- and double-precision performance.

Fermi
- In 2010, the Fermi architecture appeared: double-precision computations are only two times slower than single-precision computations.

For a list of scientific CUDA applications, visit http://www.nvidia.com/object/cuda_apps_flash_new.html

Outline

- Model reduction
- Model reduction via the BT method
- Matrix sign function method
- Numerical results

Model Reduction: Purpose

Given the system

    E ẋ(t) = A x(t) + B u(t),   t > 0,   x(0) = x^0,
      y(t) = C x(t) + D u(t),   t ≥ 0,

find a reduced-order model

    E_r ẋ_r(t) = A_r x_r(t) + B_r u(t),   t > 0,   x_r(0) = x_r^0,
        y_r(t) = C_r x_r(t) + D_r u(t),   t ≥ 0,

of order r << n, with output error

    y − y_r = G u − G_r u = (G − G_r) u,

such that ‖y − y_r‖ and ‖G − G_r‖ are small!

Model Reduction: Example

Optimal cooling of steel profiles:
- Arises in a manufacturing method for steel profiles.
- Objective: reduce the temperature as fast as possible.
- Method: spraying of cooling fluids on the surface.
- Goal: material properties (durability, porosity) have to satisfy quality standards.
- Problem dimensions: n = 5177, m = 7, p = 6.
- Math. model: STEEL I from the Oberwolfach benchmark collection (http://www.imtek.de/simulation/benchmark/).
- Model details: [Tröltzsch/Unger 1999/2001], [Penzl 1999] and [Saak 2003].

Outline

- Model reduction
- Model reduction via the BT method
- Matrix sign function method
- Numerical results

Balanced Truncation (BT) method

Procedure composed of three steps:

1. Solve the coupled generalized Lyapunov matrix equations

       A W_c E^T + E W_c A^T + B B^T = 0,
       A^T W_o E + E^T W_o A + C^T C = 0,

   with Ŵ_o = E^T W_o E, for factors S, R such that W_c = S^T S and Ŵ_o = R^T R.

2. Compute the singular value decomposition

       S R^T = U Σ V^T = [ U_1  U_2 ] [ Σ_1   0  ] [ V_1^T ]
                                      [  0   Σ_2 ] [ V_2^T ],

   with Σ_1 ∈ R^{r×r} and Σ_2 ∈ R^{(n−r)×(n−r)}.

Balanced Truncation method (Cont.)

3. In the last stage, form

       T_l = Σ_1^{-1/2} V_1^T R   and   T_r = S^T U_1 Σ_1^{-1/2},

   and set (A_r, B_r, C_r, D_r, E_r) = (T_l A T_r, T_l B, C T_r, D, T_l E T_r).

The state-space dimension r of the reduced-order model can be chosen adaptively, as this method provides a realization G_r satisfying

       ‖G − G_r‖_∞ ≤ 2 Σ_{j=r+1}^{n} σ_j.

The most expensive computation is the solution of the coupled generalized Lyapunov equations.
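To make steps 1-3 concrete, here is a minimal NumPy/SciPy sketch for the standard case E = I (SciPy's dense Lyapunov solver does not handle the generalized equations; the function name bt_reduce and the dense-Cholesky route are illustrative assumptions, not the implementation used in this work, which builds low-rank factors S, R with the sign-function iteration described next):

    import numpy as np
    from scipy import linalg

    def bt_reduce(A, B, C, D, r):
        # Step 1 (sketch for E = I, A assumed stable): Gramians and their
        # Cholesky factors, W_c = S^T S and W_o = R^T R.
        Wc = linalg.solve_continuous_lyapunov(A, -B @ B.T)
        Wo = linalg.solve_continuous_lyapunov(A.T, -C.T @ C)
        S = linalg.cholesky(Wc)   # upper triangular, Wc = S^T S
        R = linalg.cholesky(Wo)   # upper triangular, Wo = R^T R

        # Step 2: SVD of S R^T; sigma holds the Hankel singular values.
        U, sigma, Vt = linalg.svd(S @ R.T)
        U1, V1t, s1 = U[:, :r], Vt[:r, :], sigma[:r]

        # Step 3: projection matrices and reduced-order realization.
        Tl = np.diag(s1 ** -0.5) @ V1t @ R
        Tr = S.T @ U1 @ np.diag(s1 ** -0.5)
        # Adaptive choice of r: the error bound is 2 * sigma[r:].sum().
        return Tl @ A @ Tr, Tl @ B, C @ Tr, D, sigma

Since the Hankel singular values are returned, r can be picked adaptively against the error bound above.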

Outline

- Model reduction
- Model reduction via the BT method
- Matrix sign function method
- Numerical results

Matrix sign function method

Remarks:
- It is an efficient tool to solve stable Lyapunov equations.
- There are different schemes to compute the matrix sign function, like the Newton iteration.

The Newton iteration for the matrix sign function:

    A_0 = A,    A_{k+1} = (A_k + A_k^{-1}) / 2,    k = 0, 1, 2, ...

Main features:
- Simple.
- Efficient in parallel implementations.
- Asymptotic quadratic convergence.

Matrix sign function method

Low-rank factor version of the algorithm:
- On convergence, after j iterations, W_c ≈ S^T S and Ŵ_o ≈ R^T R.
- Convergence can be accelerated using a scaling factor; in our case, c_k = (‖A_k‖_F / ‖E A_k^{-1} E‖_F)^{1/2}.
- Even if A is sparse, the iterates {A_k}, k = 1, 2, ..., are in general dense matrices.
- The iteration requires O(n^3) floating-point operations per step.
- The most computationally expensive operation is the matrix inversion.
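A minimal dense NumPy sketch of this scaled iteration on the pencil (A, E) may help fix ideas; it is not the hybrid CPU-GPU code, and details such as the stopping test and the square root in c_k (which equilibrates the norms of the two summands) should be read as assumptions:

    import numpy as np
    from scipy import linalg

    def sign_newton(A, E, tau=1e-8, maxit=50):
        # Scaled Newton iteration A_{k+1} = (A_k/c_k + c_k (E A_k^{-1}) E)/2.
        # For a stable pencil, A_k -> -E; each sweep costs O(n^3) flops,
        # dominated by the LU-based inversion of A_k.
        Ak = A.copy()
        for _ in range(maxit):
            lu = linalg.lu_factor(Ak)
            EAinv = linalg.lu_solve(lu, E.T, trans=1).T   # E @ inv(A_k)
            EAinvE = EAinv @ E
            ck = np.sqrt(linalg.norm(Ak, 'fro') / linalg.norm(EAinvE, 'fro'))
            Ak = 0.5 * (Ak / ck + ck * EAinvE)
            if linalg.norm(Ak + E, 1) < tau * linalg.norm(Ak, 1):
                break
        return Ak

In the low-rank version below, the factors S_k and R_k are updated alongside A_k and compressed at every step.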

Matrix sign function method

Low-rank factor version of the algorithm:

Algorithm 1 (CGCLNC):
 1: A_0 = A, S_0 = B^T, R_0 = C
 2: k = 0
 3: repeat
 4:   A_{k+1} = (A_k / c_k + c_k (E A_k^{-1}) E) / 2
 5:   Compute the rank-revealing QR (RRQR) decomposition
          [ S_k / √(2c_k) ;  √(c_k/2) S_k (E A_k^{-1})^T ] = Q_s [ U_s ; 0 ] Π_s
 6:   S_{k+1} ← U_s Π_s
 7:   Compute the rank-revealing QR (RRQR) decomposition
          [ R_k / √(2c_k) ;  √(c_k/2) (R_k A_k^{-1}) E ] = Q_r [ U_r ; 0 ] Π_r
 8:   R_{k+1} ← U_r Π_r
 9:   k = k + 1
10: until ‖A_k + E‖_1 < τ ‖A_k‖_1
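The compressions in steps 5-8 can be sketched with SciPy's column-pivoted QR standing in for a true RRQR; the helper name compress and the relative tolerance tol are illustrative assumptions:

    import numpy as np
    from scipy import linalg

    def compress(F, tol=1e-8):
        # Column-pivoted QR of the stacked factor F: F P = Q R, with the
        # diagonal of R non-increasing in magnitude.
        Q, R, perm = linalg.qr(F, mode='economic', pivoting=True)
        rank = int(np.sum(np.abs(np.diag(R)) > tol * abs(R[0, 0])))
        # Undo the pivoting so that F is approximated by Q[:, :rank] @ Rt.
        Rt = np.empty_like(R[:rank])
        Rt[:, perm] = R[:rank]
        return Rt   # plays the role of U_s Π_s (resp. U_r Π_r)

Step 6 then becomes, e.g., Sk1 = compress(np.vstack([Sk / np.sqrt(2*ck), np.sqrt(ck/2) * Sk @ EAinv.T])).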

Hybrid implementation

Computations performed at iteration j, each mapped to the most convenient device (CPU or GPU):

1. P A_j = L U  (*)
2. E A_j^{-1};  R_j A_j^{-1}
3. (E A_j^{-1}) E
4. Compute the rank-revealing QR (RRQR) decomposition
       [ S_j / √(2c_j) ;  √(c_j/2) S_j (E A_j^{-1})^T ] = Q_s [ U_s ; 0 ] Π_s
5. S_{j+1} ← U_s Π_s
6. Compute the rank-revealing QR (RRQR) decomposition
       [ R_j / √(2c_j) ;  √(c_j/2) (R_j A_j^{-1}) E ] = Q_r [ U_r ; 0 ] Π_r
7. R_{j+1} ← U_r Π_r
8. A_{j+1} = (A_j / c_j + c_j (E A_j^{-1}) E) / 2

(*) CPU and GPU cooperate during this operation.
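To make the CPU/GPU split concrete, here is a minimal sketch with CuPy (an assumption for illustration; the implementation presented here was built on MKL and CUBLAS, not CuPy). It assumes a CuPy build providing cupyx.scipy.linalg.lu_factor/lu_solve and reuses the compress helper sketched above: the O(n^3) kernels run on the GPU, while the comparatively small compressions stay on the CPU.

    import numpy as np
    import cupy as cp
    from cupyx.scipy import linalg as gpu_linalg

    def hybrid_step(Ak, E, Sk, ck, tol=1e-8):
        # GPU: LU factorization (op 1), triangular solves and products
        # (ops 2, 3, 8).
        dA, dE = cp.asarray(Ak), cp.asarray(E)
        lu, piv = gpu_linalg.lu_factor(dA)
        dEAinv = gpu_linalg.lu_solve((lu, piv), dE.T, trans=1).T  # E @ inv(A_k)
        dAnext = 0.5 * (dA / ck + ck * (dEAinv @ dE))

        # CPU: stack the scaled factor blocks and compress via RRQR
        # (ops 4-7, controllability factor only in this sketch).
        EAinv = cp.asnumpy(dEAinv)
        F = np.vstack([Sk / np.sqrt(2 * ck), np.sqrt(ck / 2) * Sk @ EAinv.T])
        Snext = compress(F, tol)
        return cp.asnumpy(dAnext), Snext

In the real implementation, the LU is computed cooperatively by CPU and GPU and the operations are mapped so as to minimize communication; this sketch only illustrates the device mapping.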

Outline

- Model reduction
- Model reduction via the BT method
- Matrix sign function method
- Numerical results

Hardware and software

Hardware:
- Platform consisting of two Intel Xeon QuadCore E5410 processors at 2.33 GHz, connected to an NVIDIA Tesla C1060 via a PCI-e bus.

Software:
- LAPACK (CPU): all computations are performed on the CPU using LAPACK and BLAS kernels (MKL v10.2).
- Hybrid (CPU+GPU): computations are executed on the most convenient architecture, minimizing communication (MKL v10.2 + CUBLAS v2.1).

Problem definition: Optimal cooling of steel profiles

Model STEEL I from the Oberwolfach benchmark collection:
- Arises in a manufacturing method for steel profiles.
- The objective is to design a control that yields moderate temperature gradients when the rail is cooled down.
- The model corresponds to a 2-D heat equation.
- Dimensions of the problem: n = 5177, m = 7, p = 6.
- Math. model: [Tröltzsch/Unger 1999/2001], [Penzl 1999] and [Saak 2003].
- Oberwolfach benchmark collection: http://www.imtek.de/simulation/benchmark/

Problem definition: Convective thermal flow problems

Model FLOW METER from the Oberwolfach benchmark collection:
- A 2-D model of an anemometer-like structure, mainly consisting of a tube and a small heat source.
- The model is given by a spatially semi-discretized instationary convection-diffusion equation.
- The reference temperature is set to 300 K; Dirichlet boundary conditions as well as initial conditions are set to 0 with respect to the reference.
- Dimensions of the problem: n = 9669, m = 1, p = 5.
- Math. model: [Harper 1997], [Ernst 2001] and [Mossmann 2004].
- Oberwolfach benchmark collection: http://www.imtek.de/simulation/benchmark/

Results for benchmark STEEL I

    #Iter. k | Hybrid (s) | LAPACK (s) | Conv. criterion ‖A_k + E‖_F
           1 |      2.958 |      5.337 | 8.153664e-02
           2 |      2.618 |      5.286 | 6.157084e-03
           3 |      2.650 |      5.354 | 1.103795e-03
           4 |      2.732 |      5.465 | 3.400846e-04
           5 |      2.955 |      5.638 | 1.088081e-04
           6 |      3.486 |      6.219 | 2.369416e-05
           7 |      3.946 |      6.553 | 2.551781e-06
           8 |      4.442 |      6.909 | 1.702591e-07
      total: |     25.787 |     46.761 |

The execution time is reduced by 45%. Problem dimensions: n = 5177, m = 7, p = 6.

Results for benchmark STEEL I (hybrid implementation)

Time (s) per operation:

    #Iter. k | P A_k = LU | E A_k^{-1}, R_k A_k^{-1} | (E A_k^{-1}) E | S_k(E A_k^{-1}), (R_k A_k^{-1})E, Compress | Iteration
           1 |      0.698 |                    1.041 |          0.807 |                                      0.121 |     2.958
           2 |      0.544 |                    1.023 |          0.788 |                                      0.047 |     2.618
           3 |      0.544 |                    1.023 |          0.788 |                                      0.079 |     2.650
           4 |      0.544 |                    1.023 |          0.788 |                                      0.159 |     2.732
           5 |      0.543 |                    1.023 |          0.789 |                                      0.381 |     2.955
           6 |      0.545 |                    1.023 |          0.788 |                                      0.909 |     3.486
           7 |      0.546 |                    1.022 |          0.789 |                                      1.366 |     3.946
           8 |      0.543 |                    1.023 |          0.788 |                                      1.866 |     4.442

    Accumulated time (s): 25.787

Problem dimensions: n = 5177, m = 7, p = 6.

Results for benchmark FLOW METER

    #Iter. k | Hybrid (s) | LAPACK (s) | Conv. criterion ‖A_k + E‖_F
           1 |     17.254 |     31.516 | 7.586531e+01
           2 |     16.861 |     31.580 | 7.447659e+00
           3 |     16.883 |     31.725 | 1.747226e+00
           4 |     16.961 |     31.970 | 5.521871e-01
           5 |     17.140 |     32.126 | 1.741928e-01
           6 |     17.454 |     32.329 | 5.558618e-01
           7 |     17.726 |     32.525 | 1.368278e-02
           8 |     17.831 |     32.842 | 1.876876e-03
           9 |     17.953 |     32.896 | 1.274213e-04
          10 |     18.016 |     32.997 | 1.592051e-06
          11 |     17.994 |     32.881 | 2.632143e-07
      total: |    192.217 |    355.387 |

The execution time is reduced by 46%. Problem dimensions: n = 9669, m = 1, p = 5.

Results for benchmark FLOW METER (hybrid implementation)

Time (s) per operation:

    #Iter. k | P A_k = LU | E A_k^{-1}, R_k A_k^{-1} | (E A_k^{-1}) E | S_k(E A_k^{-1}), (R_k A_k^{-1})E, Compress | Iteration
           1 |      3.380 |                    7.741 |          5.183 |                                      0.289 |    17.359
           2 |      2.906 |                    7.673 |          5.116 |                                      0.109 |    16.512
           3 |      2.918 |                    7.673 |          5.116 |                                      0.137 |    16.553
           4 |      2.888 |                    7.673 |          5.116 |                                      0.202 |    16.592
           5 |      3.007 |                    7.673 |          5.115 |                                      0.359 |    16.871
           6 |      2.893 |                    7.674 |          5.116 |                                      0.702 |    17.099
           7 |      2.886 |                    7.673 |          5.116 |                                      0.971 |    17.365
           8 |      2.890 |                    7.674 |          5.116 |                                      1.066 |    17.462
           9 |      2.893 |                    7.673 |          5.117 |                                      1.191 |    17.591

    Accumulated time (s): 192.217

Problem dimensions: n = 9669, m = 1, p = 5.

Thanks... Any questions?

Contact: remon@uji.es