Accelerating the Multifrontal Method
Information Sciences Institute, 22 June 2012
Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes
{rflucas,genew,ddavis}@isi.edu and grimes@lstc.com

3D Finite Element Test Problem

LS-DYNA Implicit, Half Million Elements

Timing information        CPU (seconds)    %CPU    Clock (seconds)   %Clock
----------------------------------------------------------------------------
Initialization             2.6513E+01      0.24      2.7598E+01       1.53
Element processing         8.1698E+01      0.73      7.6144E+01       4.22
Binary databases           2.0471E+00      0.02      4.8657E-01       0.03
ASCII database             1.3164E-04      0.00      1.5000E-05       0.00
Contact algorithm          2.9563E+01      0.26      5.7157E+00       0.32
  Interf. ID 1             7.6828E+00      0.07      1.2124E+00       0.07
  Interf. ID 2             4.8080E+00      0.04      9.7007E-01       0.05
  Interf. ID 3             4.4934E+00      0.04      9.8310E-01       0.05
  Interf. ID 4             5.5574E+00      0.05      1.2091E+00       0.07
  Interf. ID 5             7.0113E+00      0.06      1.3380E+00       0.07
Contact entities           0.0000E+00      0.00      0.0000E+00       0.00
Rigid bodies               8.5901E-04      0.00      3.1600E-04       0.00
Implicit Nonlinear         8.7434E+00      0.08      6.5756E+00       0.36
Implicit Lin. Alg.         1.1063E+04     98.67      1.6898E+03      93.55
----------------------------------------------------------------------------
Totals                     1.1212E+04    100.00      1.8063E+03     100.00

Half million elements, eight Intel Nehalem cores

Major software components

All of LS-DYNA
  Linear Solver
    Reordering (Metis) & symbolic factorization
    Redistribute & permute sparse matrix
    Factor, i.e., for all supernodes:
      Assemble the frontal matrix
      Factor the frontal matrix
      Stack its Schur complement
    Solve
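
The factor step above has a simple outer loop. A minimal sketch of that structure, with assemble_frontal_matrix, factor_frontal_matrix and stack_schur_complement as hypothetical placeholders for the three steps named on the slide (not LS-DYNA routines):

! Walk the supernodes in elimination-tree postorder; for each one,
! gather its frontal matrix, factor its pivot block, and save its
! Schur complement for the parent. The routines called here are
! hypothetical placeholders, not LS-DYNA internals.
subroutine factor_all_supernodes(nsuper)
  implicit none
  integer, intent(in) :: nsuper
  integer :: is
  do is = 1, nsuper
     call assemble_frontal_matrix(is)
     call factor_frontal_matrix(is)
     call stack_schur_complement(is)
  end do
end subroutine factor_all_supernodes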

Time in solver (sec), Eight Nehalem Threads

grep WCT mes0000 =>
  WCT: symbolic factorization =  18.952
  WCT: matrix redistribution  =   5.221
  WCT: factorization          = 795.354
  WCT: numeric solve          =   2.705
  WCT: total imfslv_mf2       = 826.735

  WCT: symbolic factorization =  16.560
  WCT: matrix redistribution  =   5.087
  WCT: factorization          = 804.471
  WCT: numeric solve          =   2.819
  WCT: total imfslv_mf2       = 829.281

Start by Accelerating Factorization

All of LS-DYNA
  Multifrontal Linear Solver
    Reordering (Metis) & symbolic factorization
    Redistribute & permute sparse matrix
    Factor, i.e., for all supernodes:
      Assemble the frontal matrix
      Factor the frontal matrix with the accelerator
      Stack its Schur complement
    Solve

Multifrontal Method Toy Problem

      do 4 k = 1, 9
         do 1 i = k + 1, 9
            a(i,k) = a(i,k) / a(k,k)
    1    continue
         do 3 j = k + 1, 9
            do 2 i = k + 1, 9
               a(i,j) = a(i,j) -
     1                  a(i,k) *
     2                  a(k,j)
    2       continue
    3    continue
    4 continue

[Figure: 9 x 9 sparsity pattern of the reordered toy matrix, rows listed in the order 1, 3, 2, 7, 9, 8, 4, 5, 6; X marks original nonzeros, * marks fill-in]

Multifrontal View of a Toy Matrix

[Figure: elimination tree for the toy problem, showing the index list of each frontal matrix]

Multifrontal Method: Duff and Reid, ACM TOMS 1983

A Real Problem: Hood. Automotive hood inner panel springback using LS-DYNA.

Hood Elimination Tree. Each frontal matrix's triangle is scaled by the operations required to factor it.

Exploiting Concurrency: Toy Problem with MPI & OpenMP

[Figure: the toy elimination tree partitioned between Processor 0 and Processor 1, with DOALL loops marking independent frontal matrices that can be factored concurrently]

Why Accelerators?

[Figure: Tesla vs. Nehalem performance comparison, courtesy of Stan Posey]

Similar expectations for Intel Phi.

Left-Looking Accelerator Factorization of a Frontal Matrix

  Assemble the frontal matrix on the host CPU
  Initialize by sending a panel of the assembled frontal matrix
  Only large frontal matrices are offloaded, due to the high cost of sending data to and from the accelerator
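
The last bullet is a flops-versus-traffic tradeoff. A rough, purely illustrative way to express it; the operation count and the 500 flop/byte cutoff below are assumptions, not LS-DYNA's actual heuristic:

! Estimate the flops of eliminating nelim pivots from an
! nfront x nfront front and the bytes moved to and from the
! accelerator, and offload only when the ratio is large.
! Both the 2*nelim*nfront**2 flop estimate and the 500 flop/byte
! cutoff are illustrative assumptions.
logical function offload_to_gpu(nfront, nelim)
  implicit none
  integer, intent(in) :: nfront, nelim
  double precision :: flops, bytes
  flops = 2.0d0 * dble(nelim) * dble(nfront)**2   ! rough partial-LU work
  bytes = 2.0d0 * 8.0d0 * dble(nfront)**2         ! send and return the front
  offload_to_gpu = (flops / bytes) > 500.0d0
end function offload_to_gpu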

Eliminate panels: factor the diagonal block

Eliminate panels: eliminate the off-diagonal panel

Fill Upper Triangle

Update Schur Complement

  Update panels with DGEMM: S = S - L * U
  Assumes the vendor math library DGEMM is extremely fast

[Figure: frontal matrix with the L and U panels and the Schur complement block S highlighted]
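
The update S = S - L * U maps onto a single BLAS call. A minimal wrapper showing that mapping, assuming column-major panels; the argument names are illustrative, not LS-DYNA's:

! S := S - L*U as one rank-k DGEMM, the operation each panel step
! applies to the trailing Schur complement: the m x k panel L times
! the k x n panel U updates the m x n block S.
subroutine schur_update(m, n, k, l, ldl, u, ldu, s, lds)
  implicit none
  integer, intent(in) :: m, n, k, ldl, ldu, lds
  double precision, intent(in)    :: l(ldl, *), u(ldu, *)
  double precision, intent(inout) :: s(lds, *)
  call dgemm('N', 'N', m, n, k, -1.0d0, l, ldl, u, ldu, 1.0d0, s, lds)
end subroutine schur_update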

Update Schur Complement

  Wider panels in the Schur complement: DGEMM is even faster
  S = S - L * U

[Figure: the same frontal matrix with a wider panel in the Schur complement]

Return Entire Frontal Matrix

  Return an error if a diagonal entry of 0.0 is encountered or the pivot threshold is exceeded
  Otherwise the complete frontal matrix is returned
  The Schur complement is added to the initial values on the host CPU
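
Putting the preceding slides together, the per-front kernel is a blocked partial LU: factor the diagonal block, eliminate the off-diagonal panels with triangular solves, and apply the DGEMM update, leaving the Schur complement in the trailing block. A plain LAPACK/BLAS sketch of that sequence follows (written right-looking for brevity, with pivoting confined to the diagonal block); it is not the accelerator kernel itself, and all names and blocking choices are illustrative:

! Partial LU of an nfront x nfront front f: eliminate the first nelim
! columns in panels of width nb, leaving the Schur complement in the
! trailing (nfront-nelim) x (nfront-nelim) block. Illustrative sketch
! using dgetrf/dlaswp/dtrsm/dgemm, not the LS-DYNA or GPU code.
subroutine factor_front(nfront, nelim, nb, f, ldf, info)
  implicit none
  integer, intent(in)  :: nfront, nelim, nb, ldf
  integer, intent(out) :: info
  double precision, intent(inout) :: f(ldf, nfront)
  integer :: j, jb, nright
  integer :: ipiv(nb)
  info = 0
  do j = 1, nelim, nb
     jb = min(nb, nelim - j + 1)
     nright = nfront - (j + jb) + 1
     ! factor the jb x jb diagonal block (pivoting confined to the block)
     call dgetrf(jb, jb, f(j, j), ldf, ipiv, info)
     if (info /= 0) return              ! zero pivot: report back to the host
     ! apply the block's row interchanges across the rest of the block row
     if (j > 1)      call dlaswp(j - 1,  f(j, 1),      ldf, 1, jb, ipiv, 1)
     if (nright > 0) call dlaswp(nright, f(j, j + jb), ldf, 1, jb, ipiv, 1)
     if (nright > 0) then
        ! eliminate the off-diagonal panels: L21 = A21*U11^-1, U12 = L11^-1*A12
        call dtrsm('R', 'U', 'N', 'N', nright, jb, 1.0d0, &
                   f(j, j), ldf, f(j + jb, j), ldf)
        call dtrsm('L', 'L', 'N', 'U', jb, nright, 1.0d0, &
                   f(j, j), ldf, f(j, j + jb), ldf)
        ! rank-jb Schur complement update (the DGEMM of the previous slides)
        call dgemm('N', 'N', nright, nright, jb, -1.0d0, f(j + jb, j), ldf, &
                   f(j, j + jb), ldf, 1.0d0, f(j + jb, j + jb), ldf)
     end if
  end do
end subroutine factor_front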

Panels Distributed with MPI

  Factor panels are broadcast and sent to the GPU
  The GPU eliminates the factor panel and performs large-rank updates to the Schur complement
  Load imbalance is the major bottleneck, so this needs revisiting

[Figure: frontal matrix panels labeled with owning MPI ranks (0 0 0 1 1 1); '~100' annotation from the original slide]
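
A hedged sketch of the broadcast step only, assuming the factor panel is contiguous in memory; the buffer, owner and communicator arguments are illustrative, not the LS-DYNA data structures:

! The rank that owns the current factor panel broadcasts it to the
! other ranks, which can then forward their copies to their GPUs.
! Only the MPI step is shown.
subroutine broadcast_factor_panel(panel, npanel, owner, comm)
  use mpi
  implicit none
  integer, intent(in) :: npanel, owner, comm
  double precision, intent(inout) :: panel(npanel)
  integer :: ierr
  call MPI_Bcast(panel, npanel, MPI_DOUBLE_PRECISION, owner, comm, ierr)
end subroutine broadcast_factor_panel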

OpenMP on MIC (and on the Xeon host)

Original loop nest:

      do j = jl, jr
        do i = jl + 1, ld
          x = 0.0
          do k = jl, j - 1
            x = x + s(i, k) * s(k, j)
          end do
          s(i, j) = s(i, j) - x
        end do
      end do

OpenMP version, blocked so that each thread updates a different band of rows:

      sq = jr - jl + 1
c$omp parallel do
c$omp& shared (jl, jr, sq, ld, s)
c$omp& private (ii, j, i, il, ir, k, x)
      do ii = jl + 1, ld, sq
        do j = jl, jr
          il = ii
          ir = min(ii + sq - 1, ld)
          do i = il, ir
            x = 0.0
            do k = jl, j - 1
              x = x + s(i, k) * s(k, j)
            end do
            s(i, j) = s(i, j) - x
          end do
        end do
      end do

Fortran vs. CUDA

Fortran:

      do j = jl, jr
        do i = jr + 1, ld
          x = 0.0
          do k = jl, j - 1
            x = x + s(i, k) * s(k, j)
          end do
          s(i, j) = s(i, j) - x
        end do
      end do

CUDA:

    ip = 0;
    for (j = jl; j <= jr; j++) {
        if (ltid <= (j - 1) - jl) {
            gpulskj(ip + ltid) = s[idxs(jl + ltid, j)];
        }
        ip = ip + (j - 1) - jl + 1;
    }
    __syncthreads();
    for (i = jr + 1 + tid; i <= ld; i += GPUL_THREAD_COUNT) {
        for (j = jl; j <= jr; j++) {
            gpuls(j - jl, ltid) = s[idxs(i, j)];
        }
        ip = 0;
        for (j = jl; j <= jr; j++) {
            x = 0.0f;
            for (k = jl; k <= (j - 1); k++) {
                x = x + gpuls(k - jl, ltid) * gpulskj(ip);
                ip = ip + 1;
            }
            gpuls(j - jl, ltid) -= x;
        }
        for (j = jl; j <= jr; j++) {
            s[idxs(i, j)] = gpuls(j - jl, ltid);
        }
    }

Factorization Performance on Individual Frontal Matrices

  Construct a family of model frontal matrices (size by 5 * size)
  Eliminate size equations from 5X their number
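
A possible driver for such a model-front benchmark, reusing the factor_front sketch shown earlier; the sizes, the diagonal shift, and the flop estimate are illustrative assumptions, not the talk's benchmark code:

! Build a random front of order 5*nelim with a heavy diagonal,
! eliminate the first nelim equations with the factor_front sketch,
! and report an approximate factorization rate.
program model_front_bench
  implicit none
  integer, parameter :: nelim = 1000, nfront = 5 * nelim, nb = 128
  double precision, allocatable :: f(:, :)
  double precision :: flops, seconds
  integer :: info, i, c0, c1, crate
  allocate(f(nfront, nfront))
  call random_number(f)
  do i = 1, nfront
     f(i, i) = f(i, i) + dble(nfront)    ! keep the pivot block well behaved
  end do
  call system_clock(c0, crate)
  call factor_front(nfront, nelim, nb, f, nfront, info)
  call system_clock(c1)
  seconds = dble(c1 - c0) / dble(crate)
  flops = 2.0d0 * dble(nelim) * dble(nfront)**2    ! rough operation count
  print '(a, i0, a, f8.2, a)', 'info = ', info, '   approx. rate = ', &
        flops / seconds / 1.0d9, ' Gflop/s'
end program model_front_bench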

Multithreaded Host Throughput: Factoring a Model Frontal Matrix (Intel Nehalem)

Host Throughput with MPI: Factoring a Model Frontal Matrix. Get MPI results from giant. For larger frontal matrices, XX Gflop/s have been observed.

Hybrid (MPI + OpenMP) Throughput: Factoring a Model Frontal Matrix. Get hybrid results from giant. For larger frontal matrices, XX Gflop/s have been observed.

Nehalem & Tesla Throughput: Factoring a Model Frontal Matrix. For larger frontal matrices, >250 Gflop/s have been observed.

Ivy Bridge & Accelerator: Factoring a Model Frontal Matrix

GPUs Get the Larger Supernodes: shown for Hood's tree and the three-cylinders benchmark.

Relative Time (sec): nested cylinders benchmarks

Relative Performance (GFlop/s): nested cylinders benchmarks

Work in Progress

Near Term
  Further accelerator performance optimization
  Better MPI + accelerators
  Pivoting on accelerators for numerical stability

Longer Term
  Factor many small frontal matrices on the accelerators
  Assemble initial values
  Maintain real stack
  If the entire matrix fits on the accelerator:
    Forward and back solves
    Exploit GDRAM memory bandwidth

Research Initially Funded by US JFCOM and AFRL

This material is based on research sponsored by the U.S. Joint Forces Command via a contract with the Lockheed Martin Corporation and SimIS, Inc., and on research sponsored by the Air Force Research Laboratory under agreement numbers F30602-02-C-0213 and FA8750-05-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government. Approved for public release; distribution is unlimited.

Bonus Slides

GPU Architecture

  Multiple SIMD cores, multithreaded: O(1000) threads per GPU
  Banked shared memory: 32 KB (Tesla), 48 KB (Fermi)
  Simple thread model: global synchronization only at the host

Courtesy NVIDIA

Fortran vs. CUDA (repeat of the earlier code comparison slide)

GPU gets larger Supernodes

Multicore Performance (i4r4) vs. the Elimination Tree

Relative Performance (sec): half-million elements, in-core

MPI, OpenMP, GPU        1st Factorization   LS-DYNA
8 threads                             795      1806
8 threads + GPU                       283       775
2 MPI x 4 threads                     772      1659
2 MPI x 4 + 2 GPU                     211       537

Nested cylinders benchmark

Relative Performance (sec): million elements, out-of-core

MPI, OpenMP, GPU        1st Factorization   LS-DYNA
8 threads                            6131     15562
8 threads + GPU                      2828      9143
2 MPI x 4 threads                    6248     15287
2 MPI x 4 + 2 GPU                    2567      8214

Nested cylinders benchmark

Time in out-of-core solver (sec), Eight threads plus GPU

grep WCT mes0000 =>
  WCT: symbolic factorization =   51.764
  WCT: matrix redistribution  =   10.875
  WCT: factorization          = 2828.933
  WCT: numeric solve          = 1409.523
  WCT: total imfslv_mf2       = 4316.063

  WCT: symbolic factorization =   40.244
  WCT: matrix redistribution  =   10.685
  WCT: factorization          = 2679.422
  WCT: numeric solve          = 1543.953
  WCT: total imfslv_mf2       = 4346.260

Note: almost half the elapsed time for LS-DYNA (~4400 sec. out of 9143) is spent in disk I/O.