Sparse LU Factorization on GPUs for Accelerating SPICE Simulation


Nano-scale Integrated Circuit and System (NICS) Laboratory. Sparse LU Factorization on GPUs for Accelerating SPICE Simulation. Xiaoming Chen, PhD Candidate, Department of Electronic Engineering, Tsinghua University, China. Email: chenxm05@mails.tsinghua.edu.cn

Outline: Introduction, Our Solution, Results, Conclusions

Outline: Introduction (What is SPICE? Why do we need to accelerate SPICE? Related work; Left-looking algorithm), Our Solution, Results, Conclusions

What is SPICE? Simulation Program with Integrated Circuit Emphasis: a transistor-level integrated circuit simulator. The SPICE kernel was developed by the University of California, Berkeley in 1975. Commercial implementations include HSPICE (Synopsys), PSpice (Cadence), Spectre (Cadence), and Eldo (Mentor Graphics).

What is SPICE? SPICE workflow example: the input circuit is turned into a system of nodal equations (a conductance matrix acting on the node voltages e1, e2, e3, with the current sources on the right-hand side). Solving this system gives the DC simulation output or, repeated over time steps, the transient simulation output. A reconstruction of the example system follows below.
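A plausible reconstruction of the three-node nodal system shown on this slide (the exact matrix entries are inferred from the G1..G4, e1..e3 and I0 symbols visible on the slide, so treat them as an illustration rather than the verbatim example):

```latex
\[
\begin{bmatrix}
  G_1  & -G_1              & 0          \\
 -G_1  &  G_1 + G_2 + G_3  & -G_3       \\
  0    & -G_3              &  G_3 + G_4
\end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ e_3 \end{bmatrix}
=
\begin{bmatrix} I_0 \\ 0 \\ 0 \end{bmatrix}
\]
```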

Why do we need to accelerate SPICE? The transient simulation flow: read the netlist and build the matrix; pre-process the matrix; then, for each time node, run Newton-Raphson iterations, each consisting of model evaluation, sparse LU factorization (A = LU), right-hand-side solving (Ly = b, Ux = y), i.e. solving Ax = b, and an iteration update, repeated until convergence; when the last time node is reached, write the output. SPICE is very time-consuming, and the reason is the iterations: a pre-layout simulation of a modern analog-to-digital or digital-to-analog converter can take several days, and a post-layout one can take weeks. A sketch of this loop structure follows below.
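To make the nested iteration structure concrete, here is a minimal host-side sketch of the transient loop described above; every function name is an illustrative, declaration-only placeholder rather than the presenter's code:

```cuda
// Sketch of the SPICE transient simulation flow (host side); all helpers are placeholders.
void read_netlist_and_build_matrix();
void preprocess_matrix();
void evaluate_models();           // fill A and b from the device models
void sparse_lu_factorize();       // A = L * U  (the expensive step)
void solve_rhs();                 // L y = b, then U x = y
bool newton_step_converged();     // iteration update + convergence check
bool last_time_node_reached();
void write_output();

void transient_simulation() {
    read_netlist_and_build_matrix();
    preprocess_matrix();
    do {                                      // loop over time nodes
        do {                                  // Newton-Raphson iterations
            evaluate_models();
            sparse_lu_factorize();
            solve_rhs();
        } while (!newton_step_converged());
    } while (!last_time_node_reached());
    write_output();
}
```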

Why do we need to accelerate SPICE? Within this flow, the model evaluations are independent of each other, so their parallelization is straightforward on both CPUs and GPUs. The circuit matrices, however, are unsymmetric, highly sparse, irregular in their nonzero pattern, and ill-conditioned, and solving the linear system costs 1/4 to 3/4 of the total time. It is the bottleneck of SPICE.

Related work: SPARSE (not dense) DIRECT (not iterative) solvers on GPUs.
[Christen, GPGPU 07]: unsymmetric, single precision, about 2X vs. sequential PARDISO.
[Krawezik, SAAHPC 09]: symmetric, double precision, 2.9X vs. 2-threaded ANSYS.
[Yu, Para. Comp. 2011]: unsymmetric, double precision, about 2.5X vs. sequential UMFPACK.
[George, IPDPS 11]: symmetric, single precision, 7X vs. sequential (double-precision) WSMP; 15X (2 GPUs) vs. sequential (double-precision) WSMP.
[Lucas, Tech. Rep. 2012]: symmetric, single precision, 1.97X vs. sequential CPU code.
[Lucas, HPCCS 11]: symmetric, single precision, 5.91X vs. sequential CPU code; 1.34X vs. 8-threaded CPU code.
[Lacoste, Tech. Rep. 2012]: symmetric/unsymmetric, single/double precision, 2-3X vs. CPU code; 3-4X (2 GPUs) vs. CPU code.
[Hogg, Slides 2014]: symmetric, unknown precision, 4.1X vs. CPU code.
[Kim, IPDPSW 13]: unsymmetric, double precision, 15.31X (2 GPUs) vs. sequential CPU code; 1.69X (2 GPUs) vs. 12-threaded code.

Related work: all of these methods use supernodal or multifrontal algorithms, which rely on CUDA BLAS or other dense kernels. Circuit matrices are so sparse that BLAS-based methods are usually inefficient for them: for circuit matrices, the ratio flops / NZ(L+U-I) is usually less than 200.

Left-looking algorithm. Left-looking means using the already-factorized columns on the left to perform the numeric update of the current column. The core operation is a vector multiply-and-add (MAD): read a finished column b and a scalar c, update the current column a as a := a - c*b, and write it back. [Figure: the columns involved in one update, with the read and write regions marked]

[Slides 11-14: step-by-step animation of the same left-looking update a := a - c*b, repeated for each already-finished column to the left of the current column; a sequential reference sketch of this update follows below]
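As a sequential reference for what each warp will later compute, here is a minimal sketch of the left-looking factorization of one column for a CSC (compressed sparse column) layout. It assumes the symbolic patterns of L and U are already known, row indices inside each column are sorted ascending, the dense work vector x is all-zero on entry, and pivoting is omitted (the GPU re-factorization keeps a fixed pivot order); the array names are illustrative, not the presenter's data structures.

```cuda
#include <vector>

// Left-looking update of column j (sequential reference sketch, no pivoting).
// A, L, U are in CSC form; L stores only its strictly lower part (unit diagonal implied);
// x is a dense work vector of length n that is all-zero on entry.
void factorize_column(int j,
                      const std::vector<int>& a_colptr, const std::vector<int>& a_rows,
                      const std::vector<double>& a_vals,
                      const std::vector<int>& l_colptr, const std::vector<int>& l_rows,
                      std::vector<double>& l_vals,
                      const std::vector<int>& u_colptr, const std::vector<int>& u_rows,
                      std::vector<double>& u_vals,
                      std::vector<double>& x) {
    // Scatter column j of A into the dense work vector.
    for (int p = a_colptr[j]; p < a_colptr[j + 1]; ++p)
        x[a_rows[p]] = a_vals[p];

    // For every nonzero U(k, j) with k < j (visited in ascending k, hence the sorted rows),
    // apply the MAD from the slide: x := x - x[k] * L(:, k), using the finished column k.
    for (int p = u_colptr[j]; p < u_colptr[j + 1]; ++p) {
        int k = u_rows[p];
        if (k == j) continue;                   // skip the diagonal entry U(j, j)
        double c = x[k];                        // this is already the final U(k, j)
        for (int q = l_colptr[k]; q < l_colptr[k + 1]; ++q)
            x[l_rows[q]] -= c * l_vals[q];      // a := a - c * b
    }

    // Gather back: rows <= j form U(:, j); rows > j form L(:, j), scaled by the pivot U(j, j).
    for (int p = u_colptr[j]; p < u_colptr[j + 1]; ++p)
        u_vals[p] = x[u_rows[p]];
    double pivot = x[j];
    for (int p = l_colptr[j]; p < l_colptr[j + 1]; ++p)
        l_vals[p] = x[l_rows[p]] / pivot;
    // (Resetting the touched entries of x to zero is omitted for brevity.)
}
```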

Outline: Introduction, Our Solution (Overview; Parallelization strategies of LU factorization; Workflow; Optimization), Results, Conclusions

Overview. The GPU repeats LU factorization without pivoting across the circuit simulation iterations: the nonzero patterns of L and U are fixed, the pivoting order is also fixed, and one warp computes one column in a SIMD fashion. Ahead of the GPU iterations, the CPU computes one LU factorization with pivoting to obtain the nonzero patterns of L and U, obtain the pivoting order, and allocate the GPU memory for L and U.

Parallelization strategies of LU factorization. Step 1: construct the task graph (data dependence graph). In the left-looking algorithm, the column-level data dependence is determined by the nonzero pattern of U: column j depends on column k when U(k, j) is nonzero. [Figure: a 10x10 upper triangular matrix U and the corresponding task graph over its 10 columns] A sketch of this construction follows below.
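The dependence rule behind this graph is that the left-looking update of column j reads column k of L exactly when U(k, j) is nonzero. A minimal sketch of extracting these edges from the CSC pattern of U (illustrative names, not the presenter's code):

```cuda
#include <vector>

// Build the column-level task graph from the nonzero pattern of U (CSC).
// Column j depends on column k (k < j) exactly when U(k, j) != 0,
// because the left-looking update of column j reads column k of L.
std::vector<std::vector<int>> build_task_graph(int n,
                                               const std::vector<int>& u_colptr,
                                               const std::vector<int>& u_rows) {
    std::vector<std::vector<int>> deps(n);        // deps[j] = columns that j must wait for
    for (int j = 0; j < n; ++j)
        for (int p = u_colptr[j]; p < u_colptr[j + 1]; ++p) {
            int k = u_rows[p];
            if (k != j) deps[j].push_back(k);     // skip the diagonal U(j, j)
        }
    return deps;
}
```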

Parallelization strategies of LU factorization. Step 2: partition the graph into levels. Level definition: the longest distance from any source node to the node. Task nodes in the same level are completely independent. [Figure: the example task graph partitioned into levels; level 0: columns 1, 2, 3, 7; level 1: columns 4, 5, 6; level 2: column 8; level 3: column 9; level 4: column 10] A sketch of the level computation follows below.
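Because every dependence edge k -> j has k < j, the "longest distance from a source" level of each column can be computed in a single pass over the columns in index order. A minimal sketch, reusing the deps structure from the previous snippet:

```cuda
#include <algorithm>
#include <vector>

// level[j] = longest path (in edges) from any source column to column j.
// Columns can be processed in index order because each dependence k -> j has k < j.
std::vector<int> compute_levels(const std::vector<std::vector<int>>& deps) {
    int n = static_cast<int>(deps.size());
    std::vector<int> level(n, 0);
    for (int j = 0; j < n; ++j)
        for (int k : deps[j])
            level[j] = std::max(level[j], level[k] + 1);
    return level;
}
```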

Parallelization strategies of LU factorization. Step 3: levels that contain many task nodes are computed in parallel with the cluster mode parallel algorithm: the workload of a level is assigned equally to the warps, and the warps synchronize between levels. [Figure: the level table from Step 2, with the columns of each large level split between warp 1 and warp 2] A sketch of the warp-to-column mapping follows below.
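A minimal sketch of how cluster mode can map one warp to one column inside a level; the actual per-column factorization is reduced to a placeholder comment, and the kernel and array names (cluster_mode_level, level_cols) are illustrative assumptions:

```cuda
// Cluster mode (sketch): the columns of one level are independent, so they are spread
// evenly over all warps of the grid; one warp factorizes one column at a time.
__global__ void cluster_mode_level(const int* level_cols, int num_cols) {
    int warp_id   = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int num_warps = (gridDim.x * blockDim.x) / 32;
    for (int t = warp_id; t < num_cols; t += num_warps) {
        int col = level_cols[t];
        (void)col;  // the 32 lanes of the warp would cooperatively factorize column 'col' here
    }
}
// Host side: launch once per large level; the kernel boundary provides the level
// synchronization, e.g. cluster_mode_level<<<grid, block>>>(d_level_cols, num_cols);
```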

Parallelization strategies of LU factorization. Step 4: compute the other levels in parallel in a pipeline-like fashion. Example: two warps are computing nodes 8 and 9 simultaneously. Warp 2 first uses nodes 6, 4 and 7 to update node 9, but then it must wait until node 8 finishes. [Figure: task graph with warp 1 working on node 8 and warp 2 stalled on node 9]

Parallelization strategies of LU factorization. Step 4 (continued): when warp 1 finishes node 8, it begins to compute node 10. Warp 1 first uses nodes 3 and 8 to update node 10; at the same time, warp 2 uses node 8 to update node 9. Warp 1 must then wait until node 9 finishes. [Figure: task graph with warp 2 working on node 9 and warp 1 stalled on node 10]

Parallelization strategies of LU factorization. Pipeline mode parallel algorithm. [Figure: pseudocode of the pipeline mode algorithm] A sketch of the warp-level synchronization it relies on follows below.
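Pipeline mode needs finer-grained synchronization than one kernel launch per level: a warp that wants to use column k must wait until k is flagged as finished, and a warp that finishes its own column must publish that fact. A minimal sketch of these two device-side helpers (the column_done flag array and helper names are assumptions for illustration; a real implementation also has to integrate this with the column update and the work assignment):

```cuda
// Spin-wait until column k has been published as finished (pipeline mode sketch).
__device__ void wait_for_column(volatile const int* column_done, int k) {
    if (threadIdx.x % 32 == 0)                  // one lane polls while the warp stays together
        while (column_done[k] == 0) { /* spin */ }
    __syncwarp();                               // the other lanes wait here for lane 0
}

// Publish the current column as finished so that waiting warps may consume it.
__device__ void mark_column_done(volatile int* column_done, int col) {
    __threadfence();                            // make the column's L/U values visible first
    if (threadIdx.x % 32 == 0)
        column_done[col] = 1;
    __syncwarp();
}
```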

Workflow: a hybrid CPU-GPU workflow. [Figure: workflow diagram of the CPU pre-processing stage and the repeated GPU re-factorizations] A host-side sketch follows below.
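A host-side sketch of the hybrid workflow: the CPU performs one full factorization with pivoting (which fixes the patterns and the pivot order), and every subsequent Newton-Raphson iteration only re-factorizes numerically on the GPU. All helpers are declaration-only placeholders, not the presenter's API:

```cuda
// Hybrid CPU-GPU workflow (sketch); every helper below is a hypothetical placeholder.
void cpu_lu_with_pivoting();        // full factorization with pivoting: patterns + pivot order
void allocate_device_factors();     // cudaMalloc for A, L, U values and the scheduling data
void upload_matrix_values();        // cudaMemcpy of the new A values for this iteration
void launch_cluster_mode_levels();  // one kernel launch per large level (cluster mode)
void launch_pipeline_mode();        // one kernel for the remaining levels (pipeline mode)
void solve_and_download();          // Ly = b, Ux = y, then copy x back to the host
bool newton_converged();

void run_newton_iterations() {
    cpu_lu_with_pivoting();                 // done once, ahead of the GPU iterations
    allocate_device_factors();
    do {
        // model evaluation refills A with new values (same nonzero pattern, same pivot order)
        upload_matrix_values();
        launch_cluster_mode_levels();
        launch_pipeline_mode();
        solve_and_download();
    } while (!newton_converged());
}
```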

Optimization: memory access coalescing. Sorting the nonzeros in A, L and U by their row indices improves data locality and gives more than a 2X performance gain. A sketch of the sorting step follows below.
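A minimal sketch of the sorting behind this optimization: within each column of a CSC matrix, the (row index, value) pairs are reordered by ascending row index, so that the lanes of a warp read and write consecutive addresses. This is a one-time CPU-side step; the array names are illustrative:

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Sort the nonzeros of each column of a CSC matrix by row index (done once on the CPU).
void sort_columns_by_row(int n, const std::vector<int>& colptr,
                         std::vector<int>& rows, std::vector<double>& vals) {
    for (int j = 0; j < n; ++j) {
        int begin = colptr[j], end = colptr[j + 1], len = end - begin;
        std::vector<int> perm(len);
        std::iota(perm.begin(), perm.end(), 0);
        std::sort(perm.begin(), perm.end(),
                  [&](int a, int b) { return rows[begin + a] < rows[begin + b]; });
        std::vector<int> r(len);
        std::vector<double> v(len);
        for (int i = 0; i < len; ++i) {
            r[i] = rows[begin + perm[i]];
            v[i] = vals[begin + perm[i]];
        }
        std::copy(r.begin(), r.end(), rows.begin() + begin);
        std::copy(v.begin(), v.end(), vals.begin() + begin);
    }
}
```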

Outline: Introduction, Our Solution, Results (GPU performance; comparison with KLU; comparison with PARDISO), Conclusions

Results: experimental environment. Host: Intel Xeon E5-2690 with 128 GB main memory. Device: NVIDIA Tesla K40 (ECC off). KLU: a sparse solver specially designed for circuit simulation problems; it has only a sequential version. PARDISO: a parallel sparse solver for generic sparse matrices, from Intel MKL, built on BLAS and LAPACK. The benchmarks are from the University of Florida Sparse Matrix Collection.

Results: performance. [Chart: achieved Gflop/s per benchmark (add32, hcircuit, rajat23, add20, memplus, bcircuit, circuit5m, scircuit, rajat15, rajat24, asic_680k, onetone2, asic_680ks, freescale1, asic_320k, rajat25, asic_100k, asic_320ks, asic_100ks, circuit5m_dc, onetone1, g2_circuit, twotone), y-axis 0-14 Gflop/s] For reference, the K40 peak performance is 1.43 Tflop/s in double precision; sparse LU factorization is a memory-intensive problem.

Results: global memory bandwidth utilization. [Chart: achieved GB/s on the same benchmarks, y-axis 0-180 GB/s] The K40 peak bandwidth (ECC off) is 288 GB/s.

Results: comparison with KLU. Average speedup vs. KLU: 8.4X (3.0X geometric mean). GPU time = LU factorization + substitution + data transfer; CPU time = LU factorization + substitution. [Chart: per-benchmark speedups, y-axis 0-10, with off-scale values of 39.2, 10.5, 12.8, 40.7 and 47.2]

Results: comparison with KLU, factorization only. Average speedup vs. KLU: 9.4X (3.6X geometric mean). GPU time = LU factorization; CPU time = LU factorization. [Chart: per-benchmark speedups, y-axis 0-10, with off-scale values of 42.9, 11.5, 15.4, 45.0 and 50.5]

Results: comparison with PARDISO. Average speedup vs. PARDISO (T=1): 10.6X (4.2X geometric mean). GPU time = LU factorization + substitution + data transfer; CPU time = LU factorization + substitution. [Chart: per-benchmark speedups, y-axis 0-10, with off-scale values of 38.4, 24.8 and 105.6]

Results: comparison with PARDISO. Average speedup vs. PARDISO (T=8): 2.7X (1.3X geometric mean). GPU time = LU factorization + substitution + data transfer; CPU time = LU factorization + substitution. [Chart: per-benchmark speedups, y-axis 0-10, with an off-scale value of 16.4]

Outline: Introduction, Our Solution, Results, Conclusions

Conclusions. We presented a GPU-accelerated sparse LU factorization for circuit simulation problems. The nonzero patterns of L and U are obtained by the CPU, and the GPU repeats LU factorizations with the same nonzero pattern but different element values. A hybrid two-mode (cluster/pipeline) scheduling scheme fits the different data dependences, and no dense kernels are used. Performance: 8.4X vs. KLU, 10.6X vs. PARDISO (T=1), and 2.7X vs. PARDISO (T=8).

Publications:
[IEEE TPDS] Xiaoming Chen, Ling Ren, Yu Wang, Huazhong Yang, "GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling," IEEE Transactions on Parallel and Distributed Systems (TPDS), accepted.
[DAC 12] Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang, "Sparse LU factorization for parallel circuit simulation on GPU," Design Automation Conference (DAC), Jun. 2012, pp. 1125-1130.

Acknowledgement: we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU.

Thanks for your attention! Questions? Visit http://nicslu.weebly.com for more information about our work on sparse LU factorization for circuit simulation.