Parallel Transposition of Sparse Data Structures


Parallel Transposition of Sparse Data Structures
Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng
Department of Computer Science, Virginia Tech
Niels Bohr Institute, University of Copenhagen
Scientific Computing Department, STFC Rutherford Appleton Laboratory

Sparse Matrix
If most elements in a matrix are zeros, we can use sparse representations to store the matrix.

Example matrix (4 x 6, 15 nonzeros):

    A = | 0 a 0 b 0 0 |
        | c d e f 0 0 |
        | 0 0 g h i j |
        | 0 k l m n p |

Compressed Sparse Row (CSR):
    csrRowPtr = 0 2 6 10 15
    csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
    csrVal    = a b c d e f g h i j k l m n p

Compressed Sparse Column (CSC):
    cscColPtr = 0 1 4 7 11 13 15
    cscRowIdx = 1 0 1 3 1 2 3 0 1 2 3 2 3 2 3
    cscVal    = c a d k e g l b f h m i n j p
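A minimal C++ sketch of the two containers (struct and field names are illustrative, not from the paper's code):

    #include <vector>

    // Compressed Sparse Row: row pointers, column indices, nonzero values.
    struct CSR {
        int m = 0, n = 0;            // rows, columns
        std::vector<int> rowPtr;     // size m + 1; rowPtr[m] == nnz
        std::vector<int> colIdx;     // size nnz
        std::vector<float> val;      // size nnz
    };

    // Compressed Sparse Column: column pointers, row indices, nonzero values.
    struct CSC {
        int m = 0, n = 0;
        std::vector<int> colPtr;     // size n + 1; colPtr[n] == nnz
        std::vector<int> rowIdx;     // size nnz
        std::vector<float> val;      // size nnz
    };

    // The 4 x 6 example matrix A from the slide in CSR form:
    //   rowPtr = {0, 2, 6, 10, 15}
    //   colIdx = {1, 3, 0, 1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4, 5}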

Sparse Matrix Transposition
Sparse matrix transposition from A to A^T is an indispensable building block for higher-level algorithms.

    A = | 0 a 0 b 0 0 |        A^T = | 0 c 0 0 |
        | c d e f 0 0 |              | a d 0 k |
        | 0 0 g h i j |              | 0 e g l |
        | 0 k l m n p |              | b f h m |
                                     | 0 0 i n |
                                     | 0 0 j p |

Transposing A stored in CSR is equivalent to converting A from CSR to CSC: the CSC arrays of A are exactly the CSR arrays of A^T.

    CSR of A:                  csrRowPtr = 0 2 6 10 15
                               csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
                               csrVal    = a b c d e f g h i j k l m n p

    CSC of A (= CSR of A^T):   cscColPtr = 0 1 4 7 11 13 15
                               cscRowIdx = 1 0 1 3 1 2 3 0 1 2 3 2 3 2 3
                               cscVal    = c a d k e g l b f h m i n j p

Sparse Matrix Transposition
Sparse transposition has not received the same attention as other sparse linear algebra kernels, e.g., SpMV and SpGEMM. Two common assumptions:
- Transpose A to A^T once, then use A^T many times
- Sparse transposition is fast enough on modern architectures
These assumptions are not always true!

Driving Cases
Sparse transposition can sit inside the main loop (K-truss, Simultaneous Localization and Mapping (SLAM)), or it may occupy a significant percentage of the execution time (Strongly Connected Components (SCC)).

    Example     | Description                                                                | Class                      | Building blocks
    K-truss [1] | Detect subgraphs where each edge is part of at least k-2 triangles         | Graph algorithm            | SpMV, SpGEMM, Transposition
    SLAM [2]    | Update the information matrix of an autonomous robot trajectory           | Motion planning algorithm  | SpGEMM, Transposition
    SCC [3]     | Detect components where every vertex is reachable from every other vertex | Graph algorithm            | Transposition, Set operations

[1] J. Wang and J. Cheng, Truss Decomposition in Massive Networks, PVLDB 5(9):812-823, 2012
[2] F. Dellaert and M. Kaess, Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing, IJRR 25(12):1181-1203, 2006
[3] S. Hong, et al., On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs, SC '13, 2013

Motivation
SpTrans and SpGEMM from Intel MKL Sparse BLAS:
- SpGEMM, no transposition: C1 = A * A
- SpGEMM_T with implicit transposition*: C2 = trans(A) * A
- SpGEMM_T with explicit transposition: B = trans(A), then C2 = B * A

Experiment setup:
1. Sparse matrix: web-google
2. Intel Xeon (Haswell) CPU with 28 cores
3. SpGEMM is iterated only once

[Figure: execution time (s) of one SpGEMM for SpTrans, SpGEMM, SpGEMM (implicit), and SpGEMM (explicit) on 1 to 28 cores]

Observations:
1. SpTrans and SpGEMM (implicit) did not scale well
2. The time spent on SpTrans was close to that of SpGEMM once multiple cores were used

* Implicit transposition takes A as the input but passes a hint so that the higher-level computation operates on A^T. Supported by Intel.

Outline
- Background and Motivation
- Existing Methods: Atomic-based, Sorting-based
- Designs: ScanTrans, MergeTrans
- Experimental Results
- Conclusions

Atomic-based Transposition

    A = | 0 a 0 b 0 0 |
        | c d e f 0 0 |
        | 0 0 g h i j |
        | 0 k l m n p |

1. For each nonzero element, compute its offset within its column (stored in an auxiliary array loc) and count how many nonzero elements fall in each column, using the atomic operation fetch_and_add()
2. Use a prefix-sum over the per-column counts to compute the start pointer of each column, i.e., cscColPtr = 0 1 4 7 11 13 15
3. Scan the CSR arrays again to compute the position of each nonzero element in cscRowIdx and cscVal, and move it there
4. An additional step, a segmented sort, may be required to guarantee the row order within each column

Because threads race on the atomic counters, the offset of element f within column 3 can be 0, 1, 2, or 3 (ideally 1, giving final position 7 + 1 = 8), so its final position can be 7, 8, 9, or 10; this is why the segmented sort in step 4 is needed.

Result: cscVal = c a d k e g l b f h m i n j p
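A minimal OpenMP sketch of the atomic-based approach (array and function names are illustrative; the segmented-sort step is omitted):

    #include <vector>

    // Atomic-based CSR -> CSC transposition (steps 1-3 only; a segmented sort
    // per column would still be needed to restore row order).
    void atomic_transpose(int m, int n,
                          const std::vector<int>& csrRowPtr,
                          const std::vector<int>& csrColIdx,
                          const std::vector<float>& csrVal,
                          std::vector<int>& cscColPtr,
                          std::vector<int>& cscRowIdx,
                          std::vector<float>& cscVal) {
        int nnz = csrRowPtr[m];
        std::vector<int> count(n, 0);   // nonzeros per column
        std::vector<int> loc(nnz);      // offset of each nonzero within its column

        // Step 1: per-column histogram with atomic fetch_and_add.
        #pragma omp parallel for
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j)
                loc[j] = __sync_fetch_and_add(&count[csrColIdx[j]], 1);

        // Step 2: prefix-sum of the counts gives cscColPtr.
        cscColPtr.assign(n + 1, 0);
        for (int c = 0; c < n; ++c)
            cscColPtr[c + 1] = cscColPtr[c] + count[c];

        // Step 3: scatter each nonzero to its column segment.
        cscRowIdx.resize(nnz);
        cscVal.resize(nnz);
        #pragma omp parallel for
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j) {
                int dst = cscColPtr[csrColIdx[j]] + loc[j];
                cscRowIdx[dst] = i;
                cscVal[dst] = csrVal[j];
            }
    }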

Sorting-based Transposition (First Two Steps)

    A = | 0 a 0 b 0 0 |
        | c d e f 0 0 |
        | 0 0 g h i j |
        | 0 k l m n p |

Compressed Sparse Row (CSR):
    csrRowPtr = 0 2 6 10 15
    csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
    csrVal    = a b c d e f g h i j k l m n p

1. Use a key-value sort to sort csrColIdx (key) together with the auxiliary positions (value):

    before:  csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
             auxPos    = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

    after:   csrColIdx = 0 1 1 1 2 2 2 3 3 3 3 4 4 5 5
             auxPos    = 2 0 3 10 4 6 11 1 5 7 12 8 13 9 14

2. Set cscVal based on auxPos: cscVal[x] = csrVal[auxPos[x]]

    cscVal = c a d k e g l b f h m i n j p

    Example: for x = 7, auxPos[7] = 1, so cscVal[7] = csrVal[1] = b
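A compact C++ sketch of this idea, sorting (column, position) pairs with std::stable_sort rather than the SIMD bitonic sort used in the paper (names are illustrative):

    #include <algorithm>
    #include <vector>

    // Sorting-based CSR -> CSC: sort nonzeros by column index, then gather.
    void sort_transpose(int m, int n,
                        const std::vector<int>& csrRowPtr,
                        const std::vector<int>& csrColIdx,
                        const std::vector<float>& csrVal,
                        std::vector<int>& cscColPtr,
                        std::vector<int>& cscRowIdx,
                        std::vector<float>& cscVal) {
        int nnz = csrRowPtr[m];

        // Pair each nonzero's column index (key) with its CSR position (value).
        std::vector<std::pair<int, int>> keyval(nnz);
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j)
                keyval[j] = {csrColIdx[j], j};

        // Stable sort keeps nonzeros of the same column in row order.
        std::stable_sort(keyval.begin(), keyval.end(),
                         [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                             return a.first < b.first;
                         });

        // Expand csrRowPtr to row indices so we can recover the row of position j.
        std::vector<int> rowOf(nnz);
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j)
                rowOf[j] = i;

        // Gather the transposed arrays and build the column pointer by histogram.
        cscColPtr.assign(n + 1, 0);
        cscRowIdx.resize(nnz);
        cscVal.resize(nnz);
        for (int x = 0; x < nnz; ++x) {
            int col = keyval[x].first, pos = keyval[x].second;
            cscRowIdx[x] = rowOf[pos];
            cscVal[x] = csrVal[pos];
            ++cscColPtr[col + 1];
        }
        for (int c = 0; c < n; ++c) cscColPtr[c + 1] += cscColPtr[c];
    }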

Constraints in Existing Methods
Atomic-based sparse transposition:
- Contention from the atomic operation fetch_and_add()
- Additional overhead from the segmented sort

Sorting-based sparse transposition:
- Performance degrades as the number of nonzero elements grows, due to the O(nnz * log(nnz)) complexity

Outline
- Background and Motivation
- Existing Methods: Atomic-based, Sorting-based
- Designs: ScanTrans, MergeTrans
- Experimental Results
- Conclusions

Performance Considerations
Sparsity independence:
- Performance should not be affected by load imbalance, especially for power-law graphs

Avoid atomic operations:
- Avoid the contention of atomic operations
- Avoid the additional stage of sorting indices inside a row/column

Linear complexity:
- The serial method of sparse transposition has linear complexity, O(m + n + nnz)
- Design parallel methods that get as close to it as possible (a serial reference is sketched below)
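For reference, a minimal serial CSR-to-CSC conversion with O(m + n + nnz) work (a textbook formulation, not code from the paper):

    #include <vector>

    // Serial CSR -> CSC in O(m + n + nnz): histogram, prefix-sum, scatter.
    void serial_transpose(int m, int n,
                          const std::vector<int>& csrRowPtr,
                          const std::vector<int>& csrColIdx,
                          const std::vector<float>& csrVal,
                          std::vector<int>& cscColPtr,
                          std::vector<int>& cscRowIdx,
                          std::vector<float>& cscVal) {
        int nnz = csrRowPtr[m];
        cscColPtr.assign(n + 1, 0);
        cscRowIdx.resize(nnz);
        cscVal.resize(nnz);

        // Count nonzeros per column.
        for (int j = 0; j < nnz; ++j) ++cscColPtr[csrColIdx[j] + 1];
        // Prefix-sum turns counts into column start pointers.
        for (int c = 0; c < n; ++c) cscColPtr[c + 1] += cscColPtr[c];

        // Scatter: rows are visited in order, so each column stays row-sorted.
        std::vector<int> next(cscColPtr.begin(), cscColPtr.end() - 1);
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j) {
                int dst = next[csrColIdx[j]]++;
                cscRowIdx[dst] = i;
                cscVal[dst] = csrVal[j];
            }
    }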

ScanTrans

    A = | 0 a 0 b 0 0 |       csrRowPtr = 0 2 6 10 15
        | c d e f 0 0 |       csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
        | 0 0 g h i j |       csrVal    = a b c d e f g h i j k l m n p
        | 0 k l m n p |       csrRowIdx = 0 0 1 1 1 1 2 2 2 2 3 3 3 3 3

Auxiliary arrays: inter ((p+1) * n entries, per-thread column histograms for p threads) and intra (nnz entries, each nonzero's offset within its thread's partition).

Preprocess: extend csrRowPtr to csrRowIdx; partition the nonzero elements evenly across threads.
1. Histogram: each thread independently counts the column indices in its partition (filling its row of inter and the intra offsets)
2. Vertical scan over inter (column-wise, across the per-thread rows)
3. Horizontal scan (prefix-sum) over the last row of inter, producing cscColPtr = 0 1 4 7 11 13 15

ScanTrans (continued)

4. Write back: for the nonzero at position pos (handled by thread tid) with column index colIdx, its destination in cscRowIdx and cscVal is

    off = cscColPtr[colIdx] + inter[tid * n + colIdx] + intra[pos]

Example: for element h (pos = 7, handled by thread t1, colIdx = 3):

    off = cscColPtr[3] + inter[1 * 6 + 3] + intra[7] = 7 + 1 + 1 = 9

Result: cscRowIdx = 1 0 1 3 1 2 3 0 1 2 ...    cscVal = c a d k e g l b f h ...
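A simplified OpenMP rendering of the ScanTrans idea (the paper uses SIMD-accelerated scans; this sketch uses plain loops and illustrative names):

    #include <algorithm>
    #include <omp.h>
    #include <vector>

    // ScanTrans-style CSR -> CSC: per-thread histograms + two scans + write-back.
    // No atomics; within-column order is preserved because partitions are
    // visited in row/position order.
    void scan_transpose(int m, int n,
                        const std::vector<int>& csrRowPtr,
                        const std::vector<int>& csrColIdx,
                        const std::vector<float>& csrVal,
                        std::vector<int>& cscColPtr,
                        std::vector<int>& cscRowIdx,
                        std::vector<float>& cscVal) {
        int nnz = csrRowPtr[m];
        int p = omp_get_max_threads();

        // Preprocess: expand csrRowPtr to one row index per nonzero.
        std::vector<int> csrRowIdx(nnz);
        for (int i = 0; i < m; ++i)
            for (int j = csrRowPtr[i]; j < csrRowPtr[i + 1]; ++j) csrRowIdx[j] = i;

        std::vector<int> inter((p + 1) * n, 0);  // per-thread column histograms
        std::vector<int> intra(nnz);             // offset within the thread's partition

        #pragma omp parallel num_threads(p)
        {
            int tid = omp_get_thread_num();
            int chunk = (nnz + p - 1) / p;
            int lo = tid * chunk, hi = std::min(nnz, lo + chunk);
            // 1. Histogram: count column indices in this thread's partition.
            for (int j = lo; j < hi; ++j) {
                int c = csrColIdx[j];
                intra[j] = inter[(tid + 1) * n + c]++;
            }
        }

        // 2. Vertical scan: inter[t][c] becomes the count from threads 0..t-1.
        for (int t = 1; t <= p; ++t)
            for (int c = 0; c < n; ++c) inter[t * n + c] += inter[(t - 1) * n + c];

        // 3. Horizontal prefix-sum of the last row gives cscColPtr.
        cscColPtr.assign(n + 1, 0);
        for (int c = 0; c < n; ++c) cscColPtr[c + 1] = cscColPtr[c] + inter[p * n + c];

        // 4. Write back.
        cscRowIdx.resize(nnz);
        cscVal.resize(nnz);
        #pragma omp parallel num_threads(p)
        {
            int tid = omp_get_thread_num();
            int chunk = (nnz + p - 1) / p;
            int lo = tid * chunk, hi = std::min(nnz, lo + chunk);
            for (int j = lo; j < hi; ++j) {
                int c = csrColIdx[j];
                int off = cscColPtr[c] + inter[tid * n + c] + intra[j];
                cscRowIdx[off] = csrRowIdx[j];
                cscVal[off] = csrVal[j];
            }
        }
    }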

Analysis of ScanTrans
Pros:
- Two rounds of scan on auxiliary arrays avoid atomic operations
- The scan operations can be implemented with SIMD instructions

Cons:
- The write-back step has random memory accesses on cscVal and cscRowIdx

MergeTrans

    A = | 0 a 0 b 0 0 |       csrRowPtr = 0 2 6 10 15
        | c d e f 0 0 |       csrColIdx = 1 3 0 1 2 3 2 3 4 5 1 2 3 4 5
        | 0 0 g h i j |       csrVal    = a b c d e f g h i j k l m n p
        | 0 k l m n p |       csrRowIdx = 0 0 1 1 1 1 2 2 2 2 3 3 3 3 3

Preprocess: partition the nonzero elements into multiple blocks.
1. csr2csc_block: each thread transposes one or several blocks into CSC format
2. Merge pairs of blocks in parallel, round by round, until one block is left

Example after step 1 (4 blocks, one per thread):

    t0: cscColPtr = 0 1 3 3 4 4 4    cscRowIdx = 1 0 1 0    cscVal = c a d b
    t1: cscColPtr = 0 0 0 2 4 4 4    cscRowIdx = 1 2 1 2    cscVal = e g f h
    t2: cscColPtr = 0 0 1 2 2 3 4    cscRowIdx = 3 3 2 2    cscVal = k l i j
    t3: cscColPtr = 0 0 0 0 1 2 3    cscRowIdx = 3 3 3      cscVal = m n p
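A sketch of the per-block conversion (step 1) under the same assumptions as the earlier sketches; PartialCSC and its field names are illustrative:

    #include <vector>

    struct PartialCSC {
        std::vector<int> colPtr;     // size n + 1
        std::vector<int> rowIdx;
        std::vector<float> val;
    };

    // Convert one block (a contiguous range [lo, hi) of nonzeros, in row order)
    // into a partial CSC over all n columns. Row order within the block is
    // preserved, so each column of the partial CSC is already row-sorted.
    PartialCSC csr2csc_block(int n, int lo, int hi,
                             const std::vector<int>& csrRowIdx,
                             const std::vector<int>& csrColIdx,
                             const std::vector<float>& csrVal) {
        PartialCSC blk;
        blk.colPtr.assign(n + 1, 0);
        for (int j = lo; j < hi; ++j) ++blk.colPtr[csrColIdx[j] + 1];
        for (int c = 0; c < n; ++c) blk.colPtr[c + 1] += blk.colPtr[c];

        blk.rowIdx.resize(hi - lo);
        blk.val.resize(hi - lo);
        std::vector<int> next(blk.colPtr.begin(), blk.colPtr.end() - 1);
        for (int j = lo; j < hi; ++j) {
            int dst = next[csrColIdx[j]]++;
            blk.rowIdx[dst] = csrRowIdx[j];
            blk.val[dst] = csrVal[j];
        }
        return blk;
    }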

How to Mitigate Random Memory Access
merge_pcsc_mthread: merge two partial CSCs into one.

    Input csc0:  cscColPtr = 0 1 3 5 8 8 8      cscRowIdx = 1 0 1 1 2 0 1 2              cscVal = c a d e g b f h
    Input csc1:  cscColPtr = 0 0 1 2 3 5 7      cscRowIdx = 3 3 3 2 3 2 3                cscVal = k l m i n j p
    Output csc:  cscColPtr = 0 1 4 7 11 13 15   cscRowIdx = 1 0 1 3 1 2 3 0 1 2 3 2 3 2 3  cscVal = c a d k e g l b f h m i n j p

1. Add the two input cscColPtr arrays elementwise to get the output cscColPtr
2. For each column of the output csc, check which input csc its nonzero elements come from, then move the nonzero elements (cscVal and cscRowIdx) from the input cscs to the output csc

Optimization: data is moved only if both input cscs change across two successive columns.
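A minimal sketch of merging two partial CSCs, reusing the PartialCSC struct from the previous sketch and omitting the column-change optimization described above:

    // Merge two partial CSCs covering the same n columns into one CSC.
    // All rows of cscA precede all rows of cscB, so concatenating each column
    // segment (A's part first, then B's) keeps every column in row order.
    PartialCSC merge_csc(int n, const PartialCSC& a, const PartialCSC& b) {
        PartialCSC out;
        out.colPtr.resize(n + 1);
        // 1. Elementwise addition of the column pointers.
        for (int c = 0; c <= n; ++c) out.colPtr[c] = a.colPtr[c] + b.colPtr[c];

        int nnz = out.colPtr[n];
        out.rowIdx.resize(nnz);
        out.val.resize(nnz);

        // 2. Copy each column segment: first a's nonzeros, then b's.
        for (int c = 0; c < n; ++c) {
            int dst = out.colPtr[c];
            for (int j = a.colPtr[c]; j < a.colPtr[c + 1]; ++j, ++dst) {
                out.rowIdx[dst] = a.rowIdx[j];
                out.val[dst] = a.val[j];
            }
            for (int j = b.colPtr[c]; j < b.colPtr[c + 1]; ++j, ++dst) {
                out.rowIdx[dst] = b.rowIdx[j];
                out.val[dst] = b.val[j];
            }
        }
        return out;
    }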

Analysis of MergeTrans
Pros:
- Successive (streaming) memory accesses on both the input cscs and the output csc

Cons:
- Performance is affected by the number of blocks
- May require much larger auxiliary storage (2 * nblocks * (n + 1) + nnz) than ScanTrans and the existing methods

Implementations and Optimizations
SIMD parallel prefix-sum:
- Implemented on x86-based platforms with intrinsics
- Supports AVX2 and IMCI/AVX-512
- Applied to the atomic-based method and ScanTrans (a scalar two-pass variant is sketched after this list)

SIMD parallel sort:
- Bitonic sort and mergesort implemented on x86-based platforms [4]
- Supports AVX, AVX2, and IMCI/AVX-512
- Applied to the sorting-based method

Dynamic scheduling:
- Uses OpenMP tasking (since OpenMP 3.0)
- Applied to the sorting-based method and MergeTrans

[4] K. Hou, et al., ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors, ICS '15, 2015
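For illustration, a scalar two-pass parallel exclusive prefix-sum with OpenMP; the paper's version additionally vectorizes the per-block scans with AVX2/IMCI intrinsics, which is omitted here:

    #include <algorithm>
    #include <omp.h>
    #include <vector>

    // Exclusive prefix-sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
    // Pass 1 computes per-block sums, a short serial scan orders the blocks,
    // pass 2 scans each block starting from its block offset.
    void exclusive_scan(const std::vector<int>& in, std::vector<int>& out) {
        int n = static_cast<int>(in.size());
        out.resize(n);
        int p = omp_get_max_threads();
        std::vector<int> blockSum(p + 1, 0);

        #pragma omp parallel num_threads(p)
        {
            int tid = omp_get_thread_num();
            int chunk = (n + p - 1) / p;
            int lo = tid * chunk, hi = std::min(n, lo + chunk);

            // Pass 1: local sum of this block.
            int sum = 0;
            for (int i = lo; i < hi; ++i) sum += in[i];
            blockSum[tid + 1] = sum;

            #pragma omp barrier
            #pragma omp single
            for (int t = 0; t < p; ++t) blockSum[t + 1] += blockSum[t];

            // Pass 2: local exclusive scan offset by the preceding blocks' sum.
            int running = blockSum[tid];
            for (int i = lo; i < hi; ++i) {
                out[i] = running;
                running += in[i];
            }
        }
    }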

Outline
- Background and Motivation
- Existing Methods: Atomic-based, Sorting-based
- Designs: ScanTrans, MergeTrans
- Experimental Results
- Conclusions

Evaluation & Discussion: Experimental Setup (Hardware)

    Parameter        | CPU                     | MIC
    Product Name     | Intel Xeon E5-2695 v3   | Intel Xeon Phi 5110P
    Code Name        | Haswell                 | Knights Corner
    # of Cores       | 2 x 14                  | 60
    Clock Rate       | 2.3 GHz                 | 1.05 GHz
    L1/L2/L3 Cache   | 32 KB / 256 KB / 35 MB  | 32 KB / 512 KB / -
    Memory           | 128 GB DDR4             | 8 GB GDDR5
    Compiler         | icpc 15.3               | icpc 15.3
    Compiler Options | -xCORE-AVX2 -O3         | -mmic -O3
    Vector ISA       | AVX2                    | IMCI

Evaluation & Discussion: Experimental Setup (Methods)
Methods compared:
- Intel MKL 11.3: mkl_sparse_convert_csr()
- Atomic-based method (from the SCC implementation, SC '13 [3])
- Sorting-based method (based on bitonic sort, ICS '15 [4])
- ScanTrans
- MergeTrans

Dataset:
- 22 matrices: 21 unsymmetric matrices from the University of Florida Sparse Matrix Collection + 1 dense matrix
- Single precision, double precision, and symbolic (no values)

Benchmark suite:
- Sparse matrix-transpose-matrix addition: A^T + A, SpMV: A^T * x, and SpGEMM: A^T * A (all in explicit mode)
- Strongly Connected Components: SCC(A)

Transposition Performance on Haswell

[Figure: effective bandwidth (GB/s) of MKL_MT, Atomic, Sorting, ScanTrans, and MergeTrans; left panel: harmonic mean over the dataset, right panel: wiki-talk; for symbolic, single, and double precision]

Compared to the Intel MKL method, ScanTrans achieves an average speedup of 2.8x. On wiki-talk, the speedup reaches 6.2x for double precision.

Transposition Performance on MIC

[Figure: effective bandwidth (GB/s) of MKL_MT, Atomic, Sorting, ScanTrans, and MergeTrans; left panel: harmonic mean over the dataset, right panel: wiki-talk; for symbolic, single, and double precision]

Compared to the Intel MKL method, MergeTrans achieves an average speedup of 3.4x. On wiki-talk, the speedup reaches 11.7x for single precision.

Higher-level Routines on Haswell

[Figure: execution time (ms) of MKL_MT vs. ScanTrans on the matrices economics, stomach, venkat01, and web-google for four routines: A^T + A, A^T * x (#iters = 50), A^T * A, and SCC(A); one bar is annotated with 202%]

Outline
- Background and Motivation
- Existing Methods: Atomic-based, Sorting-based
- Designs: ScanTrans, MergeTrans
- Experimental Results
- Conclusions

Conclusions
In this paper:
- We identify that sparse transposition can be a performance bottleneck
- We propose two sparse transposition methods: ScanTrans and MergeTrans
- We evaluate the atomic-based, sorting-based, and Intel MKL methods against ScanTrans and MergeTrans on an Intel Haswell CPU and Intel MIC
- Compared to the vendor-supplied library, ScanTrans achieves an average 2.8-fold (up to 6.2-fold) speedup on the CPU, and MergeTrans delivers an average 3.4-fold (up to 11.7-fold) speedup on MIC

Thank you!