Mapping Sparse Matrix-Vector Multiplication on FPGAs

1 Mapping Sparse Matrix-Vector Multiplication on FPGAs
Junqing Sun (1), Gregory Peterson (1), Olaf Storaasli (2)
(1) University of Tennessee, Knoxville
(2) Oak Ridge National Laboratory
July 20, 2007

2 Outline
- Introduction
- Sparse matrix storage format
- Basic design
- Design for floating points
- Implementation results
- Performance analysis

3 Introduction
Sparse Matrix-Vector Multiplication (SpMxV): $y = Ab$
SpMxV on CPUs
- Inefficient
- Optimization algorithms: performance depends on matrix structures
SpMxV on FPGAs
- High throughput achieved for FPGA kernels
- System performance affected by I/O and other overheads

4 Sparse Matrix Storage Formats
CRS
- Widely used
- An example:
  Val: 2, -3, -1, 6, 1, 9, 5, 8, 6
  Col: 0, 2, 1, 4, 0, 1, 0, 1, 3
  Len: 2, 2, 1, 1, 3
Row blocked CRS (RBCRS)
- Compatible with the CRS format
- Lower I/O requirements
[Figure: y = Ax drawn in block form, with output blocks y0, y1, y2 and matrix blocks such as A10, A11, A21.]
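The CRS example on this slide can be traced with a short Python sketch. It assumes that Len holds the number of nonzeros in each row (the five entries sum to the nine stored values) and that Col holds zero-based column indices; the input vector b is a hypothetical choice, since the slide does not give one.

```python
def spmxv_crs(val, col, length, b):
    """Multiply a CRS-stored sparse matrix by a dense vector b.

    val    -- nonzero values, stored row by row
    col    -- column index of each nonzero
    length -- number of nonzeros in each row (assumed meaning of Len)
    """
    y = []
    k = 0  # position in val/col
    for nnz_in_row in length:
        acc = 0
        for _ in range(nnz_in_row):
            acc += val[k] * b[col[k]]
            k += 1
        y.append(acc)
    return y


# Arrays from the slide
val = [2, -3, -1, 6, 1, 9, 5, 8, 6]
col = [0, 2, 1, 4, 0, 1, 0, 1, 3]
length = [2, 2, 1, 1, 3]

# Hypothetical input vector (not from the slides)
b = [1, 2, 3, 4, 5]

y = spmxv_crs(val, col, length, b)
print(y)  # [-7, 28, 1, 18, 45]
```

Each row contributes one multiply and one add per nonzero, which is where the 2 n_nz operation count in the performance model comes from.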

5 Basic Design
[Figure: block diagram of the SpMxV design. Labeled blocks: Application Program, Matrix Storage, Matrix Manager, two PEs (each with b, Multiplier, ACC Circuit, FIFO 1, FIFO 2 and Row ID, fed by valid, col, val and stall signals), Result Mux, Summation Circuit, Result Controller, Result BRAM.]
Properties
- Deeply pipelined
- Common CRS format
- Data-flow controlled architecture
- Independent PEs
- Arbitrary matrix size

6 Design for Floating Points - Accumulation Circuit -
Partial summation circuit
[Figure: adder-based partial summation circuit and its data flow, showing the input, the wrong output and the correct output.]
Problems in the data flow
- Not accumulated to a single value (2, 3, 4, 5, 7 for the first row)
- Different rows are summed up (3+8 => 11)
* The adder has a pipeline of 5 stages
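The first problem listed on this slide can be illustrated with a simplified Python model. This is only a sketch under stated assumptions, not the authors' circuit: it assumes one input per clock cycle and an adder whose result returns to its own input after `latency` cycles, so that input i can only be added to the running sum started `latency` cycles earlier. The function name and the sample row are hypothetical.

```python
def pipelined_accumulate(values, latency=5):
    """Model feeding a pipelined adder's output back to its input.

    With one new value per cycle, value i meets the partial sum that
    entered the adder `latency` cycles earlier, so the row ends up as
    `latency` interleaved partial sums instead of a single total.
    """
    partial_sums = [0.0] * latency
    for i, v in enumerate(values):
        partial_sums[i % latency] += v
    return partial_sums


# Hypothetical row with ten products, all equal to 1.0
row = [1.0] * 10
partials = pipelined_accumulate(row, latency=5)
print(partials)       # [2.0, 2.0, 2.0, 2.0, 2.0]
print(sum(partials))  # 10.0, available only after a further reduction step
```

The leftover partial sums are what the adder tree and the reduced summation circuit on the following slides have to combine.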

7 Design for Floating Points - Adder Tree -
[Figure: adder tree using pipelined adders. Inputs r0 to r3 enter through FIFO 0 to FIFO 3; shifters carry Wen and Row ID alongside the adder tree; the outputs Dout, Wen and Row go to the result BRAM.]
[Figure: data flow for the adder tree, Level 0 to Level 3.]
* Final values are automatically captured by the wen signal
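As a rough software analogue of the adder tree, the sketch below reduces a list of partial sums pairwise, one tree level at a time, and reports how many levels were needed. It is a generic balanced reduction written for illustration; the function name is hypothetical, and the row-ID and write-enable handling shown in the figure is not modeled.

```python
def adder_tree_sum(values):
    """Sum values pairwise, level by level, as a balanced adder tree would.

    Returns (total, levels). With pipelined adders, the latency of the
    tree is roughly levels * adder_latency cycles.
    """
    level = list(values)
    if not level:
        return 0.0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad odd-sized levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


total, levels = adder_tree_sum([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(total, levels)  # 36.0 3
print(levels * 5)     # 15 cycles, assuming the 5-stage adder from the previous slide
```

A tree over n inputs uses n - 1 adders, which is the cost that the reduced summation circuit trades for extra latency.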

8 Design for Floating Points - Summation Circuit -
[Figure: reduced summation circuit. The result controller and shifters carry Wen and Row ID alongside the summation circuit; the outputs are Dout, Wen and Row.]
[Figure: data flow for the reduced summation circuit, Level 0 to Level 3.]
* Buffers are used to take the place of expensive adders
* Lower throughput
* Longer latency

9 Design for Floating Points - Accumulation Circuit -
Comparison of adder tree and summation circuit
Table columns: Design, Number of Adders, Latency
Table rows: Adder Tree, Summation Circuit
Values: 4, 55
Annotations: Lower Cost, Higher Performance

10 Implementation Results
Characteristics (8 PEs) on XC2VP70-7

Design                 64-bit Integer   Single FP   Double FP
Achievable Frequency   175 MHz          200 MHz     165 MHz
Slices                 8282 (25%)       (31%)       (72%)
BRAMs                  36 (10%)         50 (15%)    92 (28%)
MULT18X18              128 (39%)        32 (9%)     128 (39%)

64-bit vs. 32/64-bit Mixed Integer Design
32 bit 32 bit X 64 bit

Design                   32/64-bit Mixed   64-bit
Achievable Frequency     183 MHz           175 MHz
Slices                   3475 (10%)        8282 (25%)
BRAMs                    20 (6%)           36 (10%)
MULT18X18                32 (9%)           128 (39%)
Multiplier Latency       4 cycles          6 cycles
Required I/O Bandwidth   8.8 GB/s          14 GB/s
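The required I/O bandwidth figures can be approximated with a small Python calculation. It rests on assumptions that the slides do not state: every PE consumes one nonzero (a value plus a column index) per clock cycle, and the column index is 16 bits wide. Under those assumptions the result matches the 14 GB/s and 8.8 GB/s entries in the table. The function name is hypothetical.

```python
def required_io_bandwidth(n_pe, freq_hz, val_bytes, col_bytes):
    """Bytes per second needed to keep n_pe PEs busy.

    Assumes one (value, column index) pair per PE per clock cycle.
    """
    return n_pe * freq_hz * (val_bytes + col_bytes)


# 64-bit integer design: 8 PEs at 175 MHz, 8-byte values, assumed 2-byte column index
bw_64 = required_io_bandwidth(8, 175e6, val_bytes=8, col_bytes=2)
print(bw_64 / 1e9)  # 14.0 GB/s

# 32/64-bit mixed design: 8 PEs at 183 MHz, 4-byte values, assumed 2-byte column index
bw_mixed = required_io_bandwidth(8, 183e6, val_bytes=4, col_bytes=2)
print(round(bw_mixed / 1e9, 1))  # 8.8 GB/s
```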

11 Performance Modeling
Required data movements: $n_{IO}$, which depends on $n_{nz}$, $p$, $N$ and $M$
Required floating-point operations: $n_{flop} = 2 n_{nz}$
PEs' floating-point capability: $F = 2\, n_{pe} \cdot freq$
Computation time: $T_{comp} = \frac{\text{total floating-point operations}}{F} = \frac{2 n_{nz}}{F}$

12 Performance Modeling
Execution time:
$T = \max(T_{comp}, T_{IO}) + T_{init} + T_{syn} + T_{overhead} = \max\left(\frac{2 n_{nz}}{F}, \frac{n_{nz}\,(val_{width} + col_{width})}{B_{IO}}\right) + T_{init} + T_{syn} + T_{overhead}$
Performance bounded by I/O bandwidth for the double-precision floating-point design:
$T = \frac{2 n_{nz}}{F} = \frac{2 n_{nz}}{B/5} = \frac{10\, n_{nz}}{B}$
Maximum PEs allowed by I/O bandwidth:
$N_{PE} = \frac{B/5}{2 \cdot frequency} = \frac{B}{10 \cdot frequency}$
Cray XD1: 1 PE is needed, 200 Mflops peak performance!
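The model on this slide can be tried out numerically with the Python sketch below. All function and parameter names are hypothetical. It assumes 10 bytes per nonzero for the double-precision design (an 8-byte value plus a 2-byte column index, consistent with the B/5 term), and the 1 GB/s bandwidth in the usage lines is an assumed figure chosen because it reproduces the 200 Mflops quoted for the Cray XD1; the slides do not state the bandwidth itself.

```python
import math


def execution_time(n_nz, n_pe, freq_hz, bandwidth,
                   val_bytes=8, col_bytes=2,
                   t_init=0.0, t_syn=0.0, t_overhead=0.0):
    """T = max(T_comp, T_IO) + T_init + T_syn + T_overhead, in seconds."""
    f = 2.0 * n_pe * freq_hz                           # F = 2 * n_pe * freq
    t_comp = 2.0 * n_nz / f                            # T_comp = 2 n_nz / F
    t_io = n_nz * (val_bytes + col_bytes) / bandwidth  # bytes moved / bandwidth
    return max(t_comp, t_io) + t_init + t_syn + t_overhead


def io_bound_flops(bandwidth, bytes_per_nonzero=10.0):
    """Peak flop rate the I/O can sustain: 2 flops per nonzero fetched (B/5)."""
    return 2.0 * bandwidth / bytes_per_nonzero


def max_pes(bandwidth, freq_hz, bytes_per_nonzero=10.0):
    """N_PE = B / (10 * frequency) when each nonzero takes 10 bytes."""
    return bandwidth / (bytes_per_nonzero * freq_hz)


B = 1e9       # assumed I/O bandwidth in bytes/s (not given in the slides)
freq = 165e6  # double-precision design frequency from the results table

print(io_bound_flops(B) / 1e6)         # 200.0 Mflops
print(math.ceil(max_pes(B, freq)))     # 1 PE
print(execution_time(1_000_000, 1, freq, B))  # about 0.01 s, set by T_IO
```

In this setting a single PE already computes faster than the assumed link can deliver data, so adding PEs would not shorten the execution time.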

13 Test Matrices
Table columns: ID, Matrix, Area, Size (N), Nonzeros (N_nz), Sparsity (%)
1   Crystk02   FEM Crystal
2   Crystk03   FEM Crystal
3   stat96v1   Linear programming   5995 x
4   nasasrb    Structure analysis
5   raefsky4   Buckling problem
6   ex11       3D steady flow
7   rim        FEM fluid mechanics
8   goodwin    FEM fluid mechanics
9   dbic1      Linear programming
10  rail4284   Railways
* All these matrices come from the University of Florida Tim Davis Matrix Collection

14 Simulation Results for XC2VP70
[Figure: two charts plotted over the test matrices. One shows the overhead percentage (axis from 0.00% to 16.00%); the other shows the percentage of achievable performance (axis from 78.00% upward).]

15 Speedup over P4 2.8 GHz
[Figure: kernel performance, plotted as speedup over a 2.8 GHz Pentium 4 for each test matrix.]
* Clock-cycle-accurate simulation results!
* Depends less on sparse structures!

16 Conclusions
- Architecture and performance modeling for SpMxV on FPGAs
- Big performance difference for different data formats
- Up to 20X speedup over a 2.8 GHz Pentium 4 for the XC2VP70
- Depends less on sparse structure than CPUs
- Performance limited by I/O bandwidth

17 Acknowledgement
This project is supported by the University of Tennessee Science Alliance and the ORNL Laboratory Director's Research and Development program. We would also like to thank Richard Barrett of ORNL for useful discussions on sparse matrices.

18 Questions? Thanks!
