Optimized LU-decomposition with Full Pivot for Small Batched Matrices
S369
Ian Wainwright, High Performance Consulting Sweden
ian.wainwright@hpcsweden.se
Background
Based on work for GTC 2012: 10x speed-up vs. multi-threaded Intel MKL on a 4-core (8 HT) CPU.
http://www.hpcsweden.se/files/cuda-based_lu_factorization.pdf
Aim
To investigate various techniques for batched matrix computations. The motivation behind this work is to use our previous work on LU-decomposition as a use case for investigating optimization techniques for Kepler.
Outline
1. LU-decomposition
2. Implementations
3. Performance gains
4. Conclusions
1. LU-decomposition
LU-decomposition
The idea is to transform A, where Ax = b, into an equivalent triangular system such that A = LU:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} = \begin{pmatrix} 1 & & & \\ l_{21} & 1 & & \\ l_{31} & l_{32} & 1 & \\ l_{41} & l_{42} & l_{43} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ & u_{22} & u_{23} & u_{24} \\ & & u_{33} & u_{34} \\ & & & u_{44} \end{pmatrix}$$
LU-decomposition without pivot, worked example:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix}$$

Step 1: the pivot element is 1. The multipliers are 2/1 = 2 and 3/1 = 3. Eliminate the values below the pivot by subtracting the corresponding multiple of the pivot row:

$$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{pmatrix}, \qquad l_{21} = 2, \; l_{31} = 3$$

Step 2: the pivot element is now -3 and the multiplier is -6 / -3 = 2. The multiplier must be shared with all elements of a row: synchronization and sharing in the inner loop. Eliminating gives:

$$U = \begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{pmatrix}, \qquad L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}$$
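To make the loop structure concrete, here is a minimal sequential reference for the unpivoted factorization above. This is a sketch for illustration only, not the batched kernel from this talk; the function name, row-major storage, and the in-place L/U layout are assumptions.

```cuda
// Unpivoted LU, in place: afterwards the upper triangle of A holds U
// and the strictly lower triangle holds the multipliers of L.
void lu_nopivot(float *A, int n)                // row-major, n <= 32
{
    for (int k = 0; k < n - 1; ++k) {           // current pivot column
        for (int i = k + 1; i < n; ++i) {
            float m = A[i*n + k] / A[k*n + k];  // multiplier l_ik
            A[i*n + k] = m;                     // store L below the diagonal
            for (int j = k + 1; j < n; ++j)     // subtract the pivot row
                A[i*n + j] -= m * A[k*n + j];
        }
    }
}
```

Running this on the 3x3 example reproduces the L and U above. The division by A[k*n + k] is exactly what breaks down when the pivot is zero, which motivates pivoting on the next slide.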
LU-decomposition with full pivot

Why pivot? Consider:

$$A = \begin{pmatrix} 0 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix}$$

The pivot element is 0, so elimination breaks down without pivoting.

Solution: perform a pivot:
1. Find the largest value in the bottom submatrix.
2. Swap rows and columns to make the largest value the pivot element.
3. Keep track of the row and column pivot in each step.
4. Then perform operations over rows as usual.

Here the largest value is 10, so swapping rows 1 and 3 and columns 1 and 3 gives:

$$\begin{pmatrix} 10 & 6 & 3 \\ 8 & 5 & 2 \\ 7 & 4 & 0 \end{pmatrix}, \qquad \text{multipliers } 8/10 \text{ and } 7/10$$

LU-decomposition with full pivot is stable, but finding the largest value necessitates a max-reduction in each column iteration: synchronization and data-sharing.
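A sketch of a single full-pivot step, written sequentially for clarity; the function name and row-major layout are assumptions, and in the actual kernel the search becomes the parallel max-reduction discussed below. The full-row and full-column swaps mirror what LAPACK's xGETC2 does.

```cuda
#include <math.h>   // fabsf

// One full-pivot step: find the largest |value| in the trailing
// submatrix A[k..n)[k..n), swap it into the pivot position, and
// record the swaps in the pivot vectors p (rows) and q (columns)
// so that PAQ = LU can be reassembled at the end.
void full_pivot_step(float *A, int n, int k, int *p, int *q)
{
    int pr = k, pc = k;                          // location of the max
    for (int i = k; i < n; ++i)
        for (int j = k; j < n; ++j)
            if (fabsf(A[i*n + j]) > fabsf(A[pr*n + pc])) { pr = i; pc = j; }

    for (int j = 0; j < n; ++j) {                // swap rows k and pr
        float t = A[k*n + j]; A[k*n + j] = A[pr*n + j]; A[pr*n + j] = t;
    }
    for (int i = 0; i < n; ++i) {                // swap columns k and pc
        float t = A[i*n + k]; A[i*n + k] = A[i*n + pc]; A[i*n + pc] = t;
    }
    p[k] = pr; q[k] = pc;
}
```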
Implementations
Problem size: matrices of at most 32 rows or columns, of any shape, i.e. both rectangular and square. Batches of 10,000 matrices.
Task: find L and U for a matrix A such that PAQ = LU with full pivoting, where P and Q are the pivot vectors.
Implementations
Mapping of the problem:
- Map one matrix to a warp.
- Map a column to a thread.
- One or more matrices per block.

With one matrix per block, a single warp holds the whole matrix: for a 4x4 example, threads t.x 0 through t.x 3 each own one column. With two matrices per block, warp 0 (t.x 0-3) holds the first matrix and warp 1 (t.x 32-35) holds the second. The number of matrices per block affects the occupancy. A kernel skeleton for this mapping is sketched below.
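A skeleton of that mapping, as a sketch under assumptions: the kernel name, the macro name, and the simplification to square n-by-n matrices are all mine, not the talk's.

```cuda
#define MATRICES_PER_BLOCK 2   // 1, 2 or 4 in the benchmarks

// Launch with blockDim.x = 32 * MATRICES_PER_BLOCK:
// one warp per matrix, one column per thread.
__global__ void batched_lu(float *batch, int n, int numMatrices)
{
    int warp   = threadIdx.x / 32;      // warp index within the block
    int col    = threadIdx.x % 32;      // column owned by this thread
    int matrix = blockIdx.x * MATRICES_PER_BLOCK + warp;
    if (matrix >= numMatrices || col >= n) return;

    float *A = batch + matrix * n * n;  // this warp's matrix
    // ... each thread keeps its column while the warp cooperates on the
    // pivot search and row eliminations (the variants compared below) ...
}
```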
Implementations
Aim: to investigate various techniques for batched matrix computations, along three axes and their interaction. A cache-configuration example follows the lists.

Occupancy (matrices per block):
1. 1 matrix per block.
2. 2 matrices per block.
3. 4 matrices per block.

Synchronization and data sharing:
1. Shared memory with __syncthreads().
2. Shared memory using warp-synchronous programming.
3. Warp shuffle.

Cache configuration:
1. Prefer shared.
2. Prefer L1.
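Selecting between the two cache configurations is a one-line, per-kernel hint. A minimal sketch, reusing the hypothetical batched_lu kernel name from the skeleton above:

```cuda
// 48 KB shared / 16 KB L1: for the variants that do their data sharing
// and reductions in shared memory.
cudaFuncSetCacheConfig(batched_lu, cudaFuncCachePreferShared);

// 16 KB shared / 48 KB L1: for the warp-shuffle variant, which barely
// touches shared memory.
cudaFuncSetCacheConfig(batched_lu, cudaFuncCachePreferL1);
```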
Recap of the design space: occupancy (1, 2, or 4 matrices per block) x synchronization (__syncthreads(), warp-synchronous, warp shuffle) x cache configuration (prefer shared, prefer L1).
Performance gains
All benchmarks were performed on a standard GTX 680. All numbers are based on the execution time, in ms, for a batch of 10,000 matrices at various sizes. To save time, we will only be looking at performance numbers for square matrices.
[Figure: execution time in ms for the __syncthreads() variant, with 1, 2, and 4 matrices per block, matrix sizes 10-30; left panel prefer shared, right panel prefer L1.]
[Figure: execution time in ms for the warp-synchronous variant, with 1, 2, and 4 matrices per block, matrix sizes 10-30; left panel prefer shared, right panel prefer L1.]
No surprise so far: these variants use shared memory for all data-sharing and reductions, so preferring shared memory wins. More shared memory gives us higher occupancy, and better performance.
[Figure: execution time in ms, __syncthreads() (left) vs. warp-synchronous (right), with 1, 2, and 4 matrices per block, matrix sizes 10-30.]
Warp-synchronous speed-up over __syncthreads(): roughly a speed-up of 1.7x when using warp-synchronous. [Figure: speed-up vs. matrix size, for 1, 2, and 4 matrices per block.]
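The gist of the warp-synchronous variant, sketched for the max-reduction in the pivot search. This is illustrative: the names are assumptions, and the technique relies on the lockstep warp execution of Kepler-class (pre-Volta) GPUs.

```cuda
// Shared-memory max-reduction over one warp's 32 values.
// The __syncthreads() variant barriers the whole block after each step;
// the warp-synchronous variant instead declares the shared buffer
// volatile and issues no barriers, since all 32 lanes of a Kepler warp
// execute in lockstep.
__device__ void warp_max_reduce(volatile float *vals, int lane)
{
    for (int s = 16; s > 0; s >>= 1) {
        if (lane < s && vals[lane + s] > vals[lane])
            vals[lane] = vals[lane + s];
        // no __syncthreads() here
    }
    // vals[0] now holds the warp-wide maximum
}
```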
[Figure: execution time in ms for the warp-shuffle variant, with 1, 2, and 4 matrices per block, matrix sizes 8-32; left panel prefer shared, right panel prefer L1.]
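On Kepler, the same reduction can stay entirely in registers via warp shuffle, with no shared memory and no barriers. A minimal sketch, using the Kepler-era __shfl_xor intrinsic (later CUDA versions replace it with __shfl_xor_sync):

```cuda
// Butterfly max-reduction in registers: after five steps every lane
// holds the warp-wide maximum, ready for the pivot comparison.
__device__ float warp_max_shfl(float v)
{
    for (int s = 16; s > 0; s >>= 1)
        v = fmaxf(v, __shfl_xor(v, s));
    return v;
}
```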
[Figure: execution time in ms, warp-synchronous with prefer shared (left) vs. warp shuffle with prefer L1 (right), with 1, 2, and 4 matrices per block, matrix sizes 8-32.]
Warp-shuffle speed-up over warp-synchronous: roughly a speed-up of 1.2x when using warp shuffle over warp-synchronous. [Figure: speed-up vs. matrix size, for 1, 2, and 4 matrices per block, matrix sizes 8-32.]
[Figure: execution time in ms for warp shuffle with prefer L1, with 1, 2, and 4 matrices per block, matrix sizes 8-32.]
2 matrices per block is the most consistent choice. We can gain 2x by mapping 2 or 4 matrices to each block; on average we gain 1.5x. [Figure: speed-up vs. 1 matrix per block, matrix sizes 8-32.]
Conclusions
1. Warp-synchronous is 1.7x faster than __syncthreads().
2. Warp shuffle is 1.2x faster than warp-synchronous.
3. 2 matrices per block is 1.5x faster than 1 matrix per block.
4. The optimal cache config varies with the implementation: prefer shared if you use shared memory; prefer L1 if you use warp shuffle.
Questions?
Ian Wainwright, High Performance Consulting Sweden
ian.wainwright@hpcsweden.se