Optimized LU-decomposition with Full Pivot for Small Batched Matrices
S369
Ian Wainwright, High Performance Consulting Sweden
ian.wainwright@hpcsweden.se
Background
Based on work for GTC 2012: 10x speed-up vs. multi-threaded Intel MKL on a 4-core (8 HT) CPU.
http://www.hpcsweden.se/files/cuda-based_lu_factorization.pdf
Aim
To investigate various techniques for batched matrix computations. The motivation behind this work is to use our previous work on LU-decomposition as a use case for investigating optimization techniques for Kepler.
Outline
1. LU-decomposition
2. Implementations
3. Performance gains
4. Conclusions
1. LU-decomposition
LU-decomposition
The idea is to transform A, where Ax = b, into an equivalent triangular system such that A = LU:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix} = \begin{pmatrix} 1 & & & \\ l_{21} & 1 & & \\ l_{31} & l_{32} & 1 & \\ l_{41} & l_{42} & l_{43} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ & u_{22} & u_{23} & u_{24} \\ & & u_{33} & u_{34} \\ & & & u_{44} \end{pmatrix}$$
LU-decomposition without pivot, worked example:

$$A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix}$$

Step 1: the pivot element is 1. The multipliers are 2/1 = 2 and 3/1 = 3. Eliminate the values below the pivot by subtracting the corresponding multiple of the pivot row:

$$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{pmatrix}, \qquad l_{21} = 2, \; l_{31} = 3$$

Step 2: the pivot element is now -3 and the multiplier is -6 / -3 = 2. The multiplier must be shared with all elements of a row: synchronization and sharing in the inner loop. Eliminating gives:

$$U = \begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{pmatrix}, \qquad L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}$$
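To make the loop structure concrete, here is a minimal sequential reference for the unpivoted factorization above. This is a sketch for illustration only, not the batched kernel from this talk; the function name, row-major storage, and the in-place L/U layout are assumptions.

```cuda
// Unpivoted LU, in place: afterwards the upper triangle of A holds U
// and the strictly lower triangle holds the multipliers of L.
void lu_nopivot(float *A, int n)                // row-major, n <= 32
{
    for (int k = 0; k < n - 1; ++k) {           // current pivot column
        for (int i = k + 1; i < n; ++i) {
            float m = A[i*n + k] / A[k*n + k];  // multiplier l_ik
            A[i*n + k] = m;                     // store L below the diagonal
            for (int j = k + 1; j < n; ++j)     // subtract the pivot row
                A[i*n + j] -= m * A[k*n + j];
        }
    }
}
```

Running this on the 3x3 example reproduces the L and U above. The division by A[k*n + k] is exactly what breaks down when the pivot is zero, which motivates pivoting on the next slide.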
LU-decomposition with full pivot

Why pivot? Consider:

$$A = \begin{pmatrix} 0 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix}$$

The pivot element is 0, so elimination breaks down without pivoting.

Solution: perform a pivot:
1. Find the largest value in the bottom submatrix.
2. Swap rows and columns to make the largest value the pivot element.
3. Keep track of the row and column pivot in each step.
4. Then perform operations over rows as usual.

Here the largest value is 10, so swapping rows 1 and 3 and columns 1 and 3 gives:

$$\begin{pmatrix} 10 & 6 & 3 \\ 8 & 5 & 2 \\ 7 & 4 & 0 \end{pmatrix}, \qquad \text{multipliers } 8/10 \text{ and } 7/10$$

LU-decomposition with full pivot is stable, but finding the largest value necessitates a max-reduction in each column iteration: synchronization and data-sharing.
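A sketch of a single full-pivot step, written sequentially for clarity; the function name and row-major layout are assumptions, and in the actual kernel the search becomes the parallel max-reduction discussed below. The full-row and full-column swaps mirror what LAPACK's xGETC2 does.

```cuda
#include <math.h>   // fabsf

// One full-pivot step: find the largest |value| in the trailing
// submatrix A[k..n)[k..n), swap it into the pivot position, and
// record the swaps in the pivot vectors p (rows) and q (columns)
// so that PAQ = LU can be reassembled at the end.
void full_pivot_step(float *A, int n, int k, int *p, int *q)
{
    int pr = k, pc = k;                          // location of the max
    for (int i = k; i < n; ++i)
        for (int j = k; j < n; ++j)
            if (fabsf(A[i*n + j]) > fabsf(A[pr*n + pc])) { pr = i; pc = j; }

    for (int j = 0; j < n; ++j) {                // swap rows k and pr
        float t = A[k*n + j]; A[k*n + j] = A[pr*n + j]; A[pr*n + j] = t;
    }
    for (int i = 0; i < n; ++i) {                // swap columns k and pc
        float t = A[i*n + k]; A[i*n + k] = A[i*n + pc]; A[i*n + pc] = t;
    }
    p[k] = pr; q[k] = pc;
}
```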
Implementations
Problem size: matrices of at most 32 rows or columns, of any shape, i.e. both rectangular and square. Batches of 10,000 matrices.
Task: find L and U for a matrix A such that PAQ = LU with full pivoting, where P and Q are the pivot vectors.
Implementations
Mapping of the problem:
- Map one matrix to a warp.
- Map a column to a thread.
- One or more matrices per block.

With one matrix per block, a single warp holds the whole matrix: for a 4x4 example, threads t.x 0 through t.x 3 each own one column. With two matrices per block, warp 0 (t.x 0-3) holds the first matrix and warp 1 (t.x 32-35) holds the second. The number of matrices per block affects the occupancy. A kernel skeleton for this mapping is sketched below.
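A skeleton of that mapping, as a sketch under assumptions: the kernel name, the macro name, and the simplification to square n-by-n matrices are all mine, not the talk's.

```cuda
#define MATRICES_PER_BLOCK 2   // 1, 2 or 4 in the benchmarks

// Launch with blockDim.x = 32 * MATRICES_PER_BLOCK:
// one warp per matrix, one column per thread.
__global__ void batched_lu(float *batch, int n, int numMatrices)
{
    int warp   = threadIdx.x / 32;      // warp index within the block
    int col    = threadIdx.x % 32;      // column owned by this thread
    int matrix = blockIdx.x * MATRICES_PER_BLOCK + warp;
    if (matrix >= numMatrices || col >= n) return;

    float *A = batch + matrix * n * n;  // this warp's matrix
    // ... each thread keeps its column while the warp cooperates on the
    // pivot search and row eliminations (the variants compared below) ...
}
```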
Implementations
Aim: to investigate various techniques for batched matrix computations, along three axes and their interaction. A cache-configuration example follows the lists.

Occupancy (matrices per block):
1. 1 matrix per block.
2. 2 matrices per block.
3. 4 matrices per block.

Synchronization and data sharing:
1. Shared memory with __syncthreads().
2. Shared memory using warp-synchronous programming.
3. Warp shuffle.

Cache configuration:
1. Prefer shared.
2. Prefer L1.
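Selecting between the two cache configurations is a one-line, per-kernel hint. A minimal sketch, reusing the hypothetical batched_lu kernel name from the skeleton above:

```cuda
// 48 KB shared / 16 KB L1: for the variants that do their data sharing
// and reductions in shared memory.
cudaFuncSetCacheConfig(batched_lu, cudaFuncCachePreferShared);

// 16 KB shared / 48 KB L1: for the warp-shuffle variant, which barely
// touches shared memory.
cudaFuncSetCacheConfig(batched_lu, cudaFuncCachePreferL1);
```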
Recap of the design space: occupancy (1, 2, or 4 matrices per block) x synchronization (__syncthreads(), warp-synchronous, warp shuffle) x cache configuration (prefer shared, prefer L1).
Performance gains
All benchmarks were performed on a standard GTX 680. All numbers are based on the execution time, in ms, for a batch of 10,000 matrices at various sizes. To save time, we will only be looking at performance numbers for square matrices.
[Figure: execution time in ms for the __syncthreads() variant, with 1, 2, and 4 matrices per block, matrix sizes 10-30; left panel prefer shared, right panel prefer L1.]
[Figure: execution time in ms for the warp-synchronous variant, with 1, 2, and 4 matrices per block, matrix sizes 10-30; left panel prefer shared, right panel prefer L1.]
No surprise so far: these variants use shared memory for all data-sharing and reductions, so preferring shared memory wins. More shared memory gives us higher occupancy, and better performance.
[Figure: execution time in ms, __syncthreads() (left) vs. warp-synchronous (right), with 1, 2, and 4 matrices per block, matrix sizes 10-30.]
Warp-synchronous speed-up over __syncthreads(): roughly a speed-up of 1.7x when using warp-synchronous. [Figure: speed-up vs. matrix size, for 1, 2, and 4 matrices per block.]
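The gist of the warp-synchronous variant, sketched for the max-reduction in the pivot search. This is illustrative: the names are assumptions, and the technique relies on the lockstep warp execution of Kepler-class (pre-Volta) GPUs.

```cuda
// Shared-memory max-reduction over one warp's 32 values.
// The __syncthreads() variant barriers the whole block after each step;
// the warp-synchronous variant instead declares the shared buffer
// volatile and issues no barriers, since all 32 lanes of a Kepler warp
// execute in lockstep.
__device__ void warp_max_reduce(volatile float *vals, int lane)
{
    for (int s = 16; s > 0; s >>= 1) {
        if (lane < s && vals[lane + s] > vals[lane])
            vals[lane] = vals[lane + s];
        // no __syncthreads() here
    }
    // vals[0] now holds the warp-wide maximum
}
```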
[Figure: execution time in ms for the warp-shuffle variant, with 1, 2, and 4 matrices per block, matrix sizes 8-32; left panel prefer shared, right panel prefer L1.]
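On Kepler, the same reduction can stay entirely in registers via warp shuffle, with no shared memory and no barriers. A minimal sketch, using the Kepler-era __shfl_xor intrinsic (later CUDA versions replace it with __shfl_xor_sync):

```cuda
// Butterfly max-reduction in registers: after five steps every lane
// holds the warp-wide maximum, ready for the pivot comparison.
__device__ float warp_max_shfl(float v)
{
    for (int s = 16; s > 0; s >>= 1)
        v = fmaxf(v, __shfl_xor(v, s));
    return v;
}
```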
[Figure: execution time in ms, warp-synchronous with prefer shared (left) vs. warp shuffle with prefer L1 (right), with 1, 2, and 4 matrices per block, matrix sizes 8-32.]
Warp-shuffle speed-up over warp-synchronous: roughly a speed-up of 1.2x when using warp shuffle over warp-synchronous. [Figure: speed-up vs. matrix size, for 1, 2, and 4 matrices per block, matrix sizes 8-32.]
[Figure: execution time in ms for warp shuffle with prefer L1, with 1, 2, and 4 matrices per block, matrix sizes 8-32.]
2 matrices per block is the most consistent choice. We can gain 2x by mapping 2 or 4 matrices to each block; on average we gain 1.5x. [Figure: speed-up vs. 1 matrix per block, matrix sizes 8-32.]
Conclusions
1. Warp-synchronous is 1.7x faster than __syncthreads().
2. Warp shuffle is 1.2x faster than warp-synchronous.
3. 2 matrices per block is 1.5x faster than 1 matrix per block.
4. The optimal cache config varies with the implementation: prefer shared if you use shared memory; prefer L1 if you use warp shuffle.
Questions?
Ian Wainwright, High Performance Consulting Sweden
ian.wainwright@hpcsweden.se