Optimized LU-decomposition with Full Pivot for Small Batched Matrices S3069

Ian Wainwright, High Performance Consulting Sweden, ian.wainwright@hpcsweden.se

Background: based on work for GTC 2012: 10x speed-up vs multi-threaded Intel MKL on a 4-core (8 HT) CPU. http://www.hpcsweden.se/files/cuda-based_lu_factorization.pdf

Aim: to investigate various techniques for batched matrix computations. The motivation behind this work is to use our previous work on LU-decomposition as a use case to investigate optimization techniques for Kepler.

Outline
1. LU-decomposition
2. Implementations
3. Performance gains
4. Conclusions

LU-decomposition

The idea is to transform A, where Ax = b, into an equivalent triangular system such that A = LU:

    | a11 a12 a13 a14 |   | 1              |   | u11 u12 u13 u14 |
    | a21 a22 a23 a24 | = | l21 1          | * |     u22 u23 u24 |
    | a31 a32 a33 a34 |   | l31 l32 1      |   |         u33 u34 |
    | a41 a42 a43 a44 |   | l41 l42 l43 1  |   |             u44 |

LU-decomposition without pivot, step by step on a 3x3 example. In each step the pivot element is the diagonal entry of the current column, and the values below it are eliminated by subtracting multiples of the pivot row.

Step 1: pivot element 1; multipliers 2/1 for row 2 and 3/1 for row 3 (each row gets its multiplier times row 1 subtracted).

    A = | 1 4 7  |
        | 2 5 8  |
        | 3 6 10 |

Eliminating column 1 and storing the multipliers in L:

    | 1  4   7  |        | 1     |
    | 0 -3  -6  |    L = | 2     |
    | 0 -6 -11  |        | 3     |

Step 2: pivot element -3; multiplier -6/-3 = 2 for row 3. The multiplier must be shared with all elements of a row: synchronization and sharing in the inner loop.

Eliminating column 2:

    | 1  4  7 |        | 1     |
    | 0 -3 -6 |    L = | 2 1   |
    | 0  0  1 |        | 3 2   |

Result:

    U = | 1  4  7 |        | 1     |
        | 0 -3 -6 |    L = | 2 1   |
        | 0  0  1 |        | 3 2 1 |
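For reference, the walkthrough above condenses to a few lines of code. A minimal host-side sketch (ours, not the talk's GPU implementation) of in-place LU decomposition without pivoting:

```cuda
// Minimal host-side sketch of LU decomposition without pivoting,
// matching the 3x3 walkthrough above. Afterwards, U occupies the
// upper triangle of a and the multipliers of L the strict lower
// triangle (L's unit diagonal is implicit). a is row-major, n x n.
void lu_no_pivot(float *a, int n)
{
    for (int k = 0; k < n - 1; ++k) {              // pivot column k
        for (int i = k + 1; i < n; ++i) {
            float m = a[i * n + k] / a[k * n + k]; // multiplier l_ik
            a[i * n + k] = m;                      // store L in place
            for (int j = k + 1; j < n; ++j)        // row_i -= m * row_k
                a[i * n + j] -= m * a[k * n + j];
        }
    }
}
```

On the 3x3 example this produces exactly the L and U shown above.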

LU-decomposition with full pivot

Without pivoting, elimination breaks down whenever the pivot element is 0, since the multipliers 2/0 and 3/0 are undefined:

    A = | 0 4 7  |
        | 2 5 8  |
        | 3 6 10 |

Solution: perform a pivot.
1. Find the largest value in the bottom submatrix.
2. Swap rows and columns to make the largest value the pivot element.
3. Keep track of the row and column pivot in each step.
4. Then perform operations over rows as usual.

Swapping rows 1 and 3 and columns 1 and 3 moves the largest value, 10, into the pivot position; elimination proceeds with multipliers 8/10 and 7/10:

    | 10 6 3 |
    |  8 5 2 |
    |  7 4 0 |

LU-decomposition with full pivot is stable. But finding the largest value requires synchronization and data-sharing: it necessitates a max-reduction in each column iteration.
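A sketch of one full-pivot step as plain host code, with names of our own choosing; the GPU version described later performs the max search as a warp-level reduction:

```cuda
#include <math.h>   // fabsf

// One full-pivot step at elimination column k: find the largest
// |value| in the trailing submatrix a[k.., k..], swap its row and
// column into the pivot position, and record the swaps in the pivot
// vectors p (rows) and q (columns), both initialized to 0..n-1.
void full_pivot_step(float *a, int n, int k, int *p, int *q)
{
    int pr = k, pc = k;
    float best = fabsf(a[k * n + k]);
    for (int i = k; i < n; ++i)                 // search bottom submatrix
        for (int j = k; j < n; ++j)
            if (fabsf(a[i * n + j]) > best) {
                best = fabsf(a[i * n + j]);
                pr = i;
                pc = j;
            }
    for (int j = 0; j < n; ++j) {               // swap rows k and pr
        float t = a[k * n + j];
        a[k * n + j] = a[pr * n + j];
        a[pr * n + j] = t;
    }
    for (int i = 0; i < n; ++i) {               // swap columns k and pc
        float t = a[i * n + k];
        a[i * n + k] = a[i * n + pc];
        a[i * n + pc] = t;
    }
    int t = p[k]; p[k] = p[pr]; p[pr] = t;      // track row pivot
    t = q[k]; q[k] = q[pc]; q[pc] = t;          // track column pivot
}
```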

Implementations

Problem size: matrices of at most 32 rows or columns, of any shape, i.e. both rectangular and square. Batches of 10,000 matrices. Find L and U for matrix A such that PAQ = LU with full pivoting, where P and Q are the pivot vectors.
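To make the PAQ = LU convention concrete, here is a hedged host-side sketch (our names, not part of the talk) of how the factorization and pivot vectors would be consumed to solve Ax = b in the square case, via Ly = Pb, then Uz = y, then x = Qz:

```cuda
// lu stores L below the diagonal (unit diagonal implied) and U on
// and above it; p and q are the pivot vectors recorded during
// factorization, as in full_pivot_step above.
void lu_solve(const float *lu, const int *p, const int *q,
              const float *b, float *x, int n)
{
    float y[32];                        // matrices are at most 32x32
    for (int i = 0; i < n; ++i) {       // forward solve: L y = P b
        y[i] = b[p[i]];
        for (int j = 0; j < i; ++j)
            y[i] -= lu[i * n + j] * y[j];
    }
    for (int i = n - 1; i >= 0; --i) {  // backward solve: U z = y
        for (int j = i + 1; j < n; ++j)
            y[i] -= lu[i * n + j] * y[j];
        y[i] /= lu[i * n + i];
    }
    for (int i = 0; i < n; ++i)         // undo column pivoting: x = Q z
        x[q[i]] = y[i];
}
```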

Implementations

Mapping of the problem: map one matrix to a warp, map a column to a thread, with one or more matrices per block.

[Diagram: with one matrix per block, each block holds a single warp whose threads t.x 0..3 each own one column of a 4x4 example matrix. With two matrices per block, warp 0 (threads t.x 0..3) owns the first matrix and warp 1 (threads t.x 32..35) owns the second.]

The number of matrices per block affects the occupancy.
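A sketch of this mapping as a CUDA kernel skeleton; the kernel body and the names (batched_lu, MATS_PER_BLOCK) are illustrative reconstructions, not the talk's actual code:

```cuda
#define MATS_PER_BLOCK 2   // 1, 2, or 4 in the experiments below

// One warp per matrix, one thread per column, MATS_PER_BLOCK
// matrices per block; each block launches MATS_PER_BLOCK * 32 threads.
__global__ void batched_lu(float *batch, int n, int num_mats)
{
    int warp = threadIdx.x / 32;                    // matrix within block
    int col  = threadIdx.x % 32;                    // this thread's column
    int mat  = blockIdx.x * MATS_PER_BLOCK + warp;  // global matrix index
    if (mat >= num_mats || col >= n) return;
    float *A = batch + mat * n * n;                 // this warp's matrix
    // ... per-column full-pivot elimination over A goes here ...
}

// Host-side launch for a batch of 10,000 n-by-n matrices:
//   int blocks = (10000 + MATS_PER_BLOCK - 1) / MATS_PER_BLOCK;
//   batched_lu<<<blocks, MATS_PER_BLOCK * 32>>>(d_batch, n, 10000);
```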

Implementations

Aim: to investigate various techniques for batched matrix computations. Three axes are explored, and their interaction; a warp-level reduction sketch follows the list.

Multiple matrices per block (occupancy):
1. 1 matrix per block.
2. 2 matrices per block.
3. 4 matrices per block.

Synchronization and data sharing:
1. Shared memory with syncthreads().
2. Shared memory using warp-synchronous programming.
3. Warp shuffle.

Cache config (cache configuration):
1. Prefer shared.
2. Prefer L1.
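The pivot search is where synchronization bites: each thread owns one column, so finding the pivot requires a warp-wide max-reduction in every column iteration. A sketch of the warp shuffle variant; on current CUDA this is spelled __shfl_xor_sync, while the talk's Kepler-era code would have used __shfl_xor without the mask:

```cuda
// Warp-wide max-reduction with shuffle: no shared memory and no
// __syncthreads(). Each thread passes in its column's candidate
// pivot magnitude; after five butterfly steps every lane holds the
// maximum over all 32 lanes.
__device__ float warp_max(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fmaxf(v, __shfl_xor_sync(0xffffffffu, v, offset));
    return v;
}
```

The real kernel also needs the index of the winning column; that can be carried alongside the value with a second shuffle in the same loop.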

Implementations

[Diagram: the design space as a 3D grid with axes occupancy (1, 2, or 4 matrices per block), synchronization method, and cache configuration.]

Performance gains

All benchmarks were performed on a standard GTX 680. All numbers are based on execution time for a batch of 10,000 matrices at various sizes; execution time is in ms. To save time, we will only be looking at performance numbers for square matrices.

Performance gains: synchronization option 1, syncthreads().

[Charts: execution time (ms) vs matrix size for the syncthreads() variant with 1, 2, and 4 matrices per block; left panel Prefer shared, right panel Prefer L1.]

Performance gains: synchronization option 2, warp-synchronous.

[Charts: execution time (ms) vs matrix size for the warp-synchronous variant with 1, 2, and 4 matrices per block; left panel Prefer shared, right panel Prefer L1.]

Performance gains

No surprise: we use shared memory for all data-sharing and reduction, so Prefer shared wins for these variants. More shared memory gives us higher occupancy, and thus better performance.

Performance gains: syncthreads() vs warp-synchronous.

[Charts: execution time (ms) vs matrix size for the syncthreads() variant (left) and the warp-synchronous variant (right), plus the warp-synchronous speed-up over syncthreads(), each with 1, 2, and 4 matrices per block.]

Roughly a speed-up of 1.7x when using warp-synchronous.
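For contrast with the shuffle version shown earlier, a sketch of what the warp-synchronous shared-memory reduction might look like (our reconstruction, not the talk's code): since all 32 threads working on a matrix belong to one warp, the block-wide __syncthreads() barrier can be dropped.

```cuda
// Warp-synchronous max-reduction in shared memory. s points at this
// warp's 32-float slice of shared memory; lane is threadIdx.x % 32.
// Kepler-era warp-synchronous code relied on volatile and implicit
// lockstep execution; on current GPUs __syncwarp() is required.
__device__ float warp_max_smem(volatile float *s, int lane, float v)
{
    s[lane] = v;
    for (int offset = 16; offset > 0; offset >>= 1) {
        __syncwarp();                    // implicit on Kepler-era code
        if (lane < offset)
            s[lane] = fmaxf(s[lane], s[lane + offset]);
    }
    __syncwarp();
    return s[0];                         // warp-wide maximum
}
```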

Performance gains: synchronization option 3, warp shuffle.

[Charts: execution time (ms) vs matrix size (8, 16, 24, 32) for the warp shuffle variant with 1, 2, and 4 matrices per block; left panel Prefer shared, right panel Prefer L1.]

Performance gains: warp-synchronous vs warp shuffle.

[Charts: execution time (ms) vs matrix size (8, 16, 24, 32) for the warp-synchronous variant (Prefer shared, left) and the warp shuffle variant (Prefer L1, right), plus the warp shuffle speed-up over warp-synchronous, each with 1, 2, and 4 matrices per block.]

Roughly a speed-up of 1.2x when using warp shuffle over warp-synchronous.

Performance gains: matrices per block, using warp shuffle and Prefer L1.

[Charts: execution time (ms) vs matrix size (8, 16, 24, 32) with 1, 2, and 4 matrices per block, plus speed-up vs 1 matrix per block.]

2 matrices per block is the most consistent. We can gain 2x by mapping 2 or 4 matrices to each block; on average we gain 1.5x.

Conclusions
1. Warp-synchronous is 1.7x faster than syncthreads().
2. Warp shuffle is 1.2x faster than warp-synchronous.
3. 2 matrices per block is 1.5x faster than 1 matrix per block.
4. The optimal cache config varies with the implementation: Prefer shared if you use shared memory, Prefer L1 if you use warp shuffle (see the sketch below).
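The cache preference is a per-kernel runtime setting; a minimal sketch using the illustrative batched_lu kernel from the earlier mapping sketch:

```cuda
#include <cuda_runtime.h>

// Forward declaration of the illustrative kernel sketched earlier.
__global__ void batched_lu(float *batch, int n, int num_mats);

// Pick the cache split to match the data-sharing technique:
// Prefer shared for the syncthreads()/warp-synchronous variants,
// Prefer L1 for the warp shuffle variant, which barely uses shared memory.
void configure_cache(bool uses_shared_mem)
{
    cudaFuncSetCacheConfig(batched_lu, uses_shared_mem
                               ? cudaFuncCachePreferShared
                               : cudaFuncCachePreferL1);
}
```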

Questions? Ian Wainwright, High Performance Consulting Sweden, ian.wainwright@hpcsweden.se