Enhancing Performance of Tall-Skinny QR Factorization using FPGAs


1 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs
Abid Rafique, Imperial College London, August 31, 2012

2 Our Claim
Common misconception: dense linear algebra is always better on a GPU. We will show that this holds for square matrices, but for tall-skinny matrices the FPGA is the better architecture.

3 Dense Linear Algebra: Tall-Skinny QR Factorization
Stationary video background subtraction: each frame is decomposed as background + foreground.
Potential real-time applications (30 frames/sec): video surveillance, traffic monitoring.
Robust method: Singular Value Decomposition (SVD).

4 Dense Linear Algebra: Tall-Skinny QR Factorization
Tall-skinny matrix M, containing video frames as columns: M = [f_1 f_2 ... f_100], with m x n = 11,592 x 100.
Each iteration computes a QR factorization of M, takes the SVD of the small R factor (R = U Sigma V^T), thresholds the singular values, and checks for convergence.
Cost: SVD is O(m^2 n + m n^2 + n^3); QR is O(m n^2 + n^3).
Runtimes: CPU-only 1 min; GPU + CPU 17 sec; FPGA + CPU 2-3 sec.
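The cost difference above is what makes the QR-first approach pay off: counting only the leading terms from the slide (constant factors dropped), the full SVD costs roughly m/n times more than QR on a tall-skinny matrix. A quick sanity check for the 11,592 x 100 background-subtraction example:

```python
def flops_ratio(m, n):
    # Leading-term operation counts from the slide (constants dropped)
    svd = m * m * n + m * n * n + n ** 3   # O(m^2 n + m n^2 + n^3)
    qr = m * n * n + n ** 3                # O(m n^2 + n^3)
    return svd / qr

# For the background-subtraction matrix the ratio is about m/n
print(round(flops_ratio(11_592, 100)))  # 116
```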

11 Some Extreme Aspect Ratios of Tall-Skinny QR
Application: Aspect Ratio (m/n)
Background Subtraction: 100
Least Squares: 100
Iterative Solvers: 1,000

12 What we want to achieve?
[Plot: GFLOPs vs. number of columns (n) on the Tesla C2050, comparing the CULA QR routine and CAQR. Efficiencies of roughly 35% and 22% of peak are marked for the wider matrices; in the tall-skinny regime (m >> n) both routines reach only a small fraction of peak.]

15 Outline
What is QR factorization?
Why is tall-skinny QR challenging to parallelize?
How does Communication-Avoiding QR (CAQR) expose parallelism?
Our contribution:
Why do GPUs under-perform for CAQR?
How can we exploit architectural features of FPGAs to enhance performance?
Results

16 QR Factorization: Householder Reflections
Algorithm (Householder QR)
Require: a matrix A in R^(m x n)
for k = 1 to n - 1 do
    x_k := A(k:m, k)
    [v_k, tau_k] := house(x_k)
    P_k := I + tau_k v_k v_k^T
    A(k:m, k:n) := P_k A(k:m, k:n)
end for
Step k applies the reflector P_k to zero out column k below the diagonal. After all steps:
R = P_{n-1} P_{n-2} ... P_1 A,  Q = P_1^T P_2^T ... P_{n-2}^T P_{n-1}^T
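The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: house() uses the tau = -2/(v^T v) convention from the dataflow slide, it assumes a full-column-rank matrix, and the loop runs over all n columns so that tall matrices are fully triangularized.

```python
import numpy as np

def house(x):
    # Householder vector v and scalar tau such that
    # (I + tau * v v^T) x = alpha * e1, with tau = -2 / (v^T v)
    s = np.sign(x[0]) if x[0] != 0 else 1.0
    alpha = -s * np.linalg.norm(x)   # sign choice avoids cancellation
    v = x.copy()
    v[0] -= alpha
    tau = -2.0 / np.dot(v, v)
    return v, tau

def householder_qr(A):
    # Triangularize A column by column; returns the n x n R factor
    A = np.array(A, dtype=float)
    m, n = A.shape
    for k in range(n):
        v, tau = house(A[k:, k])
        # Apply P_k = I + tau * v v^T to the trailing submatrix
        A[k:, k:] += tau * np.outer(v, v @ A[k:, k:])
    return np.triu(A[:n, :])
```

The result agrees with np.linalg.qr up to the usual per-row sign ambiguity of the R factor.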

20 Why is Tall-Skinny QR challenging to parallelize?
Each iteration of the Householder QR loop above is a chain of dependent vector operations:
1. d := x_k^T x_k (dot product)
2. v_k := x_k, with v_k(1) := x_k(1) ± sqrt(d)
3. tau_k := -2/d', where d' := v_k^T v_k
4. beta_i := tau_k v_k^T x_i for every trailing column x_i (dot products)
5. x_i' := x_i + beta_i v_k (AXPY)
With the rows distributed over N_p processors (e.g. N_p = 3), each dot product, x_k^T x_k as well as every v_k^T x_i, is a reduction costing log2(N_p) communication steps, so the whole factorization incurs O(n log N_p) synchronization points.
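A back-of-the-envelope count of those synchronization points, under the simplifying assumption that each of the n - 1 column steps pays two log2(N_p)-deep reductions (one for x_k^T x_k, one for the v_k^T x_i broadcast); the CAQR count from the next slide is included for contrast:

```python
import math

def sync_points_householder(n, n_proc):
    # Two reduction trees per column step, each ceil(log2(Np)) deep
    return 2 * (n - 1) * math.ceil(math.log2(n_proc))

def sync_points_caqr(n_proc):
    # Communication-avoiding QR synchronizes only along the R-merge tree
    return math.ceil(math.log2(n_proc))

print(sync_points_householder(100, 8))  # 594
print(sync_points_caqr(8))              # 3
```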

26 Communication-Avoiding QR Factorization
Communication is costly compared to computation. CAQR takes a divide-and-conquer approach: do as much computation locally as possible and avoid communication; only the small n x n R factors are communicated to the neighbours.
The m x n matrix A is partitioned into row blocks A_0, A_1, A_2, A_3. Each block is factorized locally, and the resulting R factors are merged pairwise up a tree until a single R remains.
This reduces the synchronization count to O(log N_p) points.
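The local-QR-plus-merge-tree scheme (often called TSQR) can be sketched with NumPy doing the block factorizations. This is an illustrative sketch assuming the block count is a power of two, not the paper's FPGA implementation:

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    # Local QR of each row block; these run independently, no communication
    rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, n_blocks)]
    # Merge tree: stack pairs of R factors and re-factorize until one remains
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(rs[::2], rs[1::2])]
    return rs[0]
```

Each merge factorizes a 2n x n stack of triangles, so the data exchanged per synchronization is O(n^2) regardless of m.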


28 Why GPUs under-perform for CAQR?
On the GPU, each streaming multiprocessor (SM) holds one row block (A_0, A_1, A_2, A_3) in its registers and shared memory and computes a local QR. The resulting R factors R_0, R_1, R_2, R_3 travel through global memory (~800 cycles latency) to be paired up as [R_0; R_1] and [R_2; R_3] for the first merge stage, which produces R_10 and R_11; the pair [R_10; R_11] is merged in turn.
Small register file size leads to low arithmetic intensity.
Global communication in the merge stage leads to high latency.
Low utilization of SMs, as only a few are active in the successive merge stages.

33 How do we enhance performance using FPGAs?
On the FPGA, the row blocks A_0, A_1, A_2, A_3 are streamed back-to-back through a single deeply pipelined Householder QR datapath. Multiplexers (sel1, sel2) route either a fresh block from memory or previously computed R factors, held in an on-chip triangular buffer, back into the datapath: after the local factorizations produce R_0, R_1, R_2, R_3, the merges [R_0; R_1] -> R_10, [R_2; R_3] -> R_11 and [R_10; R_11] -> R are scheduled on the same pipeline without leaving the chip.
Large scratch-pad memory gives an arithmetic intensity of 16.
Communication is local; no global-memory round trips between merge stages.
High utilization due to pipeline parallelism.
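The order in which operands enter that single pipelined QR unit, first the raw blocks, then the stacked R pairs up the merge tree, can be sketched as follows. This is a hypothetical helper for illustration, assuming a power-of-two block count; the "R<level><index>" strings follow the slide's R10/R11 naming:

```python
def caqr_schedule(n_blocks):
    # Sequence of operands fed to the single pipelined QR unit:
    # raw blocks first, then stacked R-factor pairs up the merge tree.
    ops = [f"A{i}" for i in range(n_blocks)]
    level = [f"R{i}" for i in range(n_blocks)]
    lvl = 1
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            ops.append(f"[{level[j]};{level[j + 1]}]")
            nxt.append(f"R{lvl}{j // 2}")
        level, lvl = nxt, lvl + 1
    return ops

print(caqr_schedule(4))
# ['A0', 'A1', 'A2', 'A3', '[R0;R1]', '[R2;R3]', '[R10;R11]']
```

Because every operation in this list goes through the same pipeline, the unit stays busy as long as the next operand's inputs are ready on chip.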

43 Householder QR Datapath
The datapath implements the two vector primitives of the algorithm:
DOT PRODUCT (for x_k^T x_k and v_k^T x_i): O(n) resources, O(n) memory ports, O(n^2) latency.
AXPY (x_i' := x_i + beta_i v_k): O(n) resources, O(n) memory ports, O(1) latency.
Utilizing the high memory bandwidth and full resource utilization of the FPGA, we get 129 peak GFLOPs (double precision) on a Virtex-6 SX475T.

45 Experimental Setup
GPU: Nvidia Tesla C2050 (Fermi, 515 peak double-precision GFLOPs), running the CULA QR routine and the CAQR routine from UC Berkeley (thanks to Michael Anderson).
FPGA: Xilinx Virtex-6 SX475T with the Xilinx CoreGen library (225 peak double-precision GFLOPs), compared against "Synthesizing Tiled Matrix Decomposition on FPGAs", Tai et al. (FPL 2011).

46 FPGA Performance vs. Matrix Width (m = 64)
[Plot: GFLOPs vs. number of columns (n) in the tall-skinny regime (m >> n), comparing GPU (CULA), GPU (CAQR), FPGA (Tai et al.), and the proposed FPGA design.]

50 FPGA Performance vs. Matrix Height (n = 51)
[Plot: GFLOPs vs. number of rows (m) in the tall-skinny regime (m >> n), for the same four designs: GPU (CULA), GPU (CAQR), FPGA (Tai et al.), FPGA (Proposed).]

51 Conclusions
The communication-bound tall-skinny QR factorization becomes compute-bound with communication-avoiding linear algebra.
Tall-skinny QR performs well on FPGAs compared to GPUs:
large scratch-pad memory vs. register file;
local communication vs. global communication in the merge stage;
high utilization of the floating-point cores vs. low utilization of SMs during the irregular merge stage.
Comparison: 45x speedup over CAQR; 77x speedup over previous FPGA work; much higher speedup factors over the Intel MKL and CULA QR routines.
Questions?

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS
Abid Rafique, Nachiket Kapre and George A. Constantinides
Electrical and Electronic Engineering Department, Imperial College London, London


More information

Revised Simplex Method

Revised Simplex Method DM545 Linear and Integer Programming Lecture 7 Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 2 Motivation Complexity of single pivot operation

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

Floating Point Architecture Extensions for Optimized Matrix Factorization

Floating Point Architecture Extensions for Optimized Matrix Factorization Floating Point Architecture Extensions for Optimized Matrix Factorization Ardavan Pedram, Andreas Gerstlauer Department of Electrical and Computer Engineering The University of Texas at Austin {ardavan,gerstl}@utexas.edu

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Review of Linear Algebra

Review of Linear Algebra Review of Linear Algebra Dr Gerhard Roth COMP 40A Winter 05 Version Linear algebra Is an important area of mathematics It is the basis of computer vision Is very widely taught, and there are many resources

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June

More information

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element

More information

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical & Electronic

More information

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations! Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

arxiv: v1 [cs.ms] 18 Nov 2016

arxiv: v1 [cs.ms] 18 Nov 2016 Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

1 Number Systems and Errors 1

1 Number Systems and Errors 1 Contents 1 Number Systems and Errors 1 1.1 Introduction................................ 1 1.2 Number Representation and Base of Numbers............. 1 1.2.1 Normalized Floating-point Representation...........

More information

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev . Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

Parallel Transposition of Sparse Data Structures

Parallel Transposition of Sparse Data Structures Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

Practicality of Large Scale Fast Matrix Multiplication

Practicality of Large Scale Fast Matrix Multiplication Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by

More information

On the Computational Complexity of the Discrete Pascal Transform

On the Computational Complexity of the Discrete Pascal Transform 6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Key Terms Symmetric matrix Tridiagonal matrix Orthogonal matrix QR-factorization Rotation matrices (plane rotations) Eigenvalues We will now complete

More information

Reproducible Tall-Skinny QR

Reproducible Tall-Skinny QR Reproducible Tall-Skinny QR James Demmel and Hong Diep Nguyen University of California, Berkeley, Berkeley, USA demmel@cs.berkeley.edu, hdnguyen@eecs.berkeley.edu Abstract Reproducibility is the ability

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for

More information

arxiv: v1 [cs.dc] 4 Sep 2014

arxiv: v1 [cs.dc] 4 Sep 2014 and NVIDIA R GPUs arxiv:1409.1510v1 [cs.dc] 4 Sep 2014 O. Kaczmarek, C. Schmidt and P. Steinbrecher Fakultät für Physik, Universität Bielefeld, D-33615 Bielefeld, Germany E-mail: okacz, schmidt, p.steinbrecher@physik.uni-bielefeld.de

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois

More information

Towards Fast, Accurate and Reproducible LU Factorization

Towards Fast, Accurate and Reproducible LU Factorization Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3

More information

Spiral. Program Synthesis for Performance. and the Spiral team (only part shown) Supported by DARPA, ONR, NSF, Intel, Mercury

Spiral. Program Synthesis for Performance. and the Spiral team (only part shown) Supported by DARPA, ONR, NSF, Intel, Mercury Spiral Program Synthesis for Performance Joint work with Franz Franchetti Yevgen Voronenko Srinivas Chellappa Frédéric de Mesmay Daniel McFarlin José Moura James Hoe and the Spiral team (only part shown)

More information

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti

More information

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Mike Showerman, Guochun Shi Steven Gottlieb Outline Background Lattice QCD and MILC GPU and Cray XK6 node architecture Implementation

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.) page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation

More information

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Sergiy Gogolenko, Zhaojun Bai 2, and Richard Scalettar 2 Donetsk National Technical University, Donetsk, 8300,

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

Computation of the mtx-vec product based on storage scheme on vector CPUs

Computation of the mtx-vec product based on storage scheme on vector CPUs BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information