Enhancing Performance of Tall-Skinny QR Factorization using FPGAs


1 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs
Abid Rafique, Imperial College London, August 31, 2012

2 Our Claim
Common misconception: dense linear algebra is always better on a GPU. We will show that this holds for square matrices, but for tall-skinny matrices the FPGA is the better architecture.

3 Dense Linear Algebra: Tall-Skinny QR Factorization
Stationary video background subtraction: each frame is decomposed as background + foreground.
Potential real-time applications (30 frames/sec): video surveillance, traffic monitoring.
Robust method: Singular Value Decomposition (SVD).

4 Dense Linear Algebra: Tall-Skinny QR Factorization
Tall-skinny matrix M, containing video frames as columns: M = [f_1 f_2 ... f_100], with m x n = 11,592 x 100.
Each iteration computes a QR factorization of M, takes the SVD of the small R factor (R = U Sigma V^T), thresholds the singular values, and checks for convergence.
Cost: SVD is O(m^2 n + m n^2 + n^3); QR is O(m n^2 + n^3).
Runtimes: CPU-only 1 min; GPU + CPU 17 sec; FPGA + CPU 2-3 sec.
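The cost difference above is what makes the QR-first approach pay off: counting only the leading terms from the slide (constant factors dropped), the full SVD costs roughly m/n times more than QR on a tall-skinny matrix. A quick sanity check for the 11,592 x 100 background-subtraction example:

```python
def flops_ratio(m, n):
    # Leading-term operation counts from the slide (constants dropped)
    svd = m * m * n + m * n * n + n ** 3   # O(m^2 n + m n^2 + n^3)
    qr = m * n * n + n ** 3                # O(m n^2 + n^3)
    return svd / qr

# For the background-subtraction matrix the ratio is about m/n
print(round(flops_ratio(11_592, 100)))  # 116
```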

11 Some Extreme Aspect Ratios of Tall-Skinny QR
Application: Aspect Ratio (m/n)
Background Subtraction: 100
Least Squares: 100
Iterative Solvers: 1,000

12 What we want to achieve?
[Plot: GFLOPs vs. number of columns (n) on the Tesla C2050, comparing the CULA QR routine and CAQR. Efficiencies of roughly 35% and 22% of peak are marked for the wider matrices; in the tall-skinny regime (m >> n) both routines reach only a small fraction of peak.]

15 Outline
What is QR factorization?
Why is tall-skinny QR challenging to parallelize?
How does Communication-Avoiding QR (CAQR) expose parallelism?
Our contribution:
Why do GPUs under-perform for CAQR?
How can we exploit architectural features of FPGAs to enhance performance?
Results

16 QR Factorization: Householder Reflections
Algorithm (Householder QR)
Require: a matrix A in R^(m x n)
for k = 1 to n - 1 do
    x_k := A(k:m, k)
    [v_k, tau_k] := house(x_k)
    P_k := I + tau_k v_k v_k^T
    A(k:m, k:n) := P_k A(k:m, k:n)
end for
Step k applies the reflector P_k to zero out column k below the diagonal. After all steps:
R = P_{n-1} P_{n-2} ... P_1 A,  Q = P_1^T P_2^T ... P_{n-2}^T P_{n-1}^T
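The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: house() uses the tau = -2/(v^T v) convention from the dataflow slide, it assumes a full-column-rank matrix, and the loop runs over all n columns so that tall matrices are fully triangularized.

```python
import numpy as np

def house(x):
    # Householder vector v and scalar tau such that
    # (I + tau * v v^T) x = alpha * e1, with tau = -2 / (v^T v)
    s = np.sign(x[0]) if x[0] != 0 else 1.0
    alpha = -s * np.linalg.norm(x)   # sign choice avoids cancellation
    v = x.copy()
    v[0] -= alpha
    tau = -2.0 / np.dot(v, v)
    return v, tau

def householder_qr(A):
    # Triangularize A column by column; returns the n x n R factor
    A = np.array(A, dtype=float)
    m, n = A.shape
    for k in range(n):
        v, tau = house(A[k:, k])
        # Apply P_k = I + tau * v v^T to the trailing submatrix
        A[k:, k:] += tau * np.outer(v, v @ A[k:, k:])
    return np.triu(A[:n, :])
```

The result agrees with np.linalg.qr up to the usual per-row sign ambiguity of the R factor.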

20 Why is Tall-Skinny QR challenging to parallelize?
Each iteration of the Householder QR loop above is a chain of dependent vector operations:
1. d := x_k^T x_k (dot product)
2. v_k := x_k, with v_k(1) := x_k(1) ± sqrt(d)
3. tau_k := -2/d', where d' := v_k^T v_k
4. beta_i := tau_k v_k^T x_i for every trailing column x_i (dot products)
5. x_i' := x_i + beta_i v_k (AXPY)
With the rows distributed over N_p processors (e.g. N_p = 3), each dot product, x_k^T x_k as well as every v_k^T x_i, is a reduction costing log2(N_p) communication steps, so the whole factorization incurs O(n log N_p) synchronization points.
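A back-of-the-envelope count of those synchronization points, under the simplifying assumption that each of the n - 1 column steps pays two log2(N_p)-deep reductions (one for x_k^T x_k, one for the v_k^T x_i broadcast); the CAQR count from the next slide is included for contrast:

```python
import math

def sync_points_householder(n, n_proc):
    # Two reduction trees per column step, each ceil(log2(Np)) deep
    return 2 * (n - 1) * math.ceil(math.log2(n_proc))

def sync_points_caqr(n_proc):
    # Communication-avoiding QR synchronizes only along the R-merge tree
    return math.ceil(math.log2(n_proc))

print(sync_points_householder(100, 8))  # 594
print(sync_points_caqr(8))              # 3
```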

26 Communication-Avoiding QR Factorization
Communication is costly compared to computation. CAQR takes a divide-and-conquer approach: do as much computation locally as possible and avoid communication; only the small n x n R factors are communicated to the neighbours.
The m x n matrix A is partitioned into row blocks A_0, A_1, A_2, A_3. Each block is factorized locally, and the resulting R factors are merged pairwise up a tree until a single R remains.
This reduces the synchronization count to O(log N_p) points.
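The local-QR-plus-merge-tree scheme (often called TSQR) can be sketched with NumPy doing the block factorizations. This is an illustrative sketch assuming the block count is a power of two, not the paper's FPGA implementation:

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    # Local QR of each row block; these run independently, no communication
    rs = [np.linalg.qr(blk, mode='r') for blk in np.array_split(A, n_blocks)]
    # Merge tree: stack pairs of R factors and re-factorize until one remains
    while len(rs) > 1:
        rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(rs[::2], rs[1::2])]
    return rs[0]
```

Each merge factorizes a 2n x n stack of triangles, so the data exchanged per synchronization is O(n^2) regardless of m.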


28 Why GPUs under-perform for CAQR?
On the GPU, each streaming multiprocessor (SM) holds one row block (A_0, A_1, A_2, A_3) in its registers and shared memory and computes a local QR. The resulting R factors R_0, R_1, R_2, R_3 travel through global memory (~800 cycles latency) to be paired up as [R_0; R_1] and [R_2; R_3] for the first merge stage, which produces R_10 and R_11; the pair [R_10; R_11] is merged in turn.
Small register file size leads to low arithmetic intensity.
Global communication in the merge stage leads to high latency.
Low utilization of SMs, as only a few are active in the successive merge stages.

33 How do we enhance performance using FPGAs?
On the FPGA, the row blocks A_0, A_1, A_2, A_3 are streamed back-to-back through a single deeply pipelined Householder QR datapath. Multiplexers (sel1, sel2) route either a fresh block from memory or previously computed R factors, held in an on-chip triangular buffer, back into the datapath: after the local factorizations produce R_0, R_1, R_2, R_3, the merges [R_0; R_1] -> R_10, [R_2; R_3] -> R_11 and [R_10; R_11] -> R are scheduled on the same pipeline without leaving the chip.
Large scratch-pad memory gives an arithmetic intensity of 16.
Communication is local; no global-memory round trips between merge stages.
High utilization due to pipeline parallelism.
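The order in which operands enter that single pipelined QR unit, first the raw blocks, then the stacked R pairs up the merge tree, can be sketched as follows. This is a hypothetical helper for illustration, assuming a power-of-two block count; the "R<level><index>" strings follow the slide's R10/R11 naming:

```python
def caqr_schedule(n_blocks):
    # Sequence of operands fed to the single pipelined QR unit:
    # raw blocks first, then stacked R-factor pairs up the merge tree.
    ops = [f"A{i}" for i in range(n_blocks)]
    level = [f"R{i}" for i in range(n_blocks)]
    lvl = 1
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level), 2):
            ops.append(f"[{level[j]};{level[j + 1]}]")
            nxt.append(f"R{lvl}{j // 2}")
        level, lvl = nxt, lvl + 1
    return ops

print(caqr_schedule(4))
# ['A0', 'A1', 'A2', 'A3', '[R0;R1]', '[R2;R3]', '[R10;R11]']
```

Because every operation in this list goes through the same pipeline, the unit stays busy as long as the next operand's inputs are ready on chip.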

43 Householder QR Datapath
The datapath implements the two vector primitives of the algorithm:
DOT PRODUCT (for x_k^T x_k and v_k^T x_i): O(n) resources, O(n) memory ports, O(n^2) latency.
AXPY (x_i' := x_i + beta_i v_k): O(n) resources, O(n) memory ports, O(1) latency.
Utilizing the high memory bandwidth and full resource utilization of the FPGA, we get 129 peak GFLOPs (double precision) on a Virtex-6 SX475T.

45 Experimental Setup
GPU: Nvidia Tesla C2050 (Fermi, 515 peak double-precision GFLOPs), running the CULA QR routine and the CAQR routine from UC Berkeley (thanks to Michael Anderson).
FPGA: Xilinx Virtex-6 SX475T with the Xilinx CoreGen library (225 peak double-precision GFLOPs), compared against "Synthesizing Tiled Matrix Decomposition on FPGAs", Tai et al. (FPL 2011).

46 FPGA Performance vs. Matrix Width (m = 64)
[Plot: GFLOPs vs. number of columns (n) in the tall-skinny regime (m >> n), comparing GPU (CULA), GPU (CAQR), FPGA (Tai et al.), and the proposed FPGA design.]

50 FPGA Performance vs. Matrix Height (n = 51)
[Plot: GFLOPs vs. number of rows (m) in the tall-skinny regime (m >> n), for the same four designs: GPU (CULA), GPU (CAQR), FPGA (Tai et al.), FPGA (Proposed).]

51 Conclusions
The communication-bound tall-skinny QR factorization becomes compute-bound with communication-avoiding linear algebra.
Tall-skinny QR performs well on FPGAs compared to GPUs:
large scratch-pad memory vs. register file;
local communication vs. global communication in the merge stage;
high utilization of the floating-point cores vs. low utilization of SMs during the irregular merge stage.
Comparison: 45x speedup over CAQR; 77x speedup over previous FPGA work; much higher speedup factors over the Intel MKL and CULA QR routines.
Questions?

Table 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )

ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS
Abid Rafique, Nachiket Kapre and George A. Constantinides
Electrical and Electronic Engineering Department, Imperial College London, London


More information

Revised Simplex Method

Revised Simplex Method DM545 Linear and Integer Programming Lecture 7 Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 2 Motivation Complexity of single pivot operation

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

Floating Point Architecture Extensions for Optimized Matrix Factorization

Floating Point Architecture Extensions for Optimized Matrix Factorization Floating Point Architecture Extensions for Optimized Matrix Factorization Ardavan Pedram, Andreas Gerstlauer Department of Electrical and Computer Engineering The University of Texas at Austin {ardavan,gerstl}@utexas.edu

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Review of Linear Algebra

Review of Linear Algebra Review of Linear Algebra Dr Gerhard Roth COMP 40A Winter 05 Version Linear algebra Is an important area of mathematics It is the basis of computer vision Is very widely taught, and there are many resources

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June

More information

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element

More information

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator Chalermpol Saiprasert, Christos-Savvas Bouganis and George A. Constantinides Department of Electrical & Electronic

More information

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations! Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

arxiv: v1 [cs.ms] 18 Nov 2016

arxiv: v1 [cs.ms] 18 Nov 2016 Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

1 Number Systems and Errors 1

1 Number Systems and Errors 1 Contents 1 Number Systems and Errors 1 1.1 Introduction................................ 1 1.2 Number Representation and Base of Numbers............. 1 1.2.1 Normalized Floating-point Representation...........

More information

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev

James Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev . Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

Parallel Transposition of Sparse Data Structures

Parallel Transposition of Sparse Data Structures Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

Practicality of Large Scale Fast Matrix Multiplication

Practicality of Large Scale Fast Matrix Multiplication Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by

More information

On the Computational Complexity of the Discrete Pascal Transform

On the Computational Complexity of the Discrete Pascal Transform 6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform

CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Key Terms Symmetric matrix Tridiagonal matrix Orthogonal matrix QR-factorization Rotation matrices (plane rotations) Eigenvalues We will now complete

More information

Reproducible Tall-Skinny QR

Reproducible Tall-Skinny QR Reproducible Tall-Skinny QR James Demmel and Hong Diep Nguyen University of California, Berkeley, Berkeley, USA demmel@cs.berkeley.edu, hdnguyen@eecs.berkeley.edu Abstract Reproducibility is the ability

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for

More information

arxiv: v1 [cs.dc] 4 Sep 2014

arxiv: v1 [cs.dc] 4 Sep 2014 and NVIDIA R GPUs arxiv:1409.1510v1 [cs.dc] 4 Sep 2014 O. Kaczmarek, C. Schmidt and P. Steinbrecher Fakultät für Physik, Universität Bielefeld, D-33615 Bielefeld, Germany E-mail: okacz, schmidt, p.steinbrecher@physik.uni-bielefeld.de

More information

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois

More information

Towards Fast, Accurate and Reproducible LU Factorization

Towards Fast, Accurate and Reproducible LU Factorization Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3

More information

Spiral. Program Synthesis for Performance. and the Spiral team (only part shown) Supported by DARPA, ONR, NSF, Intel, Mercury

Spiral. Program Synthesis for Performance. and the Spiral team (only part shown) Supported by DARPA, ONR, NSF, Intel, Mercury Spiral Program Synthesis for Performance Joint work with Franz Franchetti Yevgen Voronenko Srinivas Chellappa Frédéric de Mesmay Daniel McFarlin José Moura James Hoe and the Spiral team (only part shown)

More information

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti

Scalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti

More information

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb

Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Mike Showerman, Guochun Shi Steven Gottlieb Outline Background Lattice QCD and MILC GPU and Cray XK6 node architecture Implementation

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.) page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation

More information

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Sergiy Gogolenko, Zhaojun Bai 2, and Richard Scalettar 2 Donetsk National Technical University, Donetsk, 8300,

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

Computation of the mtx-vec product based on storage scheme on vector CPUs

Computation of the mtx-vec product based on storage scheme on vector CPUs BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information