Enhancing Performance of Tall-Skinny QR Factorization using FPGAs
1 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs
Abid Rafique, Imperial College London, August 31, 2012
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18
2 Our Claim
Common misconception: dense linear algebra is always better on a GPU.
We will show that this holds for square matrices, but that for tall-skinny matrices the FPGA is the better architecture.
3 Dense Linear Algebra - Tall-Skinny QR Factorization
Stationary video background subtraction
[figure: a video frame shown as the sum of two images]
Potential real-time applications (3 frames/sec):
- Video surveillance
- Traffic monitoring
Robust method: Singular Value Decomposition (SVD)
4 Dense Linear Algebra - Tall-Skinny QR Factorization
Tall-skinny matrix (M) containing video frames as columns: $M = [f_1\; f_2\; \cdots\; f_{99}\; f_{100}]$, $m \times n = 11{,}592 \times 100$
SVD: $O(m^2 n + m n^2 + n^3)$
QR: $O(m n^2 + n^3)$
[figure: iteration loop with blocks labelled Q, R and $U \Sigma V^T$, a step "Threshold these singular values" and a test "Converge?"]
- CPU-only: 1 min
- GPU + CPU: 17 sec
- FPGA + CPU: 2-3 sec
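The diagram places a QR factorization in front of the SVD without spelling out how the two connect. One standard way to link them, offered here as a sketch rather than as the talk's stated method, is to factor the tall-skinny matrix first and take the SVD of the small triangular factor only. The symbol $U_R$ is introduced here purely for illustration.

```latex
\begin{aligned}
M &= QR, && Q \in \mathbb{R}^{m \times n},\; R \in \mathbb{R}^{n \times n} \\
R &= U_R \Sigma V^T && \text{(SVD of an } n \times n \text{ matrix, } O(n^3)\text{)} \\
M &= (Q U_R)\, \Sigma V^T = U \Sigma V^T, && U = Q U_R
\end{aligned}
```

Because $m \gg n$, the $O(m n^2)$ QR step dominates the cost, and the $O(m^2 n)$ term listed for a direct SVD does not appear.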
5 Some Extreme Aspect Ratios of Tall-Skinny QR
Application: aspect ratio (m/n)
- Background subtraction: 1
- Least squares: 1
- Iterative solvers: 1,
6 What do we want to achieve?
[figure: GFLOPs versus number of columns (n) for CULA (C2050) and CAQR (C2050), with the tall-skinny region (m >> n) marked and annotations of 35% and 22%]
7 Outline
- What is QR factorization?
- Why is tall-skinny QR challenging to parallelize?
- How does Communication-Avoiding QR (CAQR) expose parallelism?
- Our contribution
- Why do GPUs under-perform for CAQR?
- How can we exploit architectural features of FPGAs to enhance performance?
- Results
8 QR Factorization - Householder Reflections
Algorithm: Householder QR
Require: a matrix $A \in \mathbb{R}^{m \times n}$
for k = 1 to n-1 do
    $x_k := A(k{:}m, k)$
    $[v_k, \tau_k] := \mathrm{house}(x_k)$
    $P_k := I + \tau_k v_k v_k^T$
    $A(k{:}m, k{:}n) := P_k\, A(k{:}m, k{:}n)$
end for
return
[figure: the m x n matrix A with the reflections $P_1$, $P_2$, $P_3$ applied in turn for k = 1, 2, 3]
$R = P_{n-1} P_{n-2} \cdots P_1 A$
$Q = P_1^T P_2^T \cdots P_{n-2}^T P_{n-1}^T$
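The algorithm above can be sketched in plain Python as follows. This is an illustrative reading of the slide, not the author's implementation: the function names, the list-of-rows matrix layout, the sign choice in `house` and the loop bound `min(m - 1, n)` (so that a strictly tall matrix also has its last column reduced) are all assumptions made here.

```python
import math


def house(x):
    """Return (v, tau) so that (I + tau * v v^T) x is zero below its first entry."""
    d1 = sum(xi * xi for xi in x)          # x^T x
    d2 = math.sqrt(d1)                     # ||x||
    v = x[:]
    v[0] += math.copysign(d2, x[0])        # sign chosen to avoid cancellation
    d3 = sum(vi * vi for vi in v)          # v^T v
    tau = -2.0 / d3 if d3 > 0.0 else 0.0   # tau = -2 / (v^T v)
    return v, tau


def householder_qr(A):
    """Return (reflectors, R) for A given as a list of m rows of n floats, m >= n."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]              # work on a copy of A
    reflectors = []                        # one (v_k, tau_k) pair per column
    for k in range(min(m - 1, n)):
        x = [R[i][k] for i in range(k, m)]
        v, tau = house(x)
        reflectors.append((v, tau))
        # Apply P_k = I + tau * v v^T to every remaining column.
        for j in range(k, n):
            d4 = sum(v[i - k] * R[i][j] for i in range(k, m))   # v^T x_j
            d5 = tau * d4
            for i in range(k, m):
                R[i][j] += d5 * v[i - k]
    return reflectors, R


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reflectors, R = householder_qr(A)
for row in R:
    print(["%.4f" % value for value in row])
```

Q is never formed explicitly here; it stays implicit in the stored `(v, tau)` pairs, which is the usual choice when only R is needed.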
9 Why is tall-skinny QR challenging to parallelize?
Algorithm: Householder QR (as on the previous slide)
[figure: dataflow of one Householder step with intermediate values $d_1$ to $d_5$: $x_k^T x_k$, $v_k^{(1)} = x_k^{(1)} \pm d_2$, $-2/d_3$ giving $\tau_k$, $v_k^T x_i$, and $x_i' = x_i + d_5 v_k$]
[figure: the matrix A split across $N_p = 3$ processors (Processor 0, Processor 1, Processor 2), shown for k = 1]
Reductions across processors for each column:
- $x_k^T x_k$: $\log_2 N_p$
- $v_k^T x_i$: $\log_2 N_p$
Total: $O(n \log N_p)$ sync points
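As a rough illustration of the $O(n \log N_p)$ count, the sketch below tallies reduction steps under simplifying assumptions that are not from the slides: each dot product is one binary-tree reduction across processors, and the $v_k^T x_i$ products for all trailing columns of a step are batched into a single reduction. The function name is hypothetical.

```python
import math


def householder_sync_points(n, num_procs):
    """Reduction steps for column-by-column Householder QR over num_procs processors."""
    steps_per_reduction = math.ceil(math.log2(num_procs))   # depth of a binary tree
    reductions_per_column = 2                                # x_k^T x_k and v_k^T x_i
    return n * reductions_per_column * steps_per_reduction


# 100 columns over 3 processors: 100 * 2 * ceil(log2(3)) steps
print(householder_sync_points(100, 3))   # prints 400
```

The count grows linearly with the number of columns, which is the cost the communication-avoiding variant is designed to remove.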
10 Communication-Avoiding QR Factorization
- Communication is costly compared to computation.
- Divide-and-conquer approach: do as much local computation as possible and avoid communication.
- Only R factors are communicated to the neighbours.
[figure: the m x n matrix A is split into row blocks $A_0$ to $A_3$; a local QR on each block produces an R factor, and a merge tree combines the R factors pairwise into the final R]
$O(\log N_p)$ sync points
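The local-QR-plus-merge-tree idea can be sketched in plain Python as below. This is a minimal sequential illustration, not the CAQR code used in the talk: the function names are hypothetical, only the R factor is computed, and the sketch assumes the number of blocks divides m and that each block has at least n rows.

```python
import math


def qr_r(A):
    """Upper-triangular R (top n rows) of A (m x n, m >= n) via Householder reflections."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    for k in range(min(m - 1, n)):
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue                       # column already zero, nothing to reflect
        v = [R[i][k] for i in range(k, m)]
        v[0] += math.copysign(norm, v[0])
        tau = -2.0 / sum(vi * vi for vi in v)
        for j in range(k, n):
            d = tau * sum(v[i - k] * R[i][j] for i in range(k, m))
            for i in range(k, m):
                R[i][j] += d * v[i - k]
    return [R[i][:] for i in range(n)]


def tsqr_r(A, num_blocks):
    """R factor of A computed by local QRs followed by a pairwise merge tree."""
    rows = len(A) // num_blocks
    # Local QR stage: every block is factored independently, with no communication.
    rs = [qr_r(A[b * rows:(b + 1) * rows]) for b in range(num_blocks)]
    # Merge stage: stack pairs of R factors and factor again, halving the count.
    while len(rs) > 1:
        merged = [qr_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:
            merged.append(rs[-1])          # an unpaired R factor passes through
        rs = merged
    return rs[0]


A = [[float(i + 1), float((i + 1) ** 2)] for i in range(8)]
print(tsqr_r(A, 4))
```

With four blocks the merge loop runs twice, matching the two-level tree in the figure; only n x n triangles move between stages, never the tall blocks themselves.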
11 Outline
- What is QR factorization?
- Why is tall-skinny QR challenging to parallelize?
- How does Communication-Avoiding QR (CAQR) expose parallelism?
- Our contribution
- Why do GPUs under-perform for CAQR?
- How can we exploit architectural features of FPGAs to enhance performance?
- Results
12 Why do GPUs under-perform for CAQR?
[figure: CAQR on a GPU. Blocks $A_0$ to $A_3$ are loaded from global memory (~8 cycles latency) into the registers of separate SMs for the local QR; the resulting R factors go back through global memory and are reloaded in pairs for each merge stage, with fewer SMs holding data at each stage]
- Small register file size leads to low arithmetic intensity.
- Global communication in the merge stage leads to high latency.
- Low utilization of SMs, as fewer are active in successive merge stages.
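The utilization point can be made concrete with a small count of busy units per stage of a binary merge tree. This is a sketch under assumptions introduced here, not a model from the talk: one processing unit (an SM in the GPU case) works on each block or each pair of R factors, and all stages are weighted equally, which ignores that merge stages are shorter than the local QR stage. The function names are hypothetical.

```python
def active_units_per_stage(num_blocks):
    """Busy units in the local QR stage followed by each merge stage."""
    active = [num_blocks]                    # local QR: one unit per block
    remaining = num_blocks                   # R factors still to be merged
    while remaining > 1:
        active.append(remaining // 2)        # one unit per pair being merged
        remaining = (remaining + 1) // 2     # an unpaired R factor waits a stage
    return active


def average_utilization(num_blocks):
    """Fraction of unit-stages that do useful work, with equal stage weights."""
    active = active_units_per_stage(num_blocks)
    return sum(active) / (num_blocks * len(active))


print(active_units_per_stage(4))   # prints [4, 2, 1]
print(average_utilization(4))
```

For the four-block example in the figure, half of the units fall idle at the first merge and three quarters at the second.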
13 How do we enhance performance using FPGAs?
[figure: FPGA architecture. Blocks $A_0$ to $A_3$ stream from global memory through a multiplexer (sel) into a pipelined Householder QR unit; dense and triangle paths selected by sel1 and sel2 feed the resulting R factors back into the same unit for the successive merge stages]
- Large scratch-pad memory raises arithmetic intensity.
- Local communication
- High utilization due to pipeline parallelism
14 Householder QR Datapath
[figure: datapath built from multipliers and adders, with a DOT PRODUCT unit (O(n) resources, O(n) memory ports) and an AXPY unit (O(n) resources, O(n) memory ports, O(1) latency) implementing the Householder dataflow from $x_k$ through $d_1$ to $d_5$ to $x_i' = x_i + d_5 v_k$]
By exploiting the high memory bandwidth and fully utilizing the FPGA's resources, we get 129 peak GFLOPs (double precision) on a Virtex-6 SX475T.
15 Experimental Setup
- Nvidia C2050 Fermi (515 peak double-precision GFLOPs)
- CULA QR routine
- CAQR routine from UC Berkeley (thanks to Michael Anderson)
- Xilinx Virtex-6 SX475T with the Xilinx Coregen library (225 peak double-precision GFLOPs)
- "Synthesizing Tiled Matrix Decomposition on FPGAs", Tai et al. (FPL 2011)
16 FPGA Performance vs Matrix Width (m = 64)
[figure: GFLOPs versus number of columns (n) for GPU (CULA), GPU (CAQR), FPGA (Tai et al.) and FPGA (Proposed), with the tall-skinny region (m >> n) marked]
17 FPGA Performance vs Matrix Height (n = 51)
[figure: GFLOPs versus number of rows (m) for GPU (CULA), GPU (CAQR), FPGA (Tai et al.) and FPGA (Proposed), with the tall-skinny region (m >> n) marked]
18 Conclusions
- The communication-bound tall-skinny QR factorization becomes compute-bound with communication-avoiding linear algebra.
- Tall-skinny QR performs better on FPGAs than on GPUs:
  - Large scratch-pad memory vs register file
  - Local communication vs global communication in the merge stage
  - High utilization of floating-point cores vs low utilization of SMs during the irregular merge stage
- Comparison:
  - 45 speed up over CAQR
  - 77 speed up over previous FPGA work
  - Much higher speed-up factors over Intel MKL and CULA QR routines
Questions?
More informationParallel Numerics. Scope: Revise standard numerical methods considering parallel computations!
Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline
More informationarxiv: v1 [cs.ms] 18 Nov 2016
Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More information1 Number Systems and Errors 1
Contents 1 Number Systems and Errors 1 1.1 Introduction................................ 1 1.2 Number Representation and Base of Numbers............. 1 1.2.1 Normalized Floating-point Representation...........
More informationJames Demmel UC Berkeley Math and EECS Depts. Joint work with Ioana Dumitriu, Olga Holtz, Plamen Koev
. Accurate and efficient expression evaluation and linear algebra, or Why it can be easier to compute accurate eigenvalues of a Vandermonde matrix than the accurate sum of 3 numbers James Demmel UC Berkeley
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationMatrix Assembly in FEA
Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationPracticality of Large Scale Fast Matrix Multiplication
Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by
More informationOn the Computational Complexity of the Discrete Pascal Transform
6 th International Conference Logic and Applications LAP 207, September 8-22, 207, Dubrovnik, Croatia On the Computational Complexity of the Discrete Pascal Transform Dušan B. Gajić, Radomir S. Stanković
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform
CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationParallel Singular Value Decomposition. Jiaxing Tan
Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationSection 4.5 Eigenvalues of Symmetric Tridiagonal Matrices
Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Key Terms Symmetric matrix Tridiagonal matrix Orthogonal matrix QR-factorization Rotation matrices (plane rotations) Eigenvalues We will now complete
More informationReproducible Tall-Skinny QR
Reproducible Tall-Skinny QR James Demmel and Hong Diep Nguyen University of California, Berkeley, Berkeley, USA demmel@cs.berkeley.edu, hdnguyen@eecs.berkeley.edu Abstract Reproducibility is the ability
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012
MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7
More informationTips Geared Towards R. Adam J. Suarez. Arpil 10, 2015
Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done
More informationEfficient implementation of the overlap operator on multi-gpus
Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator
More informationAlgorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems
Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for
More informationarxiv: v1 [cs.dc] 4 Sep 2014
and NVIDIA R GPUs arxiv:1409.1510v1 [cs.dc] 4 Sep 2014 O. Kaczmarek, C. Schmidt and P. Steinbrecher Fakultät für Physik, Universität Bielefeld, D-33615 Bielefeld, Germany E-mail: okacz, schmidt, p.steinbrecher@physik.uni-bielefeld.de
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationTowards Fast, Accurate and Reproducible LU Factorization
Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3
More informationSpiral. Program Synthesis for Performance. and the Spiral team (only part shown) Supported by DARPA, ONR, NSF, Intel, Mercury
Spiral Program Synthesis for Performance Joint work with Franz Franchetti Yevgen Voronenko Srinivas Chellappa Frédéric de Mesmay Daniel McFarlin José Moura James Hoe and the Spiral team (only part shown)
More informationBreaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points
Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 5: Numerical Linear Algebra Cho-Jui Hsieh UC Davis April 20, 2017 Linear Algebra Background Vectors A vector has a direction and a magnitude
More informationScalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti
Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti
More informationTuning And Understanding MILC Performance In Cray XK6 GPU Clusters. Mike Showerman, Guochun Shi Steven Gottlieb
Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters Mike Showerman, Guochun Shi Steven Gottlieb Outline Background Lattice QCD and MILC GPU and Cray XK6 node architecture Implementation
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationIndex. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)
page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation
More informationStructured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators
Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Sergiy Gogolenko, Zhaojun Bai 2, and Richard Scalettar 2 Donetsk National Technical University, Donetsk, 8300,
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationBlock AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark
Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise
More information