Communication-avoiding LU and QR factorizations for multicore architectures
|
|
- Kory Flynn
- 5 years ago
- Views:
Transcription
1 Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway th April 2010 Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
2 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
3 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
4 Introduction Architectural trends show an increasing communication cost compared to the time it takes to perform arithmetic operations Motivated the design of communication avoiding algorithms that minimize communication First results are CAQR [Demmel, Grigori, Hoemmen, Langou 08] and CALU [Grigori, Demmel, Xiang 08], implemented for distributed memory. Our goal is to design multithreaded QR and LU factorizations for multicores based on communication avoiding algorithms. Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
5 LU factorization with partial pivoting Factorization on Pr by Pc grid of processors as implemented in SCALAPACK: For ib = 1 to n-1 step b A(ib) = A(ib:n, ib:n) 1 Compute panel factorization (pdgetf2) O(nlog 2P r) - find pivot in each column, swap rows 2 Apply all row permutations (pdlaswp) O(n/b(log 2P c + log 2P r)) - broadcast pivot information along the rows - swap rows at left and right 3 Compute block row of U (pdtrsm) O(n/blog 2P c) - broadcast right diagonal block of L of current panel 4 Update trailing matrix (pdgemm) O(n/b(log 2P c + log 2P r)) - broadcast right block column of L - broadcast down block row of U Pivoting requires communication among processors on distributed memory and synchronisation between threads on multicores. Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
6 CALU and CAQR approach Communication avoiding algorithms [Demmel, Grigori, Hoemmen, Langou, Xiang 08] approach: Decrease communication required for pivoting and overcome the latency bottleneck of classic algorithms by performing the factorization of a block column (a tall and skinny matrix) as a reduction operation and doing some redundant computations They are communication optimal in terms of both latency and bandwidth They lead to important speedups on distributed memory computers Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
7 Goal Our goal Combine the main ideas to reduce communication in CALU and CAQR with : appropriate blocking task identification dynamic scheduling The reduction operation to use for a block-column factorization is based on a binary tree with asynchronous tasks : reduces synchronisation between threads (only O(log 2 (Pr))) avoids bus contention Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
8 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
9 CAQR Figure: Parallel TSQR Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25 Each panel factorization is computed as a reduction operation where at each node a QR factorization is performed. The reduction tree is chosen depending on the underlying architecture. For a binary tree log 2 (Pr) steps are used.
10 CAQR Update the submatrix using the tree in log 2 (Pr) steps Figure: The update of the trailing submatrix is triggered by the reduction tree used during panel factorization Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
11 CALU[Grigori, Demmel, Xiang 08] The panel factorization is performed in two steps: A preprocessing steps aims at identifying at low communication cost good pivot rows The pivot rows are permuted in the first positions of the panel and LU without pivoting of the panel is performed Figure: Stable parallel panel factorization Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
12 CALU (Stability) average growth factor P=256,b=32 P=256,b=16 P=128,b=64 P=128,b=32 P=128,b=16 P=64, b=128 P=64, b=64 P=64, b=32 P=64, b=16 GEPP n 2/3 2*n 2/3 3*n 1/ Figure: Stability of binary tree based CALU factorization for random matrices Extensive tests performed on random matrices and a set of special matrices using binary tree and flat tree show that CALU is as stable as GEPP in practice. Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
13 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
14 Multithreaded CALU The matrix is partitioned in blocks of size Tr x b The computation of each block is associated with a task The task dependency graph is scheduled using a dynamic scheduler Figure: Matrix 4 4 blocks and T r =2and Corresponding task dependency graph Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
15 Multithreaded CALU Panel factorization is performed in two steps: find good pivots at low communication cost, permute them and compute LU factorization of the panel without pivoting. The panel factorization stays on the critical path but it is done more efficiently Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
16 Multithreaded CALU (Execution) Figure: Example of execution of CALU for a tall skinny matrix, using b = 100 and T r =1, on 8-core Figure: Example of execution of CALU for a tall skinny matrix, using b = 100 and T r =8, on 8-core Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
17 Multithreaded CAQR Same approach as CALU but: Panel factorization is performed only once The update of the trailing matrix is triggered by the binary tree used for the panel factorization. Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
18 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
19 Environments Tests performed on: two-socket, quad-core machine based on Intel Xeon EMT64 processor running on Linux and on a four-socket, quad-core machine based on AMD Opteron processor Comparison with MKL and PLASMA 2.0 (with default parameters) b = MIN(n, 100) has been chosen as block size Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
20 Performance of CALU Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 8 cores Tall Skinny Matrix, CALU, m=10 5 MKL_dgetf2 MKL_dgetrf PLASMA_dgetrf CALU(Tr=4) CALU(Tr=8) GFlops/s log2(n) Figure: m=10 5 and varying n from 10 to Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
21 Performance of CALU Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 16 cores Tall Skinny Matrix, CALU, m=10 5 ACML_dgeqrf PLASMA_dgeqrf CALU(Tr=8) CALU(Tr=16) 30 GFlops/s log2(n) Figure: m=10 5 and varying n from 10 to Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
22 Performance of CAQR Performance of CAQR, MKL_dgeqrf, PLASMA_dgeqrf on 8 cores Tall Skinny Matrix, CAQR, m=10 5 MKL_dgeqrf PLASMA_dgeqrf CAQR(Tr=2) CAQR(Tr=4) CAQR(Tr=8) TSQR GFlops/s log2(n) Figure: m=10 5 and varying n from 10 to Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
23 1 Introduction 2 CALU and CAQR factorization 3 Multithreaded CALU and CAQR 4 Experimental section 5 Conclusion Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
24 Conclusion Multithreaded CALU and CAQR lead to important improvements for tall and skinny matrices with respect to the corresponding routines in MKL and PLASMA. PLASMA becomes more efficient with increasing number of columns. No significant improvements obtained so far for square matrices. Prospects: Improve the performance of the trailing matrix update by increasing the block size to optimize BLAS3 operations. Compare with the recent approach of [Hadri, Ltaief, Agullo, Dongarra 09] for QR factorization, which uses a different reduction tree during panel factorization. Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
25 Thank you Thank you Communication-avoiding LU and QR factorizations for multicore architectures 16th April / 25
Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationTile QR Factorization with Parallel Panel Processing for Multicore Architectures
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at
More informationAvoiding communication in linear algebra. Laura Grigori INRIA Saclay Paris Sud University
Avoiding communication in linear algebra Laura Grigori INRIA Saclay Paris Sud University Motivation Algorithms spend their time in doing useful computations (flops) or in moving data between different
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationCOALA: Communication Optimal Algorithms
COALA: Communication Optimal Algorithms for Linear Algebra Jim Demmel EECS & Math Depts. UC Berkeley Laura Grigori INRIA Saclay Ile de France Collaborators and Supporters Collaborators at Berkeley (campus
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationLU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California
More informationAvoiding Communication in Distributed-Memory Tridiagonalization
Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura
More informationMinimizing Communication in Linear Algebra. James Demmel 15 June
Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on
More informationc 2013 Society for Industrial and Applied Mathematics
SIAM J. MATRIX ANAL. APPL. Vol. 34, No. 3, pp. 1401 1429 c 2013 Society for Industrial and Applied Mathematics LU FACTORIZATION WITH PANEL RANK REVEALING PIVOTING AND ITS COMMUNICATION AVOIDING VERSION
More informationTable 1. Comparison of QR Factorization (Square: , Tall-Skinny (TS): )
ENHANCING PERFORMANCE OF TALL-SKINNY QR FACTORIZATION USING FPGAS Abid Rafique, Nachiket Kapre and George A. Constantinides Electrical and Electronic Engineering Department Imperial College London London,
More informationDense LU factorization and its error analysis
Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,
More informationAccelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers
UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric
More informationCommunication Avoiding LU Factorization using Complete Pivoting
Communication Avoiding LU Factorization using Complete Pivoting Implementation and Analysis Avinash Bhardwaj Department of Industrial Engineering and Operations Research University of California, Berkeley
More informationDynamic Scheduling within MAGMA
Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing
More informationScalable, Robust, Fault-Tolerant Parallel QR Factorization
Scalable, Robust, Fault-Tolerant Parallel QR Factorization Camille Coti LIPN, CNRS UMR 7030, Université Paris 13, Sorbonne Paris Cité 99, avenue Jean-Baptiste Clément, F-93430 Villetaneuse, FRANCE camille.coti@lipn.univ-paris13.fr
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationLAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
LAPACK Working Note #224 QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou Dpt of Electrical Engineering
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationIterative methods, preconditioning, and their application to CMB data analysis. Laura Grigori INRIA Saclay
Iterative methods, preconditioning, and their application to CMB data analysis Laura Grigori INRIA Saclay Plan Motivation Communication avoiding for numerical linear algebra Novel algorithms that minimize
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationRank revealing factorizations, and low rank approximations
Rank revealing factorizations, and low rank approximations L. Grigori Inria Paris, UPMC January 2018 Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization
More informationarxiv: v1 [cs.ms] 18 Nov 2016
Bidiagonalization with Parallel Tiled Algorithms Mathieu Faverge,2, Julien Langou 3, Yves Robert 2,4, and Jack Dongarra 2,5 Bordeaux INP, CNRS, INRIA et Université de Bordeaux, France 2 University of Tennessee,
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationSymmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano
Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationAvoiding Communication in Linear Algebra
Avoiding Communication in Linear Algebra Grey Ballard Microsoft Research Faculty Summit July 15, 2014 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation,
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationA parallel tiled solver for dense symmetric indefinite systems on multicore architectures
A parallel tiled solver for dense symmetric indefinite systems on multicore architectures Marc Baboulin, Dulceneia Becker, Jack Dongarra INRIA Saclay-Île de France, F-91893 Orsay, France Université Paris
More informationSpécialité: Informatique. par. Marc Baboulin. Maître de conférences, Université Paris-Sud Chaire Inria Saclay - Île-de-France
HABILITATION A DIRIGER DES RECHERCHES présentée à l Université Paris-Sud Spécialité: Informatique par Marc Baboulin Maître de conférences, Université Paris-Sud Chaire Inria Saclay - Île-de-France Résolutions
More informationDivide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures
Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures Grégoire Pichon, Azzam Haidar, Mathieu Faverge, Jakub Kurzak To cite this version: Grégoire Pichon, Azzam Haidar, Mathieu
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationUsing Postordering and Static Symbolic Factorization for Parallel Sparse LU
Using Postordering and Static Symbolic Factorization for Parallel Sparse LU Michel Cosnard LORIA - INRIA Lorraine Nancy, France Michel.Cosnard@loria.fr Laura Grigori LORIA - Univ. Henri Poincaré Nancy,
More informationEnhancing Performance of Tall-Skinny QR Factorization using FPGAs
Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common
More informationOn aggressive early deflation in parallel variants of the QR algorithm
On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationDesigning a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems
Designing a QR Factorization for Multicore and Multi-GPU Architectures using Runtime Systems Emmanuel AGULLO (INRIA HiePACS team) Julien LANGOU (University of Colorado Denver) Joint work with University
More informationReproducible Tall-Skinny QR
Reproducible Tall-Skinny QR James Demmel and Hong Diep Nguyen University of California, Berkeley, Berkeley, USA demmel@cs.berkeley.edu, hdnguyen@eecs.berkeley.edu Abstract Reproducibility is the ability
More informationMUMPS. The MUMPS library: work done during the SOLSTICE project. MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux
The MUMPS library: work done during the SOLSTICE project MUMPS team, Lyon-Grenoble, Toulouse, Bordeaux Sparse Days and ANR SOLSTICE Final Workshop June MUMPS MUMPS Team since beg. of SOLSTICE (2007) Permanent
More informationAccelerating computation of eigenvectors in the nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationDistributed QR Factorization Based on Randomized Algorithms
Distributed QR Factorization Based on Randomized Algorithms Hana Straková 1, Wilfried N. Gansterer 1,, and Thomas Zemen 2 1 University of Vienna, Austria, Research Group Theory and Applications of Algorithms
More informationPreconditioned Parallel Block Jacobi SVD Algorithm
Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic
More informationSaving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors
20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,
More informationMAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationRe-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009
III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009 Background and motivation Running time of an algorithm
More informationTOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM
TOWARD HIGH PERFORMANCE TILE DIVIDE AND CONQUER ALGORITHM FOR THE DENSE SYMMETRIC EIGENVALUE PROBLEM AZZAM HAIDAR, HATEM LTAIEF, AND JACK DONGARRA Abstract. Classical solvers for the dense symmetric eigenvalue
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationAccelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem
Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou To cite this version: Emmanuel Agullo, Camille Coti,
More informationPower Profiling of Cholesky and QR Factorizations on Distributed Memory Systems
Noname manuscript No. (will be inserted by the editor) Power Profiling of Cholesky and QR Factorizations on Distributed s George Bosilca Hatem Ltaief Jack Dongarra Received: date / Accepted: date Abstract
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationA NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS
A NOVEL PARALLEL QR ALGORITHM FOR HYBRID DISTRIBUTED MEMORY HPC SYSTEMS ROBERT GRANAT, BO KÅGSTRÖM, AND DANIEL KRESSNER Abstract A novel variant of the parallel QR algorithm for solving dense nonsymmetric
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationUsing Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods
Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods Marc Baboulin 1, Xiaoye S. Li 2 and François-Henry Rouet 2 1 University of Paris-Sud, Inria Saclay, France 2 Lawrence Berkeley
More informationAasen s Symmetric Indefinite Linear Solvers in LAPACK
Aasen s Symmetric Indefinite Linear Solvers in LAPACK Ichitaro Yamazaki and Jack Dongarra December 2, 217 Abstract Recently, we released two LAPACK subroutines that implement Aasen s algorithms for solving
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J MATRIX ANAL APPL Vol 36, No, pp 55 89 c 05 Society for Industrial and Applied Mathematics COMMUNICATION AVOIDING RANK REVEALING QR FACTORIZATION WITH COLUMN PIVOTING JAMES W DEMMEL, LAURA GRIGORI,
More informationSpectrum-Revealing Matrix Factorizations Theory and Algorithms
Spectrum-Revealing Matrix Factorizations Theory and Algorithms Ming Gu Department of Mathematics University of California, Berkeley April 5, 2016 Joint work with D. Anderson, J. Deursch, C. Melgaard, J.
More informationParallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation
Parallel Library Software for the Multishift QR Algorithm with Aggressive Early Deflation ROBERT GRANAT, Umeå University BO KÅGSTRÖM, Umeå University DANIEL KRESSNER, EPF Lausanne MEIYUE SHAO, Umeå University
More informationParallel Circuit Simulation. Alexander Hayman
Parallel Circuit Simulation Alexander Hayman In the News LTspice IV features multi-threaded solvers designed to better utilize current multicore processors. Also included are new SPARSE matrix solvers
More informationQR Decomposition in a Multicore Environment
QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance
More informationBindel, Fall 2016 Matrix Computations (CS 6210) Notes for
1 Logistics Notes for 2016-09-14 1. There was a goof in HW 2, problem 1 (now fixed) please re-download if you have already started looking at it. 2. CS colloquium (4:15 in Gates G01) this Thurs is Margaret
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationHierarchical QR factorization algorithms for multi-core cluster systems
Hierarchical QR factorization algorithms for multi-core cluster systems Jack Dongarra Mathieu Faverge Thomas Herault Julien Langou Yves Robert University of Tennessee Knoxville, USA University of Colorado
More informationTradeoffs between synchronization, communication, and work in parallel linear algebra computations
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel Department of EECS, UC Berkeley February,
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationINITIAL INTEGRATION AND EVALUATION
INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52
More informationTiled QR factorization algorithms
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Tiled QR factorization algorithms Henricus Bouwmeester Mathias Jacuelin Julien Langou Yves Robert N 7601 Avril 2011 Distributed and High
More informationLarge Scale Sparse Linear Algebra
Large Scale Sparse Linear Algebra P. Amestoy (INP-N7, IRIT) A. Buttari (CNRS, IRIT) T. Mary (University of Toulouse, IRIT) A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L Excellent (INRIA, LIP, ENS-Lyon)
More informationSparse linear solvers
Sparse linear solvers Laura Grigori ALPINES INRIA and LJLL, UPMC On sabbatical at UC Berkeley March 2015 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Sparse Cholesky
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar
More informationOptimized LU-decomposition with Full Pivot for Small Batched Matrices S3069
Optimized LU-decomposition with Full Pivot for Small Batched Matrices S369 Ian Wainwright High Performance Consulting Sweden ian.wainwright@hpcsweden.se Based on work for GTC 212: 1x speed-up vs multi-threaded
More informationReview for the Midterm Exam
Review for the Midterm Exam 1 Three Questions of the Computational Science Prelim scaled speedup network topologies work stealing 2 The in-class Spring 2012 Midterm Exam pleasingly parallel computations
More informationSOLUTION of linear systems of equations of the form:
Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of
More informationEvaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries
Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline
More informationThe Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationImprovements for Implicit Linear Equation Solvers
Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS
More informationMatrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)
Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing) alfredo.buttari@enseeiht.fr for an up-to-date version of the slides: http://buttari.perso.enseeiht.fr Introduction Objective
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More information