Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009
2 Background and motivation
Running time of an algorithm is the sum of three terms:
1. # flops × time-per-flop
2. # words moved / bandwidth (memory communication)
3. # messages × latency (network communication)
Hardware trend: exponentially growing gaps. Annual improvement rates:
Time-per-flop 59%/year vs. 1/memory bandwidth 23%/year vs. memory latency 5.5%/year
Time-per-flop 59%/year vs. 1/network bandwidth 26%/year vs. network latency 15%/year
Goal: re-design matrix algorithms to reduce communication and optimize performance.
Recent work by Dongarra, Demmel, et al., ...
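The three-term running-time model above can be written down directly. The following is a minimal sketch, assuming hypothetical machine parameters (the `Machine` class, its field names and the numbers are illustrative, not taken from the talk); it also shows how quickly a gap widens when two quantities improve at different yearly rates.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    time_per_flop: float  # seconds per floating-point operation
    bandwidth: float      # words moved per second
    latency: float        # seconds per message


def running_time(flops, words, messages, machine):
    """Sum of the three terms: arithmetic + bandwidth cost + latency cost."""
    return (flops * machine.time_per_flop
            + words / machine.bandwidth
            + messages * machine.latency)


def gap_growth(rate_fast, rate_slow, years):
    """Factor by which the gap between two yearly improvement rates widens."""
    return ((1.0 + rate_fast) / (1.0 + rate_slow)) ** years


if __name__ == "__main__":
    # Hypothetical machine, for illustration only.
    m = Machine(time_per_flop=1e-9, bandwidth=1e8, latency=1e-6)
    print(running_time(flops=2e9, words=1e6, messages=1e3, machine=m))
    # Flop rate (59%/year) versus memory bandwidth (23%/year) over a decade.
    print(gap_growth(0.59, 0.23, years=10))
```

With rates like those quoted on the slide, the flop term shrinks much faster than the other two, which is why the communication terms end up dominating.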
3 Background and motivation
Experts' benchmark: matrix multiplication of the MKL library on Intel Xeon EMT64.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 8 cores and more), with 16xdgemm-seq as the reference curve.]
4 Matrix multiplication of the ESSL library on an IBM Power 6.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 8, 16 cores and more), with 32xdgemm-seq as the reference curve.]
5 PLASMA DGETRF on Intel Xeon EMT64.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 6, 8, 10, 12, 14 cores and more), with 16xdssssm-seq as the reference curve.]
Courtesy of the Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al.)
6 PLASMA DGEQRF on Intel Xeon EMT64.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 6, 8, 10, 12, 14 cores and more), with 16xdssrfb-seq as the reference curve.]
Courtesy of the Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al.)
7 PLASMA DGETRF on IBM Power 6.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 8, 16 cores and more), with the sequential reference curve xdssssm-seq.]
Courtesy of the Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al.)
8 PLASMA DGEQRF on IBM Power 6.
[Figure: Gflop/s versus matrix size, one curve per core count (1, 2, 4, 8, 16 cores and more), with 32xdssrfb-seq as the reference curve.]
Courtesy of the Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al.)
9 Background and motivation
Poor man's benchmark data: DGEMM, DGEQRF and DGEQP3 of MKL 10.2 on Intel Core(TM) i7 Quad 2.66GHz.
[Figure, two panels: speed (Gflop/s) on the quad-core and scalability (n = 12,000), each with curves for DGEMM, DGEQRF and DGEQP3.]
Message: pivoting is expensive!
10 Background and motivation
Poor man's benchmark data: PDGEMM, PDGEQRF and PDGEQPF (pivoted QR) of PBLAS and ScaLAPACK on Cray XT4 (Franklin of NERSC).
[Table: GFlop/s (n = 12000) of PDGEMM, PDGEQRF and PDGEQPF for increasing processor grids; the numerical entries did not survive extraction.]
Message: pivoting is even more expensive on hybrid systems!
11 [Figure: GFLOPS (logarithmic axis) of pdgemm, pdgeqrf and pdgeqpf versus number of processors, for processor grids from 2x2 to 16x16.]
12 Question and the rest of my talk
Question: how to re-design higher-level matrix algorithms using the fastest matrix operations?
Two case studies:
1. Communication-reducing algorithm for Green's function calculation and application in Quantum Monte Carlo (QMC) simulations
2. Optimizing Halley's iteration for computing the matrix polar decomposition and application in First-Principles Molecular Dynamics (FPMD)
13 Communication-reducing algorithm for Green's function calculation
Joint work with Roger Lee (National Tsinghua Univ., Taiwan), Wenbin Chen (Fudan University, China) and Richard Scalettar (Physics, UC Davis)
14 Computational Material Science
Simulation and understanding of properties of solid-state materials: magnetism, metal-insulator transition, high temperature superconductivity, ...
Many-body simulation on multi-layer lattices using the Hubbard model and quantum Monte Carlo simulation.
QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for determinant (and hybrid) Monte Carlo simulations.
15 QUEST kernel: Green's function (matrix) calculation
$$G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1},$$
where
$B_i = B V_i = e^{\tau K}\,\mathrm{diag}(e^{\pm\lambda})$,
$K$ is the hopping matrix, an adjacency matrix of the lattice,
$\tau$ is the time discretization parameter,
$L$ is the number of time slices,
$\lambda = \cosh^{-1}(e^{U\tau/2})$,
$U$ is a potential energy parameter for local repulsion between electrons.
Challenge: $B_L B_{L-1} \cdots B_2 B_1$ is extremely ill-conditioned!
16 Physicist's method [Loh et al 89, ... 05]
1. Compute a UDT decomposition of the product,
$$B_L B_{L-1} \cdots B_2 B_1 = U_L D_L T_L,$$
where $D_L = \mathrm{diag}(\times, \times, \ldots, \times)$ is stratified (graded).
2. Compute $G = T_L^{-1} (U_L^T T_L^{-1} + D_L)^{-1} U_L^T$.
The pivoted QR decomposition is used for the stratification.
There is a body of work on products of matrices by numerical linear algebraists [Stewart, ...].
17 Physicist's method [Loh et al]
1. Compute the pivoted QR decomposition $B_1 = Q_1 R_1 P_1^T$.
2. Set $U_1 = Q_1$, $D_1 = \mathrm{diag}(R_1)$, $T_1 = D_1^{-1} R_1 P_1^T$.
3. For $i = 2, 3, \ldots, L$:
Compute $C_i = (B_i U_{i-1}) D_{i-1}$ (column scaling).
Compute the pivoted QR decomposition $C_i = Q_i R_i P_i^T$.
Set $U_i = Q_i$, $D_i = \mathrm{diag}(R_i)$, $T_i = (D_i^{-1} R_i)(P_i^T T_{i-1})$.
4. Compute $G = T_L^{-1} (U_L^T T_L^{-1} + D_L)^{-1} U_L^T$.
18 However, pivoting is very expensive on multicore
Intel Core(TM) Quad 2.4GHz.
DGEMM: matrix-matrix multiplication
DGEQRF: QR decomposition
DGEQP3: pivoted QR decomposition
[Figure, two panels: speed (Gflop/s) versus problem size, and scalability (Gflop/s versus number of processors), each with curves for DGEMM, DGEQRF and DGEQP3.]
19 Pivoting is even more expensive on distributed systems
Cray XT4 (Franklin of NERSC).
[Figure: GFLOPS (logarithmic axis) of pdgemm, pdgeqrf and pdgeqpf versus number of processors, for processor grids from 2x2 to 16x16.]
20 SOF algorithm - without the pivoting
Structured orthogonal factorization (SOF) algorithm
1. Set $M_2 = I$, $A_2 = B_1$.
2. For $i = 2, 3, \ldots, L$:
(a) compute the QR decomposition
$$\begin{bmatrix} M_i \\ -B_i \end{bmatrix} = \begin{bmatrix} Q^{(i)}_{11} & Q^{(i)}_{12} \\ Q^{(i)}_{21} & Q^{(i)}_{22} \end{bmatrix} \begin{bmatrix} R_i \\ 0 \end{bmatrix};$$
(b) update
$$A_{i+1} = (Q^{(i)}_{12})^T A_i, \qquad M_{i+1} = (Q^{(i)}_{22})^T.$$
3. Compute $G = (M_{L+1} + A_{L+1})^{-1} M_{L+1}$, using the final iterates of step 2.
Computational kernel: QR decomposition without pivoting.
Price: SOF takes about 3 times more flops than the physicist's method.
21 SOF algorithm: motivation and correctness
Motivation: connection between the Green's function and the inverse of a block p-cyclic matrix:
$$G = (I + B_L B_{L-1} \cdots B_2 B_1)^{-1} = (L,L)\text{-subblock of } M^{-1},$$
where
$$M = \begin{bmatrix} I & & & & B_1 \\ -B_2 & I & & & \\ & -B_3 & I & & \\ & & \ddots & \ddots & \\ & & & -B_L & I \end{bmatrix}.$$
Correctness: by direct verification.
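As a small worked check of the $(L,L)$-subblock claim, take $L = 2$, assuming $M$ has identity diagonal blocks, $-B_2$ below the diagonal and $B_1$ in the top-right corner (the minus signs are an assumption restored from the derivation, since they are not printed in the transcript). The $(2,2)$ block of $M^{-1}$ is the inverse of the Schur complement of the leading block:

```latex
M = \begin{bmatrix} I & B_1 \\ -B_2 & I \end{bmatrix},
\qquad
(M^{-1})_{2,2} = \bigl( I - (-B_2)\, I^{-1} B_1 \bigr)^{-1}
               = (I + B_2 B_1)^{-1} = G .
```

The same elimination, applied block row by block row, gives the general case.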
22 SOF algorithm: correctness
By the QR decomposition, we have
$$(Q^{(i)}_{12})^T M_i - (Q^{(i)}_{22})^T B_i = 0.$$
Since $M_2 = I$ and $M_i = (Q^{(i-1)}_{22})^T$ for $i > 2$, the above equations can be rewritten as
$$(Q^{(2)}_{12})^T = (Q^{(2)}_{22})^T B_2,$$
$$(Q^{(i)}_{12})^T (Q^{(i-1)}_{22})^T = (Q^{(i)}_{22})^T B_i \quad \text{for } i \ge 3.$$
Note that the $Q^{(i)}_{22}$ are invertible, so
$$(Q^{(2)}_{12})^T = (Q^{(2)}_{22})^T B_2,$$
$$(Q^{(i)}_{12})^T = (Q^{(i)}_{22})^T B_i \,(Q^{(i-1)}_{22})^{-T} \quad \text{for } i \ge 3.$$
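23 Hence
$$(Q^{(L)}_{12})^T \cdots (Q^{(3)}_{12})^T (Q^{(2)}_{12})^T = (Q^{(L)}_{22})^T B_L \cdots B_3 B_2.$$
By steps 1 and 2(b) of SOF, the final iterate is
$$A_{L+1} = (Q^{(L)}_{12})^T \cdots (Q^{(3)}_{12})^T (Q^{(2)}_{12})^T B_1.$$
Hence
$$A_{L+1} = (Q^{(L)}_{22})^T B_L \cdots B_3 B_2 B_1.$$
In summary, with $M_{L+1} = (Q^{(L)}_{22})^T$,
$$(M_{L+1} + A_{L+1})^{-1} M_{L+1} = \bigl[ (Q^{(L)}_{22})^T + (Q^{(L)}_{22})^T B_L \cdots B_2 B_1 \bigr]^{-1} (Q^{(L)}_{22})^T = (I + B_L \cdots B_2 B_1)^{-1} = G.$$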
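The recurrence can be checked numerically. The following is a minimal NumPy sketch, not the QUEST/LAPACK implementation discussed in the talk: it assumes the QR step factors the stacked matrix $[M_i; -B_i]$ (the minus sign is restored from the derivation) and compares the result with the directly formed inverse on small, well-conditioned test matrices. The function names are illustrative.

```python
import numpy as np


def green_sof(Bs):
    """G = (I + B_L ... B_1)^{-1} via unpivoted QR; Bs = [B_1, ..., B_L]."""
    n = Bs[0].shape[0]
    M = np.eye(n)        # M_2 = I
    A = Bs[0].copy()     # A_2 = B_1
    for B in Bs[1:]:     # i = 2, ..., L
        # Full QR of the stacked 2n-by-n matrix [M_i; -B_i].
        Q, _ = np.linalg.qr(np.vstack([M, -B]), mode="complete")
        Q12, Q22 = Q[:n, n:], Q[n:, n:]
        A = Q12.T @ A    # A_{i+1} = (Q12^(i))^T A_i
        M = Q22.T        # M_{i+1} = (Q22^(i))^T
    return np.linalg.solve(M + A, M)


def green_direct(Bs):
    """Reference: form the product explicitly (only safe for tame matrices)."""
    n = Bs[0].shape[0]
    P = np.eye(n)
    for B in Bs:
        P = B @ P        # accumulates B_L ... B_1
    return np.linalg.inv(np.eye(n) + P)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Bs = [np.eye(4) + 0.1 * rng.standard_normal((4, 4)) for _ in range(6)]
    print(np.linalg.norm(green_sof(Bs) - green_direct(Bs)))
```

Only unpivoted QR factorizations and matrix products appear in the loop, which is the point of the method.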
24 SOF algorithm: implementation
QR decomposition in LAPACK:
1. DGEQRF: computes the QR decomposition; Q is stored implicitly.
2. DORGQR: forms the full Q-factor explicitly,
$$Q = I - V T V^T,$$
where $T$ is an $n \times n$ upper triangular matrix and $V$ is $2n \times n$ (DLARFT).
25 SOF algorithm: implementation 1
Call DGEQRF for the QR decomposition, then explicitly form the full Q-factor with DORGQR.
26 SOF algorithm: implementation 2
Modify DORGQR to compute $Q^{(i)}_{12}$ and $Q^{(i)}_{22}$ directly. Let $V$ be conformally partitioned,
$$V = \begin{bmatrix} V_u \\ V_d \end{bmatrix}. \qquad (1)$$
Then the right half of the Q-factor can be computed by
$$Q^{(i)}_{12} = -V_u T V_d^T, \qquad Q^{(i)}_{22} = I - V_d T V_d^T.$$
Note that $V_u$ is a lower triangular matrix, which is used to reduce the cost of the matrix multiplication.
27 SOF algorithm: implementation 3
Modify DGEQRF to compute the $T$ matrix recursively. If
$$T = \begin{bmatrix} T_{11} & T_{12} & \cdots & T_{1b} \\ & T_{22} & \cdots & T_{2b} \\ & & \ddots & \vdots \\ & & & T_{bb} \end{bmatrix},$$
DGEQRF produces the diagonal blocks $T_{ii}$ in order explicitly.
Consider two consecutive block Householder transformations $H_1$ and $H_2$. Then
$$H_1 H_2 = (I - V_1 T_1 V_1^T)(I - V_2 T_2 V_2^T) = I - \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} T_1 & -T_1 V_1^T V_2 T_2 \\ 0 & T_2 \end{bmatrix} \begin{bmatrix} V_1^T \\ V_2^T \end{bmatrix}.$$
The $T$ matrix of the merged Householder transformation can be computed with the additional computation of $T_1 V_1^T V_2 T_2$.
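The merge formula is a purely algebraic identity, so it can be checked with arbitrary blocks. A minimal NumPy sketch follows (the helper name `merge_t` is illustrative; this is not the modified DGEQRF itself):

```python
import numpy as np


def merge_t(V1, T1, V2, T2):
    """T factor of H1 H2 = I - [V1 V2] T [V1 V2]^T from two block reflectors."""
    T12 = -T1 @ (V1.T @ V2) @ T2   # the only additional computation
    k1, k2 = T1.shape[0], T2.shape[0]
    return np.block([[T1, T12], [np.zeros((k2, k1)), T2]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, k1, k2 = 8, 3, 2
    V1, V2 = rng.standard_normal((m, k1)), rng.standard_normal((m, k2))
    T1 = np.triu(rng.standard_normal((k1, k1)))
    T2 = np.triu(rng.standard_normal((k2, k2)))
    H1 = np.eye(m) - V1 @ T1 @ V1.T
    H2 = np.eye(m) - V2 @ T2 @ V2.T
    V = np.hstack([V1, V2])
    T = merge_t(V1, T1, V2, T2)
    print(np.linalg.norm(H1 @ H2 - (np.eye(m) - V @ T @ V.T)))
```

The merged factor stays upper triangular, so the merge can be applied repeatedly to build the full $T$ block by block.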
28 Performance I
Green's matrix $G = (I + B_{96} B_{95} \cdots B_2 B_1)^{-1}$ on a lattice ($n = 1024$).
[Tables: CPU elapsed time and GFlop/s rate on 1, 2 and 4 cores for Loh's method and the SOF implementations; the numerical entries did not survive extraction.]
Platform: Intel Core(TM) i7 Quad 2.66GHz, ifort Fortran and MKL.
29 Performance II
Green's matrix $G = (I + B_{12} B_{11} \cdots B_2 B_1)^{-1}$ on a lattice ($n = 4096$).
[Tables: CPU elapsed time and GFlop/s rate on processor grids (4x4), (8x8), (16x16) and (32x32) for Loh's method and the SOF implementations; the numerical entries did not survive extraction.]
Platform as printed on the slide: Intel Core(TM) i7 Quad 2.66GHz, ifort Fortran and MKL.
30 Further reading - I
Green's function calculation, which may appear in different forms, is one of the fundamental problems in nano-scale simulations. The physical background and related numerical linear algebra problems are discussed in detail in the following reference:
Z. Bai, W. Chen, R. Scalettar, I. Yamazaki, Numerical Methods for Quantum Monte Carlo Simulations of the Hubbard Model, in Multi-Scale Phenomena in Complex Fluids, edited by T. Y. Hou et al, Higher Education Press, 2009 (114 pages).
A pre-print of the reference is available at bai/bcsy09.pdf
31 Optimized Halley's iteration for computing the matrix polar decomposition
Joint work with Yuji Nakatsukasa (Applied Math, UC Davis) and Francois Gygi (Applied Science/CS, UC Davis)
32 Electronic structure calculations
Investigate the properties of solids, liquids, biomolecules, and nanoparticles, ...
First-principles molecular dynamics (FPMD) based on the plane-wave, pseudopotential formalism.
Qbox: a C++/MPI implementation of FPMD for massively parallel computers (F. Gygi et al).
33 Subspace alignment in self-consistent iteration
Given wave functions $Y(t-1)$ and $Y(t)$ at time steps $t-1$ and $t$:
Compute the QR decomposition $Y(t) = \hat{Y}(t) R$ (tall and skinny QR).
Solve the Procrustes problem $\min_Q \| Y(t-1) - Y(t) Q \|_F$.
Compute $\tilde{Y} = 2 \hat{Y}(t) Q - Y(t-1)$.
Obtain the trial wavefunction $\hat{Y}(t+1)$ for the time step $t+1$ by QR: $\tilde{Y}(t) = \hat{Y}(t+1) R$.
Ref: [Arias et al 92, Edelman et al 99, Fattebert and Gygi 04]
34 Matrix polar decomposition
Given $A \in \mathbb{C}^{n \times n}$, compute the polar decomposition
$$A = UH,$$
where $U^H U = I$ and $H^H = H \ge 0$.
Application: orthogonal Procrustes problem in subspace alignment of the SCF iteration for solving the Kohn-Sham equation in electronic structure calculation: min U X i X i U
35 Newton's iteration
Newton's iteration:
$$X_{k+1} = \tfrac{1}{2} \bigl( X_k + X_k^{-H} \bigr), \qquad X_0 = A.$$
$X_k \to U$ as $k \to \infty$.
Singular values $\sigma_i(X_k) \to 1$ as $k \to \infty$.
The convergence is slow when $A$ is ill-conditioned.
Scaled Newton's iteration:
$$X_{k+1} = \tfrac{1}{2} \bigl( \zeta_k X_k + (\zeta_k X_k)^{-H} \bigr), \qquad X_0 = A.$$
The inverse $X_k^{-1}$ has to be computed explicitly. ...
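A minimal NumPy sketch of the unscaled Newton iteration is given below, to make the explicit inverse in each step visible. The stopping rule (relative change between iterates) and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np


def polar_newton(A, tol=1e-12, max_iter=100):
    """Unscaled Newton iteration X <- (X + X^{-H}) / 2, started from X_0 = A."""
    X = np.asarray(A)
    for k in range(1, max_iter + 1):
        # The explicit inverse is the costly, communication-heavy part.
        X_new = 0.5 * (X + np.linalg.inv(X).conj().T)
        change = np.linalg.norm(X_new - X, "fro")
        X = X_new
        if change <= tol * np.linalg.norm(X, "fro"):
            return X, k
    return X, max_iter


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5)) + 3.0 * np.eye(5)
    U, iters = polar_newton(A)
    H = U.conj().T @ A   # Hermitian factor of A = U H
    print(iters, np.linalg.norm(U.conj().T @ U - np.eye(5)))
```

For an ill-conditioned `A` the iteration count of this unscaled version grows, which is the motivation for the scaled variants.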
36 Newton's iteration variant
Scaled Newton's iteration variant (SNV):
$$X_{k+1} = 2 \eta_k X_k \bigl( I + \eta_k^2 X_k^H X_k \bigr)^{-1}, \qquad Y_0 = \eta_0 A.$$
Inverse-free implementation: SNV can be implemented using a QR decomposition (discussed later).
Unfortunately, even the inverse-free implementation is not numerically stable, as discovered independently by [Crudge 98] and [Byers and Xu 01].
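37 Halley's iteration
Halley's iteration:
$$X_{k+1} = X_k (3I + X_k^H X_k)(I + 3 X_k^H X_k)^{-1}, \qquad X_0 = A.$$
There is a family of iterations based on rational function approximation.
Theorem. $X_k \to U$ and the asymptotic rate is cubic.
The initial steps could still be slow. Consider the $2 \times 2$ matrix
$$X_0 = A = \begin{bmatrix} 1 & \\ & x_0 \end{bmatrix}$$
(the value of $x_0$ did not survive extraction). The $k$th Halley iterate is given by
$$X_k = \begin{bmatrix} 1 & \\ & x_k \end{bmatrix}, \qquad x_k = \frac{x_{k-1}(3 + x_{k-1}^2)}{1 + 3 x_{k-1}^2}.$$
It takes 24 iterations to satisfy $\|I - X_{24}\| <$ a tolerance that did not survive extraction.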
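The slow start is easy to reproduce with the scalar recurrence for $x_k$. The sketch below uses a hypothetical starting value `x0 = 1e-10` and a hypothetical tolerance `1e-12`; neither is taken from the slide, whose own values are not legible here.

```python
def halley_step(x):
    """One scalar Halley step x <- x (3 + x^2) / (1 + 3 x^2)."""
    return x * (3.0 + x * x) / (1.0 + 3.0 * x * x)


def halley_count(x0, tol, max_iter=1000):
    """Number of steps until |1 - x_k| < tol, together with the final x_k."""
    x, k = x0, 0
    while abs(1.0 - x) >= tol and k < max_iter:
        x = halley_step(x)
        k += 1
    return k, x


if __name__ == "__main__":
    # Hypothetical ill-conditioned example: x0 plays the role of 1 / kappa(A).
    k, x = halley_count(1e-10, 1e-12)
    print(k, x)
```

For small $x$ each step only multiplies $x$ by about 3, so roughly $\log_3(1/x_0)$ steps are spent before the cubic convergence sets in.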
38 Dynamically weighted Halley's iteration
Dynamically weighted Halley's (DWH) iteration:
$$X_{k+1} = X_k (a_k I + b_k X_k^H X_k)(I + c_k X_k^H X_k)^{-1},$$
where $X_0 = A/\alpha$ and $\alpha \ge \|A\|_2$.
Question: how to dynamically choose $a_k$, $b_k$ and $c_k$ to optimize the convergence rate?
39 How to dynamically determine $a_k$, $b_k$ and $c_k$? - first step
Let $l_0$ be chosen such that
$$[\sigma_{\min}(X_0), \sigma_{\max}(X_0)] \subseteq [l_0, 1] \subset (0, 1].$$
Then $\sigma_i(X_1) = g_0(\sigma_i(X_0))$, and
$$[\sigma_{\min}(X_1), \sigma_{\max}(X_1)] \subseteq \Bigl[ \min_{l_0 \le x \le 1} g_0(x),\ \max_{l_0 \le x \le 1} g_0(x) \Bigr],$$
where
$$g_0(x) = x\,\frac{a_0 + b_0 x^2}{1 + c_0 x^2}.$$
Rational optimization problem: find $a_0, b_0, c_0$ such that
(1) $0 < g_0(x) \le 1$ for $l_0 < x \le 1$;
(2) the max-min property holds: $\max_{a_0, b_0, c_0} \bigl\{ \min_{l_0 \le x \le 1} g_0(x) \bigr\}$.
Update $l_1 = \min_{l_0 \le x \le 1} g_0(x)$, such that
$$[\sigma_{\min}(X_1), \sigma_{\max}(X_1)] \subseteq [l_1, 1] \subset (0, 1].$$
40 How to dynamically determine $a_k$, $b_k$ and $c_k$? - general case
At the $(k+1)$st iteration, we have
$$[\sigma_{\min}(X_k), \sigma_{\max}(X_k)] \subseteq [l_k, 1] \subset (0, 1]$$
and
$$[\sigma_{\min}(X_{k+1}), \sigma_{\max}(X_{k+1})] \subseteq \Bigl[ \min_{l_k \le x \le 1} g_k(x),\ \max_{l_k \le x \le 1} g_k(x) \Bigr],$$
where
$$g_k(x) = x\,\frac{a_k + b_k x^2}{1 + c_k x^2}.$$
Rational optimization problem: find $a_k$, $b_k$ and $c_k$ such that
(1) $0 < g_k(x) \le 1$ for $l_k < x \le 1$;
(2) the max-min property holds: $\max_{a_k, b_k, c_k} \bigl\{ \min_{l_k \le x \le 1} g_k(x) \bigr\}$.
Update $l_{k+1} = \min_{l_k \le x \le 1} g_k(x)$.
41 Solving the rational optimization problem
For the special three-parameter case, we have developed a direct method to obtain analytic expressions for $a_k$, $b_k$ and $c_k$: tedious, but doable.
In general, for high-order Halley's iterations we can first reformulate the problem and then use an SDP solver, say SeDuMi or YALMIP.
Applications of the rational optimization problem in shape optimization of transfer functions [Nie et al 09].
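The analytic expressions themselves are not given on the slides. The sketch below uses the closed form as recalled from the published paper by Nakatsukasa, Bai and Gygi; treat these formulas as an assumption to be checked against that paper. The lower-bound update assumes that the minimum of $g_k$ on $[l_k, 1]$ is attained at $x = l_k$. Function names are illustrative.

```python
import math


def dwh_coefficients(l):
    """Weights (a, b, c) for a lower bound l in (0, 1] (assumed closed form)."""
    d = (4.0 * (1.0 - l * l) / l**4) ** (1.0 / 3.0)
    s = math.sqrt(1.0 + d)
    a = s + 0.5 * math.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l * l) / (l * l * s))
    b = (a - 1.0) ** 2 / 4.0
    c = a + b - 1.0
    return a, b, c


def dwh_iterations(l0, tol, max_iter=50):
    """Predicted number of steps until the lower bound satisfies 1 - l < tol."""
    l, k = l0, 0
    while 1.0 - l >= tol and k < max_iter:
        a, b, c = dwh_coefficients(l)
        # New lower bound g(l); clamp so rounding cannot push it above 1.
        l = min(1.0, l * (a + b * l * l) / (1.0 + c * l * l))
        k += 1
    return k, l


if __name__ == "__main__":
    print(dwh_coefficients(1.0))        # reduces to Halley's weights at l = 1
    print(dwh_iterations(1e-10, 1e-12))
```

At $l = 1$ the formulas give $(a, b, c) = (3, 1, 3)$, which is exactly Halley's iteration, so the dynamic weights only differ from Halley while the singular values are still spread out.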
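42 Convergence theorem and number of iterations
Theorem. $X_k \to U$ and the asymptotic rate is cubic.
For the $2 \times 2$ example, 5 iterations are needed such that $\|I - X_5\| <$ a tolerance that did not survive extraction.
In general, we have the following theoretical prediction for the number of iterations:
[Table: predicted iteration counts of SN and DWH as a function of $\kappa(A)$; the numerical entries did not survive extraction.]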
43 Inverse-free implementation
Theorem. Given the QR decomposition
$$\begin{bmatrix} \eta A \\ I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R,$$
then
$$Q_1 Q_2^H = \eta A (I + \eta^2 A^H A)^{-1}.$$
[Zha & Zhang 96, Higham 07]
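The theorem is straightforward to confirm numerically. A minimal NumPy sketch, with an illustrative function name and arbitrary test data:

```python
import numpy as np


def qr_inverse_free(A, eta):
    """eta*A (I + eta^2 A^H A)^{-1} computed from a QR factorization only."""
    m, n = A.shape
    # Reduced QR of the stacked (m+n)-by-n matrix [eta*A; I].
    Q, _ = np.linalg.qr(np.vstack([eta * A, np.eye(n)]))
    Q1, Q2 = Q[:m], Q[m:]
    return Q1 @ Q2.conj().T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    eta = 0.7
    ref = eta * A @ np.linalg.inv(np.eye(4) + eta**2 * A.conj().T @ A)
    print(np.linalg.norm(qr_inverse_free(A, eta) - ref))
```

The identity follows from $R^H R = I + \eta^2 A^H A$, $Q_1 = \eta A R^{-1}$ and $Q_2 = R^{-1}$, so no explicit inverse is ever formed.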
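44 Inverse-free implementation - DWH
Dynamically weighted Halley's (DWH) iteration:
$$X_{k+1} = X_k (a_k I + b_k X_k^H X_k)(I + c_k X_k^H X_k)^{-1} = \frac{b_k}{c_k} X_k + \Bigl( a_k - \frac{b_k}{c_k} \Bigr) X_k (I + c_k X_k^H X_k)^{-1}.$$
QR-based implementation of DWH:
$$\begin{bmatrix} \sqrt{c_k}\, X_k \\ I \end{bmatrix} = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R,$$
$$X_{k+1} = \frac{b_k}{c_k} X_k + \frac{1}{\sqrt{c_k}} \Bigl( a_k - \frac{b_k}{c_k} \Bigr) Q_1 Q_2^H.$$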
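One QR-based step can be compared against the direct formula. The sketch below takes the weights as inputs and uses Halley's weights $(3, 1, 3)$ as an example; the function names are illustrative and this is not the authors' implementation.

```python
import numpy as np


def dwh_step_qr(X, a, b, c):
    """One DWH step computed from the QR factorization of [sqrt(c) X; I]."""
    m, n = X.shape
    Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
    Q1, Q2 = Q[:m], Q[m:]
    return (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.conj().T)


def dwh_step_direct(X, a, b, c):
    """Reference step with an explicit inverse."""
    n = X.shape[1]
    G = X.conj().T @ X
    return X @ (a * np.eye(n) + b * G) @ np.linalg.inv(np.eye(n) + c * G)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 5))
    X /= np.linalg.norm(X, 2)          # scale so that ||X||_2 = 1
    diff = dwh_step_qr(X, 3.0, 1.0, 3.0) - dwh_step_direct(X, 3.0, 1.0, 3.0)
    print(np.linalg.norm(diff))
```

Each step then costs one QR factorization of a $2n \times n$ matrix and one matrix product, both of which map onto fast, communication-friendly kernels.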
45 Example: verification of theoretical iteration count
Diagonal matrices $A$.
[Table: iteration numbers and final residual norms for Halley, Gander and DWH as $\kappa(A)$ varies; most numerical entries did not survive extraction.]
46 Example: Q-DWH and scaled Newton's iteration
Random matrices with varying condition numbers.
[Table: minimum and maximum iteration counts and residuals of Q-DWH and scaled Newton (SN) for several values of kappa; most numerical entries did not survive extraction.]
Multicore implementation is in progress.
47 Further reading - II
The inverse-free implementation of Halley's iteration is described in the following paper:
Y. Nakatsukasa, Z. Bai and F. Gygi, Optimizing Halley's iteration for computing the matrix polar decomposition, submitted, 2009.
It is available on our class website.
48 Concluding remarks
1. Our case studies illustrate the ideas of designing communication-reducing numerical linear algebra algorithms that balance flops, communication and stability, and optimize performance on modern computer architectures.
2. Hardware trends mean the time has come to do it!
3. There is a lot of prior work in the numerical linear algebra community, but even more open problems.
4. Why stop at numerical linear algebra?
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationDirect methods for symmetric eigenvalue problems
Direct methods for symmetric eigenvalue problems, PhD McMaster University School of Computational Engineering and Science February 4, 2008 1 Theoretical background Posing the question Perturbation theory
More informationSome notes on efficient computing and setting up high performance computing environments
Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient
More informationLecture 8: Fast Linear Solvers (Part 7)
Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationThe Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationOn GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations
Max Planck Institute Magdeburg Preprints Martin Köhler Jens Saak On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of
More informationA High-Performance Parallel Hybrid Method for Large Sparse Linear Systems
Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,
More informationThe Ongoing Development of CSDP
The Ongoing Development of CSDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Joseph Young Department of Mathematics New Mexico Tech (Now at Rice University)
More informationLattice Boltzmann simulations on heterogeneous CPU-GPU clusters
Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts
More informationSusumu YAMADA 1,3 Toshiyuki IMAMURA 2,3, Masahiko MACHIDA 1,3
Dynamical Variation of Eigenvalue Problems in Density-Matrix Renormalization-Group Code PP12, Feb. 15, 2012 1 Center for Computational Science and e-systems, Japan Atomic Energy Agency 2 The University
More informationAlgorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems
Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for
More informationNotes on Householder QR Factorization
Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationHPMPC - A new software package with efficient solvers for Model Predictive Control
- A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model
More informationFinite-choice algorithm optimization in Conjugate Gradients
Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the
More informationLecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University
Lecture 13 Stability of LU Factorization; Cholesky Factorization Songting Luo Department of Mathematics Iowa State University MATH 562 Numerical Analysis II ongting Luo ( Department of Mathematics Iowa
More informationPorting a Sphere Optimization Program from lapack to scalapack
Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential
More informationPorting a sphere optimization program from LAPACK to ScaLAPACK
Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference
More informationLarge Scale Sparse Linear Algebra
Large Scale Sparse Linear Algebra P. Amestoy (INP-N7, IRIT) A. Buttari (CNRS, IRIT) T. Mary (University of Toulouse, IRIT) A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L Excellent (INRIA, LIP, ENS-Lyon)
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationParallel Singular Value Decomposition. Jiaxing Tan
Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector
More informationMinimizing Communication in Linear Algebra. James Demmel 15 June
Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on
More informationA New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation.
1 A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation João Carvalho, DMPA, Universidade Federal do RS, Brasil Karabi Datta, Dep MSc, Northern Illinois University, DeKalb, IL
More informationEfficient implementation of the overlap operator on multi-gpus
Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator
More informationA DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION
SIAM J MATRIX ANAL APPL Vol 0, No 0, pp 000 000 c XXXX Society for Industrial and Applied Mathematics A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION WEI XU AND SANZHENG QIAO Abstract This paper
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationOpportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem
Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Peter Benner, Andreas Marek, Carolin Penke August 16, 2018 ELSI Workshop 2018 Partners: The Problem The Bethe-Salpeter
More informationAlgorithms and Methods for Fast Model Predictive Control
Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More information