Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009

Size: px
Start display at page:

Download "Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures. Based on the presentation at UC Berkeley, October 7, 2009"

Transcription

1 III.1 Re-design of Higher level Matrix Algorithms for Multicore and Heterogeneous Architectures Based on the presentation at UC Berkeley, October 7, 2009

2 Background and motivation Running time of an algorithm is the sum of three items: 1. # flops time-per-flop 2. # words moved/bandwith (memory communication) 3. # messages latency (network communication) Hardward trend: exponentially growing gaps Time-per-flop 1/Memory BW Memory Latency improving 59%/year vs 23%/year vs 5.5%/year Time-per-flop 1/Network BW Network Latency improving 59%/year vs 26%/year vs 15%/year Goal: re-design matrix algorithms to reduce communication and optimize the performance Recent work by Dongarra, Demmel, et al,...

3 Background and motivation Experts benchmark Matrix multiplication of MKL library on Intel Xeon EMT64: cores 8 cores 4 cores 2 cores 1 core 16xdgemm-seq Matrix size matrix size and Gflops

4 Matrix multiplication of ESSL library on an IBM Power 6: cores 16 cores 8 cores 4 cores 2 cores 1 core 32xdgemm-seq Matrix size matrix size and Gflops

5 PLASMA DGETRF on Intel Xeon EMT cores 14 cores 12 cores 10 cores 8 cores 6 cores 4 cores 2 cores 1 core 16xdssssm-seq Matrix size matrix size and Gflops Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

6 PLASMA DGEQRF on Intel Xeon EMT cores 14 cores 12 cores 10 cores 8 cores 6 cores 4 cores 2 cores 1 core 16xdssrfb-seq Matrix size matrix size and Gflops Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

7 PLASMA DGETRF on IBM Power 6: cores 16 cores 8 cores 4 cores 2 cores 1 core xdssssm-seq Matrix size matrix size and Gflops Courtesy of Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

8 PLASMA DGEQRF on IBM Power 6: cores 16 cores 8 cores 4 cores 2 cores 1 core 32xdssrfb-seq Matrix size matrix size and Gflops Innovative Computing Laboratory at the University of Tennessee (J. Dongarra et al)

9 Background and motivation Poor man s benchmark data DGEMM, DGEQRF and DGEQP3 of MKL 10.2 on Intel Core TM i7 Quad 2.66GHz DGEMM DGEQRF DGEQP DGEMM DGEQRF DGEQP Speed (Gflop/s) (quad-core) scalability (n = 12, 000) Message: pivoting is expensive!

10 Background and motivation Poor man s benchmark data PDGEMM, PDGEQRF and PDGEQPF (pivoted QR) of PBLAS and ScaLAPACK on Cray XT4 (Franklin of NERSC): GFlop/s (n = 12000) N. Proc. PDGEMM PDGEQRF PDGEQPF 2x x x x x x Message: pivoting is even more expensive on hybrid systems!

11 10 3 pdgemm pdgeqrf pdgeqpf GFLOPS x2 4x4 6x6 8x8 12x12 16x16 number of processors

12 Question and the rest of my talk Question: How to re-design higher-level matrix algorithms using fastest matrix opertions? Two Case Studies: 1. Communication-reducing algorithm for Green s function calculation and application in Quantum Monte Carlo (QMC) simulations 2. Optimizing Halley s iteration for computing matrix polar decomposition and application in First Principal Molecular Dynamics (FPMD)

13 Communication reducing algorithm for Green s function calculation joint with Roger Lee Wenbin Chen Richard Scalettar Physics, UC Davis National Tsinghua Univ. Taiwan Fudan University, China

14 Computational Material Science Simulation and understanding properties of solid-state materials: magnetism, metal-insulator transition, high temperature superconductivity,... Many body simulation on multi-layer lattice using Hubbard model and quantum Monte Carlo simulation QUEST (QUantum Electron Simulation Toolbox): Fortran 90 package for determinant (and hybrid) Monte Carlo simulations

15 QUEST kernel: Green s function (matrix) calcuation where G = (I + B L B L 1 B 2 B 1 ) 1 B i = B V i = e τk diag(e ±λ ), K is the hopping matrix, an adjance matrix of the lattice τ is the time discretization parameter. L is the time slice λ = cosh 1 (e U τ/2 ). U is a potential energy parameter for local repulsion between electrons Challenge: B L B L 1 B 2 B 1 is extremely ill-conditioned!

16 Physicist s method [Loh et al 89,... 05] 1. Compute an UDT decomposition of the product where D is stratified (graded) B L B L 1 B 2 B 1 = U L D L T L D = X X... X 2. Compute G = T 1 L (UT LT 1 L + D L) 1 U T L. The pivoted QR decomposition is used for the stratification. There are a number of work on the product of matrices by numerical linear algebraists [Stewart,...]

17 Physicist s method [Loh et al] 1. Compute pivoted QR decomposition: B 1 = Q 1 R 1 P T 1 2. Set U 1 = Q 1 D 1 = diag(r 1 ) T 1 = D 1 1 R 1P T 1 3. For i = 2, 3,,L. Compute C i = (B i U i 1 )D i 1. (culumn scaling) Compute pivoted QR decomposition: C i = Q i R i Pi T. Set U i = Q i D i = diag(r i ) T i = (Di 1 R i )(Pi TT i 1) 4. Compute G = T 1 L (UT L T 1 L + D L) 1 U T L.

18 However, pivoting is very expensive on multicore Intel Core TM Quad 2.4GHz, DGEMM: matrix-matrix multiplication DGEQRF: QR decomposition DGEQP3: pivoted QR decomposition Gflop/s DGEMM DGEQRF DGEQP3 Gflop/s DGEMM DGEQRF DGEQP problem size Speed (Gflop/s) number of processors scalability

19 Pivoting is even more expensive on distributed systems Cray XT4 (Franklin of NERSC) 10 3 pdgemm pdgeqrf pdgeqpf GFLOPS x2 4x4 6x6 8x8 12x12 16x16 number of processors

20 SOF algorithm - without the pivoting Structured orthogonal factorization (SOF) algorithm 1. Set M 2 = I, A 2 = B 1 2. For i = 2, 3,,L (a) compute QR decomposition: [ ] [ Mi = B i (b) update [ Ai+1 M i+1 ] = [ 3. Compute G = (M L + A L ) 1 M L Q (i) 11 Q(i) 12 Q (i) 21 Q(i) 22 ] [ Ri 0 ] (Q (i) 12 )T (Q (i) 22 )T ] [ A (i) I Computational kernel: QR decomposition without pivoting. Price: SOF takes about 3 more flops than Physicist method. ]

21 SOF algorithm motiviation and correctness Motiviation: Connection betweenthe Green s function and the inverse of the following block p-cyclic matrix: where G = (I + B L B L 1 B 2 B 1 ) 1 = (L,L)-subblock of M 1 M = Correctness: by direct verification I B 1 B 2 I B 3 I B L I

22 SOF algorithm correctness By the QR decomposition, we have ( ( T Q (i) Q 22) (i) Bi = 0 12) T Mi ( ) T Since M 2 = I and M i = Q (i 1) 22 for i > 2, the above equations can be rewritten as ( ) T ( ) T Q (2) 12 = Q (2) 22 B2 ( ) T ( ) T ( T = Q 22) (i) Bi for i 3 Q (i) 12 Q (i 1) 22 Note that Q (i) 22 are invertible, ( ) T ( ) T Q (2) 12 = Q (2) 22 B2 ( ) T ( ) T ( = Bi Q (i) 12 Q (i) 22 Q (i 1) 22 ) T for i 3

23 Hence ( ) T Q (L) 12 ( Q (3) 12 ) T ( Q (2) 12 By steps 1 and 2(b) of SOF, ( ) T A L = ( Hence In summary, (M L + A L ) 1 M L = A L = Q (L) 12 ( Q (L) 22 ) T = ( Q (3) 12 [ ( ) T ( Q (L) 22 + Q (L) 22 ) T ( Q (2) ) T BL B 3 B ) T B1. ) T BL B 3 B 2 B 1. Q (L) 22 = (I + B L B 2 B 1 ) 1 = G. ) T BL B 2 B 1 ] 1 ( Q (L) 22 ) T

24 SOF algorithm Implementation QR decomposition in LAPACK: 1. DGEQRF: compute the QR decomposition, Q is stored implicitly 2. DORGQR: forming the full Q-factor explicitly Q = I V TV T, T is an n n upper triangular, and V is 2n n (DLARFT).

25 SOF algorithm Implementation 1 Call DGEQRF for QR decomposition explicitly form the full Q-factor by DORGQR.

26 SOF algorithm Implementation 2 Modify DORGQR to compute Q (i) 12 Let V be comformally partitioned V = and Q(i) 22 directly: [ Vu V d ], (1) Then the right half of the Q-factor can be computed by Q (i) 12 = V utvd T Q (i) 22 = I V dtvd T. Note that V u is a lower triangular matrix, used to reduce the cost of matrix multiplication.

27 SOF algorithm Implementation 3 Modify DGEQRF to compute the T matrix recursively. If T 11 T 12 T 1b T = T 22 T 2b...., T bb DGEQRF produces diagonal blocks T ii in order explicitly. Consider two consecutive block Householder transformations H 1 and H 2. Then H 1 H 2 = (I V 1 T 1 V1 T [ )(I V 2 T 2 V2 T ) T1 T = I (V 1 V 2 ) 1 V1 T V 2 T 2 0 T 2 ][ V T 1 V T 2 The T matrix of the merged Householder transformation can be computed with additional computation of T 1 V T 1 V 2 T 2. ],

28 Performance I Green s matrix on a lattice (n = 1024) CPU elapsed time 1 core 2 cores 4 cores Loh s Impl Impl G = (I + B 96 B 95 B 2 B 1 ) 1 GFlop/s rate: 1 core 2 cores 4 cores Loh s Impl Impl Intel Core TM i7 Quad 2.66GHz, ifort Fortran and MKL

29 Performance II Green s matrix on a lattice (n = 4096) CPU elapsed time No. of Proc. (4x4) (8x8) (16x16) (32x32) Loh s Impl Impl G = (I + B 12 B 11 B 2 B 1 ) 1 GFlop/s rate: No. of Proc. (4x4) (8x8) (16x16) (32x32) Loh s Impl Impl Intel Core TM i7 Quad 2.66GHz, ifort Fortran and MKL

30 Further reading - I Green s function calculation, may appear in different form, is one of fundamental problems in nano-scale simulations. The physical background and related numerical linear algebra problems are discussed in detail in the following reference Z. Bai, W. Chen, R. Scalettar, I. Yamazaki, Numerical Methods for Quantum Monte Carlo Simulations of the Hubbard Model In Multi-Scale Phenomena in Complex Fluids edited by T. Y. Hou et al, Higher Education Press, 2009 (114 pages) A pre-print of the reference is available at bai/bcsy09.pdf

31 Optimized Halley s iteration for computing matrix polar decomposition joint with Yuji Nakatsukasa Applied Math, UC Davis Francois Gygi Applied Science/CS, UC Davis

32 Electronic structure calculations Investigate the properties of solids, liquids, biomolecules, and nanoparticles... First-principles molecular dynamics (FPMD) based on the plane-wave, pseudopotential formalism Qbox: a C++/MPI implementation of FPMD for massively parallel computers (F. Gygi et al)

33 Subspace alignment in Self-consistent iteration Given wave functions Y (t 1) and Y (t) at time steps t 1 and t: Compute QR decomposition Y (t) = Ŷ (t)r (tall and skinny QR) Solve the Procrustes problem min Q Y (t 1) Y (t)q F Compute Ỹ = 2Ŷ (t)q Y (t 1) Obtain trial wavefunction Ŷ (t + 1) for the time step t + 1 by QR: Ỹ (t) = Ŷ (t + 1)R. Ref: [Arias et al 92, Edelman et al 99, Fattebert and Gygi 04]

34 Matrix polar decomposition Given A C n n, compute the polar decomposition where A = UH, U H U = I and H H = H 0 Application: orthogonal procrustes problem in subspace alignment of SCF iteration for solving Kohn-Sham equation in electronic structure calculation: min U X i X i U

35 Newton s iteration Newton s iteration: X k+1 = 1 2 ( ) Xk + X H k, X0 = A. X k U as k Singular values σ i (X k ) 1 as k The convergence is slow when A is ill-conditioned. Scaled Newton s iteration X k+1 = 1 2 ( ζk X k + (ζ k X k ) H), X 0 = A, The inverse X 1 k has to be computed explicitly....

36 Newton s iteration variant Scaled Newton s iteration variant (SNV): X k+1 = 2η k X k ( I + η 2 k X H k X k ) 1, Y0 = η 0 A, Inverse-free implementation: SNV can be implemented using an QR decomposition (discussed later). Unfortunately, even the inverse-free implementation is not numerical stability? discovered by [Crudge 98] and [Byers and Xu 01], independently

37 Halley s iteration Halley s iteration X k+1 = X k (3I + X H k X k )(I + 3X H k X k ) 1, X 0 = A. There is a family of iterations based on rational function approximation Theorem. X k U and the asymptotic rate is cubic. The initial steps could be still slow Consider the 2 2 matrix: X 0 = A = [ The kth Halley s iterate is given by [ ] 1 X k =, x x k = x k 1(3 + x 2 k 1 ) k 1 + 3x 2. k 1 It takes 24 iterations to satisfy 1 X 24 < ].

38 Dynamically weighted Halley s iteration Dynamically weighted Halley s (DWH) iteration X k+1 = X k (a k I + b k X H k X k )(I + c k X H k X k ) 1, where X 0 = A/α and α A 2. Question: how to dynamically choose a k, b k and c k to optimize the convergence rate?

39 How to dynamically determine a k, b k and c k? - first step Let l 0 be chosen such that σ i (X 1 ) = g 0 (σ i (X 0 )), and where [σ min (X 0 ), σ max (X 0 )] [l 0, 1] (0, 1]. [σ min (X 1 ), σ max (X 1 )] [ming 0 (x), maxg 0 (x)]. g 0 (x) = x a 0 + b 0 x c 0 x. 2 Rational optimization problem: find a 0, b 0, c 0 such that (1) 0 < g 0 (x) 1 for l 0 < x 1. { } (2) the max-min property max a 0,b 0,c 0 Update l 1 = min l0 x 1 g 0 (x), such that min g 0(x) l 0 x 1 [σ min (X 1 ), σ max (X 1 )] [l 1, 1] (0, 1]..

40 How to dynamically determine a k, b k and c k? general case At the (k + 1)st iteration, we have and where [σ min (X k ), σ max (X k )] [l k, 1] (0, 1]. [σ min (X k+1 ), σ max (X k+1 )] [ ] min g k(x), max g k(x), l k x 1 l k x 1 g k (x) = x a k + b k x c k x. 2 Rational optimization problem: find a k, b k and c k such that: (1) 0 < g k (x) 1 for l k < x{ 1. } (2) the max-min property max a k,b k,c k Update l k+1 = min lk x 1 g k (x) min g k(x) l k x 1.

41 Solving rational optimization problem For the special three-parameter case, we have developed a direct method to obtain analytic expressions for a k,b k and c k tedious, but doable In general, for high-order Halley s iteration we can first reformulate the problem, and then use a SDP solver, say SeDuMi or YALMIP Applications of rational optimization problem in shape optimization of transfer functions [Nie et al 09]

42 Convergence theorem and number of of iterations Theorem. X k U and the asymptotic rate is cubic. For the 2 2 example, 5 itertions such that 1 X 5 < In general, we have the following theoretical prediction for the number of iterations κ(a) SN DWH

43 Inverse-free implementation Theorem. Given the QR decomposition [ ] [ ] ηa Q1 = R. I Then Q 2 Q 1 Q H 2 = ηa(i + η 2 A H A) 1. [Zha & Zhang 96, Higham 07]

44 Inverse-free implementation - DWH Dynamically weighted Halley s (DWH) iteration X k+1 = X k (a k I + b k X H k X k )(I + c k X H k X k ) 1 = b k c k X k + (a k b k c k ) X k (I + c k X H k X k ) 1 }{{} QR based implementation of DHW: [ ] [ ] ck X k Q1 = R, I Q 2 X k+1 = b k ck X k + 1 ck (a k b k ck )Q 1 Q H 2,

45 Example: verification of theoretical iteration count Diagonal matrices A. Iter numbers and X k I F κ(a) Halley Gander DWH Halley 4.7e e e e e e-16 Gander 7.6e e e e e e-16 DWH 4.9e e e e e e-16

46 Example: Q-DWH and scaled Newton s iteration Random matrices with varying condition numbers: kappa min max min max min max iter Q-DWH SN res Q-DWH 5.3e e e e e e-16 SN 7.4e e e e e e-15 Multicore implementation is progress.

47 Further reading - II The inverse-free implementation of the Halley s iteration is described in the following paper Y. Nakatsukasa, Z. Bai and F. Gygi, Optimizing Halley s iteration for computing the matrix polar decomposition, submitted, 2009 It is available on our class website.

48 Concluding remarks 1. Our case studies illustrate the ideas of designing communication-reducing numerical linear algebra algorithms to balance flops, communication, flops, stability and optimize the performance on modern computer architecture. 2. Hardware trend means the time has come to do! 3. Lots of prior work in numerical linear algebra community, but more open problems 4. Why stops at numerical linear algebra?

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations

Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March

More information

Multi-Length Scale Matrix Computations and Applications in Quantum Mechanical Simulations

Multi-Length Scale Matrix Computations and Applications in Quantum Mechanical Simulations Multi-Length Scale Matrix Computations and Applications in Quantum Mechanical Simulations Zhaojun Bai http://www.cs.ucdavis.edu/ bai joint work with Wenbin Chen, Roger Lee, Richard Scalettar, Ichitaro

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD

Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD Algorithms and Perturbation Theory for Matrix Eigenvalue Problems and the SVD Yuji Nakatsukasa PhD dissertation University of California, Davis Supervisor: Roland Freund Householder 2014 2/28 Acknowledgment

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC

Hybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

Communication-avoiding LU and QR factorizations for multicore architectures

Communication-avoiding LU and QR factorizations for multicore architectures Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators

Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Structured Orthogonal Inversion of Block p-cyclic Matrices on Multicore with GPU Accelerators Sergiy Gogolenko, Zhaojun Bai 2, and Richard Scalettar 2 Donetsk National Technical University, Donetsk, 8300,

More information

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT

More information

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu

Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Communication-avoiding parallel and sequential QR factorizations

Communication-avoiding parallel and sequential QR factorizations Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 3, pp. C307 C330 c 2015 Society for Industrial and Applied Mathematics MIXED-PRECISION CHOLESKY QR FACTORIZATION AND ITS CASE STUDIES ON MULTICORE CPU WITH MULTIPLE GPUS

More information

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11

Matrix Computations: Direct Methods II. May 5, 2014 Lecture 11 Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would

More information

QUEST: QUantum Electron Simulation Toolbox

QUEST: QUantum Electron Simulation Toolbox San Jose State University From the SelectedWorks of Ehsan Khatami July, 2010 QUEST: QUantum Electron Simulation Toolbox C.-R. Lee, National Tsinghua University S. Chiesa, University of California, Davis

More information

Avoiding Communication in Distributed-Memory Tridiagonalization

Avoiding Communication in Distributed-Memory Tridiagonalization Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Sparse BLAS-3 Reduction

Sparse BLAS-3 Reduction Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc

More information

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

Communication avoiding parallel algorithms for dense matrix factorizations

Communication avoiding parallel algorithms for dense matrix factorizations Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006.

LAPACK-Style Codes for Pivoted Cholesky and QR Updating. Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig. MIMS EPrint: 2006. LAPACK-Style Codes for Pivoted Cholesky and QR Updating Hammarling, Sven and Higham, Nicholas J. and Lucas, Craig 2007 MIMS EPrint: 2006.385 Manchester Institute for Mathematical Sciences School of Mathematics

More information

Parallelization of the Molecular Orbital Program MOS-F

Parallelization of the Molecular Orbital Program MOS-F Parallelization of the Molecular Orbital Program MOS-F Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi November 2003 Fujitsu Laboratories of

More information

Advancing Large Scale Many-Body QMC Simulations on GPU Accelerated Multicore Systems

Advancing Large Scale Many-Body QMC Simulations on GPU Accelerated Multicore Systems Advancing Large Scale Many-Body QMC Simulations on GPU Accelerated Multicore Systems Andres Tomas, Chia-Chen Chang, Richard Scalettar and Zhaojun Bai Department of Computer Science, University of California,

More information

A hybrid Hermitian general eigenvalue solver

A hybrid Hermitian general eigenvalue solver Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,

More information

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems G. Hager HPC Services, Computing Center Erlangen, Germany E. Jeckelmann Theoretical Physics, Univ.

More information

ab initio Electronic Structure Calculations

ab initio Electronic Structure Calculations ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab

More information

Sonderforschungsbereich 393

Sonderforschungsbereich 393 Sonderforschungsbereich 393 Parallele Numerische Simulation für Physik und Kontinuumsmechanik Peter Benner Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Solving Stable Sylvester Equations via Rational

More information

Preconditioned Parallel Block Jacobi SVD Algorithm

Preconditioned Parallel Block Jacobi SVD Algorithm Parallel Numerics 5, 15-24 M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.) Chapter 2: Matrix Algebra ISBN 961-633-67-8 Preconditioned Parallel Block Jacobi SVD Algorithm Gabriel Okša 1, Marián Vajteršic

More information

Parallel Eigensolver Performance on High Performance Computers

Parallel Eigensolver Performance on High Performance Computers Parallel Eigensolver Performance on High Performance Computers Andrew Sunderland Advanced Research Computing Group STFC Daresbury Laboratory CUG 2008 Helsinki 1 Summary (Briefly) Introduce parallel diagonalization

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222

Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department

More information

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM

PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value

More information

A communication-avoiding thick-restart Lanczos method on a distributed-memory system

A communication-avoiding thick-restart Lanczos method on a distributed-memory system A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart

More information

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures

Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University

More information

The Future of LAPACK and ScaLAPACK

The Future of LAPACK and ScaLAPACK The Future of LAPACK and ScaLAPACK Jason Riedy, Yozo Hida, James Demmel EECS Department University of California, Berkeley November 18, 2005 Outline Survey responses: What users want Improving LAPACK and

More information

Practicality of Large Scale Fast Matrix Multiplication

Practicality of Large Scale Fast Matrix Multiplication Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

LAPACK-Style Codes for Pivoted Cholesky and QR Updating

LAPACK-Style Codes for Pivoted Cholesky and QR Updating LAPACK-Style Codes for Pivoted Cholesky and QR Updating Sven Hammarling 1, Nicholas J. Higham 2, and Craig Lucas 3 1 NAG Ltd.,Wilkinson House, Jordan Hill Road, Oxford, OX2 8DR, England, sven@nag.co.uk,

More information

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank

More information

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California

More information

Scalable numerical algorithms for electronic structure calculations

Scalable numerical algorithms for electronic structure calculations Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster

More information

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version 1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++

A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ . p.1/21 11 Dec. 2014, LJLL, Paris FreeFem++ workshop A dissection solver with kernel detection for unsymmetric matrices in FreeFem++ Atsushi Suzuki Atsushi.Suzuki@ann.jussieu.fr Joint work with François-Xavier

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Making electronic structure methods scale: Large systems and (massively) parallel computing

Making electronic structure methods scale: Large systems and (massively) parallel computing AB Making electronic structure methods scale: Large systems and (massively) parallel computing Ville Havu Department of Applied Physics Helsinki University of Technology - TKK Ville.Havu@tkk.fi 1 Outline

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano

Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

On aggressive early deflation in parallel variants of the QR algorithm

On aggressive early deflation in parallel variants of the QR algorithm On aggressive early deflation in parallel variants of the QR algorithm Bo Kågström 1, Daniel Kressner 2, and Meiyue Shao 1 1 Department of Computing Science and HPC2N Umeå University, S-901 87 Umeå, Sweden

More information

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning

Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

S N. hochdimensionaler Lyapunov- und Sylvestergleichungen. Peter Benner. Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz

S N. hochdimensionaler Lyapunov- und Sylvestergleichungen. Peter Benner. Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz Ansätze zur numerischen Lösung hochdimensionaler Lyapunov- und Sylvestergleichungen Peter Benner Mathematik in Industrie und Technik Fakultät für Mathematik TU Chemnitz S N SIMULATION www.tu-chemnitz.de/~benner

More information

Roundoff Error. Monday, August 29, 11

Roundoff Error. Monday, August 29, 11 Roundoff Error A round-off error (rounding error), is the difference between the calculated approximation of a number and its exact mathematical value. Numerical analysis specifically tries to estimate

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized

More information

Outline. Recursive QR factorization. Hybrid Recursive QR factorization. The linear least squares problem. SMP algorithms and implementations

Outline. Recursive QR factorization. Hybrid Recursive QR factorization. The linear least squares problem. SMP algorithms and implementations Outline Recursie QR factorization Hybrid Recursie QR factorization he linear least squares problem SMP algorithms and implementations Conclusions Recursion Automatic ariable blocking (e.g., QR factorization)

More information

Spectrum-Revealing Matrix Factorizations Theory and Algorithms

Spectrum-Revealing Matrix Factorizations Theory and Algorithms Spectrum-Revealing Matrix Factorizations Theory and Algorithms Ming Gu Department of Mathematics University of California, Berkeley April 5, 2016 Joint work with D. Anderson, J. Deursch, C. Melgaard, J.

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

CALU: A Communication Optimal LU Factorization Algorithm

CALU: A Communication Optimal LU Factorization Algorithm CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9

More information

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria

1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria 1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation

More information

Direct methods for symmetric eigenvalue problems

Direct methods for symmetric eigenvalue problems Direct methods for symmetric eigenvalue problems, PhD McMaster University School of Computational Engineering and Science February 4, 2008 1 Theoretical background Posing the question Perturbation theory

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

Lecture 8: Fast Linear Solvers (Part 7)

Lecture 8: Fast Linear Solvers (Part 7) Lecture 8: Fast Linear Solvers (Part 7) 1 Modified Gram-Schmidt Process with Reorthogonalization Test Reorthogonalization If Av k 2 + δ v k+1 2 = Av k 2 to working precision. δ = 10 3 2 Householder Arnoldi

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors Aachen Institute for Advanced Study in Computational Engineering Science Preprint: AICES-2010/09-4 23/September/2010 The Algorithm of Multiple Relatively Robust Representations for Multi-Core Processors

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra

Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and

More information

On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations

On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations Max Planck Institute Magdeburg Preprints Martin Köhler Jens Saak On GPU Acceleration of Common Solvers for (Quasi-) Triangular Generalized Lyapunov Equations MAX PLANCK INSTITUT FÜR DYNAMIK KOMPLEXER TECHNISCHER

More information

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions

Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information

The Ongoing Development of CSDP

The Ongoing Development of CSDP The Ongoing Development of CSDP Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu Joseph Young Department of Mathematics New Mexico Tech (Now at Rice University)

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

Susumu YAMADA 1,3 Toshiyuki IMAMURA 2,3, Masahiko MACHIDA 1,3

Susumu YAMADA 1,3 Toshiyuki IMAMURA 2,3, Masahiko MACHIDA 1,3 Dynamical Variation of Eigenvalue Problems in Density-Matrix Renormalization-Group Code PP12, Feb. 15, 2012 1 Center for Computational Science and e-systems, Japan Atomic Energy Agency 2 The University

More information

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems

Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems Algorithm 853: an Efficient Algorithm for Solving Rank-Deficient Least Squares Problems LESLIE FOSTER and RAJESH KOMMU San Jose State University Existing routines, such as xgelsy or xgelsd in LAPACK, for

More information

Notes on Householder QR Factorization

Notes on Householder QR Factorization Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem

More information

2.5D algorithms for distributed-memory computing

2.5D algorithms for distributed-memory computing ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster

More information

HPMPC - A new software package with efficient solvers for Model Predictive Control

HPMPC - A new software package with efficient solvers for Model Predictive Control - A new software package with efficient solvers for Model Predictive Control Technical University of Denmark CITIES Second General Consortium Meeting, DTU, Lyngby Campus, 26-27 May 2015 Introduction Model

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

Lecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University

Lecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University Lecture 13 Stability of LU Factorization; Cholesky Factorization Songting Luo Department of Mathematics Iowa State University MATH 562 Numerical Analysis II ongting Luo ( Department of Mathematics Iowa

More information

Porting a Sphere Optimization Program from lapack to scalapack

Porting a Sphere Optimization Program from lapack to scalapack Porting a Sphere Optimization Program from lapack to scalapack Paul C. Leopardi Robert S. Womersley 12 October 2008 Abstract The sphere optimization program sphopt was originally written as a sequential

More information

Porting a sphere optimization program from LAPACK to ScaLAPACK

Porting a sphere optimization program from LAPACK to ScaLAPACK Porting a sphere optimization program from LAPACK to ScaLAPACK Mathematical Sciences Institute, Australian National University. For presentation at Computational Techniques and Applications Conference

More information

Large Scale Sparse Linear Algebra

Large Scale Sparse Linear Algebra Large Scale Sparse Linear Algebra P. Amestoy (INP-N7, IRIT) A. Buttari (CNRS, IRIT) T. Mary (University of Toulouse, IRIT) A. Guermouche (Univ. Bordeaux, LaBRI), J.-Y. L Excellent (INRIA, LIP, ENS-Lyon)

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Minimizing Communication in Linear Algebra. James Demmel 15 June

Minimizing Communication in Linear Algebra. James Demmel 15 June Minimizing Communication in Linear Algebra James Demmel 15 June 2010 www.cs.berkeley.edu/~demmel 1 Outline What is communication and why is it important to avoid? Direct Linear Algebra Lower bounds on

More information

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation.

A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation. 1 A New Block Algorithm for Full-Rank Solution of the Sylvester-observer Equation João Carvalho, DMPA, Universidade Federal do RS, Brasil Karabi Datta, Dep MSc, Northern Illinois University, DeKalb, IL

More information

Efficient implementation of the overlap operator on multi-gpus

Efficient implementation of the overlap operator on multi-gpus Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator

More information

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION

A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION SIAM J MATRIX ANAL APPL Vol 0, No 0, pp 000 000 c XXXX Society for Industrial and Applied Mathematics A DIVIDE-AND-CONQUER METHOD FOR THE TAKAGI FACTORIZATION WEI XU AND SANZHENG QIAO Abstract This paper

More information

A distributed packed storage for large dense parallel in-core calculations

A distributed packed storage for large dense parallel in-core calculations CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A

More information

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem

Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem Peter Benner, Andreas Marek, Carolin Penke August 16, 2018 ELSI Workshop 2018 Partners: The Problem The Bethe-Salpeter

More information

Algorithms and Methods for Fast Model Predictive Control

Algorithms and Methods for Fast Model Predictive Control Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model

More information

Introduction to communication avoiding linear algebra algorithms in high performance computing

Introduction to communication avoiding linear algebra algorithms in high performance computing Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for

More information