Mapping Dense LU Factorization on Multicore Supercomputer Nodes


1 Mapping Dense LU Factorization on Multicore Supercomputer Nodes. Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Terry Jones, Laxmikant Kale. University of Illinois Urbana-Champaign and Oak Ridge National Laboratory. May 22, 2012

2-4 Diagram slides introducing the N x N matrix and its L and U blocks.

5-8 Diagram slides: progress of the factorization across the matrix. Legend: active panel block, U block, trailing submatrix block, previously factored block.

9 Process grid with 16 cores

10-19 Diagram slides: the N x N block matrix laid out on the process grid, step by step.

20 Overview. Striding: vary the amount of locality in the active panel. Rotation: vary the degree of parallelism in the row/column. Testbed implemented in Charm++. On a Cray XT5: 67% of peak on 8k cores, using 75% of memory for the matrix.

21 Testbed. Allows easy experimentation with various mappings:

int map(const int coor[2]) {
  int gridx = coor[0] % P;
  int gridy = coor[1] % Q;
  int proc = gridy * P + gridx;
  return proc;
}

// N/b x N/b array of blocks, where b is the block size
CkArrayOptions opts(N/b, N/b);
// set the block-to-processor mapping
opts.setMap(map);
// create the parallel array of blocks
luproxy = CProxy_LUBlock::ckNew(opts);
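A minimal self-contained sketch (my illustration, not part of the talk) of how a map of this form behaves, assuming a hypothetical 4 x 2 process grid (P = 4, Q = 2):

#include <cstdio>

static const int P = 4, Q = 2;   // hypothetical process grid

int map(const int coor[2]) {
  int gridx = coor[0] % P;       // block row folded onto the grid
  int gridy = coor[1] % Q;       // block column folded onto the grid
  return gridy * P + gridx;      // column-major processor index
}

int main() {
  const int blocks[3][2] = {{0, 0}, {5, 3}, {7, 1}};
  for (const auto& b : blocks)
    std::printf("block (%d,%d) -> processor %d\n", b[0], b[1], map(b));
  return 0;
}

For example, block (5, 3) folds to grid position (1, 1) and lands on processor 1 * 4 + 1 = 5.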

22 Striding

23 Diagram: blocks distributed across node 0 through node 3.

24 Rank-1 Update Times (ms). Table columns: Microarchitecture, Cores/socket performing updates, Efficiency (%); rows: Intel Nehalem-EP, AMD Istanbul, IBM Blue Gene/P.

25 Weak-Scaled Single-Node Microbenchmark. Plot: Number of L3 Cache Misses (x1000) vs. Number of Cores Performing Rank-1 Updates; series: Expected Compulsory Misses, Measured Misses.

26 Striding: a range of stride values. Block-cyclic; block-cyclic with stride (s = 2) (layout diagrams).

27 Formulæ: augmenting block-cyclic with a stride s

Block-cyclic:
m1 = x mod P   (x in grid)
n1 = y mod Q   (y in grid)
f_r(x, y, P, Q) = m1 Q + n1   (1)
f_c(x, y, P, Q) = n1 P + m1   (2)

Block-cyclic with striding:
p = floor(y / Q)   (grid y index)
m3 = x mod P   (x in grid)
n3 = y mod s   (y in subgrid)
q = floor((y mod Q) / s)   (subgrid y)
f_s(x, y, P, Q, s) = m3 s + n3 + P s q   (3)

where 1 <= s <= Q and s is a factor of Q.
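Formula (3) transcribed into code, as a sketch only (not the authors' implementation); it assumes non-negative block coordinates, so integer division matches the floor, and that s divides Q as stated above:

// Block-cyclic with stride: processor index for the block at coordinate (x, y).
int f_s(int x, int y, int P, int Q, int s) {
  int m3 = x % P;         // x folded onto the process grid
  int n3 = y % s;         // y within its s-wide subgrid
  int q  = (y % Q) / s;   // which subgrid of the grid column y falls in
  return m3 * s + n3 + P * s * q;
}

Note that s = Q reduces f_s to the row-major map f_r in (1), while s = 1 reduces it to the column-major map f_c in (2), so the stride interpolates between the row-major and column-major endpoints shown in the plot on the next slide.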

28 Plot: GFlop/s/core vs. active panel cores/node, ranging from row major to column major; series for several total core counts (including 528 and 2112 cores).

29 Rotation

30 Process grid with 24 cores

31 (process-grid diagram, continued)

32 Plot: Time (seconds) per active panel step for Wide (8 x 32), Square (16 x 16), and Tall (32 x 8) process grids.

33 Plot: GFlop/s/core vs. process-grid aspect ratio, from tall to wide process grids; series for several total core counts (including 528 and 2112 cores).

34 Rotation: increasing row parallelism. Block-cyclic; block-cyclic with rotation (r = 2) (layout diagrams).

35 Formulæ: augmenting block-cyclic with a rotation r

Block-cyclic:
m1 = x mod P   (x in grid)
n1 = y mod Q   (y in grid)
f_r(x, y, P, Q) = m1 Q + n1   (4)
f_c(x, y, P, Q) = n1 P + m1   (5)

Block-cyclic with rotation:
p = floor(y / Q)   (grid y index)
m2 = (x + p r) mod P   (x in grid)
n2 = y mod Q   (y in grid)
f_rotrow(x, y, P, Q, r) = m2 Q + n2   (6)
f_rotcol(x, y, P, Q, r) = n2 P + m2   (7)
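The rotation maps as code, again a sketch rather than the authors' implementation, assuming non-negative block coordinates so integer division gives the floor:

// Block-cyclic with rotation: shift the row assignment by r for each
// successive panel of Q block-columns.
int f_rotrow(int x, int y, int P, int Q, int r) {
  int p  = y / Q;             // grid y index (which panel of Q columns)
  int m2 = (x + p * r) % P;   // rotated x, folded onto the grid
  int n2 = y % Q;             // y folded onto the grid
  return m2 * Q + n2;
}

int f_rotcol(int x, int y, int P, int Q, int r) {
  int p  = y / Q;
  int m2 = (x + p * r) % P;
  int n2 = y % Q;
  return n2 * P + m2;
}

With r = 0 both functions reduce to the plain block-cyclic maps (4) and (5).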

36 Plot: GFlop/s/core vs. cores performing triangular solves, from less rotation to more rotation; series for several total core counts (including 528 and 2112 cores).

37 Plot: GFlop/s/core vs. cores performing triangular solves / sqrt(#cores); series for several total core counts (including 528 and 2112 cores).

38 Combining striding and rotation. Block-cyclic; block-cyclic with striding and rotation (layout diagrams).

39 Formulæ: augmenting block-cyclic with both parameters r and s

Block-cyclic:
m1 = x mod P   (x in grid)
n1 = y mod Q   (y in grid)
f_r(x, y, P, Q) = m1 Q + n1   (8)
f_c(x, y, P, Q) = n1 P + m1   (9)

Block-cyclic with striding and rotation:
p = floor(y / Q)   (grid y index)
m4 = (x + p r) mod P   (x in grid)
n4 = y mod s   (y in subgrid)
q = floor((y mod Q) / s)   (subgrid y)
f_sr(x, y, P, Q, r, s) = m4 s + n4 + P s q   (10)
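Formula (10) as code, a sketch under the same assumptions as before (non-negative coordinates, s divides Q); this is my illustration, not the authors' code:

// Striding and rotation combined: rotate the row by r per panel of Q
// block-columns, then apply the s-wide subgrid striding to the column.
int f_sr(int x, int y, int P, int Q, int r, int s) {
  int p  = y / Q;             // grid y index
  int m4 = (x + p * r) % P;   // rotated x, folded onto the grid
  int n4 = y % s;             // y within its s-wide subgrid
  int q  = (y % Q) / s;       // subgrid index within the grid column
  return m4 * s + n4 + P * s * q;
}

Setting r = 0 recovers f_s from (3); setting r = 0 and s = Q recovers the plain block-cyclic map (8).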

40 Data Movement Overhead. Table columns: Number of Cores, Factorization Time (seconds), Data Movement Time (seconds), Total Time (seconds), Data Movement Percentage of Total; data movement accounts for 1.9% and 1.0% of total time in the two configurations reported.

41 Conclusion. Performance compared to the best square grid: striding gives an 8% performance increase, rotation a 3% increase, and the two combined an 11% increase, corresponding to a 7% increase in peak performance. With memory usage held constant at about 75% on a Cray XT5: about 67% of peak from 120 cores to 8064 cores. Future work: recursive panel factorization; communication avoiding.

42 Plot: Total Flop/s vs. Number of Cores; series: Theoretical peak on XT5, Weak scaling on XT5, Theoretical peak on BG/P, Strong scaling on BG/P.
