Mapping Dense LU Factorization on Multicore Supercomputer Nodes
|
|
- Bathsheba Atkinson
- 5 years ago
- Views:
Transcription
1 Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, nshu rya, erry Jones, Laxmikant Kale niversity Oak of Illinois rbana-champaign Ridge National Laboratory May 22, 2012
2 N N L Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 2/42
3 Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 3/42
4 Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 4/42
5 ctive panel block block railing submatrix block Previously factored block Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 5/42
6 ctive panel block block railing submatrix block Previously factored block Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 6/42
7 ctive panel block block railing submatrix block Previously factored block Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 7/42
8 ctive panel block block railing submatrix block Previously factored block Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 8/42
9 Process grid with 16 cores Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 9/42
10 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 10/42
11 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 11/42
12 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 12/42
13 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 13/42
14 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 14/42
15 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 15/42
16 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 16/42
17 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 17/42
18 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 18/42
19 N N Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 19/42
20 Overview Striding: vary the amount of locality in the active panel Rotation: vary the degree of parallelism in the row/column estbed implemented in Charm++ Cray X5: 67% of peak for 8k cores utilizing 75% of memory for the matrix Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 20/42
21 estbed llows easy experimentation with various mappings int map(const int coor[2]) { int gridx = coor[0] % P; int gridy = coor[1] % Q; int proc = gridy P + gridx; return proc; } // N/b x N/b array of blocks, b is block size CkrrayOptions opts(n/b, N/b); // set block to processor mapping opts.setmap(map); // create parallel array luproxy = CProxy_LBlock::ckNew(opts); Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 21/42
22 Striding Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 22/42
23 node node node node 3 Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 23/42
24 Rank-1 pdate imes (ms) Microarchitecture Cores/socket performing updates Efficiency Intel Nehalem-EP % MD Istanbul % IBM Blue Gene/P % Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 24/42
25 Weak-Scaled Single-Node Microbenchmark Number of L3 Cache Misses (x1000) Expected Compulsory Misses Measured Misses Number of Cores Performing Rank-1 pdates Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 25/42
26 Striding: a range of values for u Block-cylic: Block-cylic with stride (s = 2): Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 26/42
27 Formulæ: augmenting block-cyclic with a stride s Block-cylic: m 1 = x mod P (x in grid) n 1 = y mod Q (y in grid) f r(x, y, P, Q) = m 1 Q + n 1 (1) f c(x, y, P, Q) = n 1 P + m 1 (2) Block-cylic with striding: y p = Q m 3 =x mod P (grid y index) (x in grid) n 3 =y mod s (y in subgrid) y mod Q q = (subgrid y) s f s(x, y, P, Q, s) =m 3 s + n 3 + P sq (3) where 1 s Q and s is a factor of Q. Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 27/42
28 cores 528 cores 2112 cores 6 GFlop/s/core row major ctive panel cores/node column major Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 28/42
29 Rotation Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 29/42
30 Process grid with 24 cores Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 30/42
31 Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 31/42
32 7 6 Wide (8 x 32) Process Grid Square (16 x 16) Process Grid all (32 x 8) Process Grid 5 ime (seconds) ctive Panel Step Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 32/42
33 cores 528 cores 2112 cores 6 GFlop/s/core tall process grid spect ratio wide process grid Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 33/42
34 Rotation: increasing row parallelism Block-cylic: Block-cylic with rotation (r = 2): Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 34/42
35 Formulæ: augmenting block-cyclic with a rotation r Block-cylic: m 1 = x mod P (x in grid) n 1 = y mod Q (y in grid) f r(x, y, P, Q) = m 1 Q + n 1 (4) f c(x, y, P, Q) = n 1 P + m 1 (5) Block-cylic with rotation: y p = Q m 2 = (x + pr) mod P n 2 = y mod Q (grid y index) (x in grid) (y in grid) f rotrow (x, y, P, Q, r) = m 2 Q + n 2 (6) f rotcol (x, y, P, Q, r) = n 2 P + m 2 (7) Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 35/42
36 cores 528 cores 2112 cores 6 GFlop/s/core less rotation more rotation Cores performing triangular solves Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 36/42
37 cores 528 cores 2112 cores 6 GFlop/s/core Cores performing triangular solves / sqrt(#cores) Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 37/42
38 Combining striding and rotation Block-cylic: Block-cylic with striding and rotation: Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 38/42
39 Formulæ: augmenting parameters r and s Block-cylic: m 1 = x mod P (x in grid) n 1 = y mod Q (y in grid) f r(x, y, P, Q) = m 1 Q + n 1 (8) f c(x, y, P, Q) = n 1 P + m 1 (9) Block-cylic with striding and rotation: y p = Q m 4 =(x + pr) mod P n 4 =y mod s y mod Q q = s (grid y index) (x in grid) (y in grid) (subgrid y) f sr(x, y, P, Q, r, s) =m 4 s + n 4 + P sq (10) Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 39/42
40 Data Movement Overhead Number of Cores Factorization ime (seconds) Data Movement ime (seconds) otal ime (seconds) Data Movement Percentage of otal 1.9% 1.0% Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 40/42
41 Conclusion Performance compared to best square grid Striding: 8% performance increase Rotation: 3% performance increase Combined: 11% performance increase, corresponding to a 7% increase in peak performance With memory usage constant at about 75% on Cray X5: Future work bout 67% of peak from 120 cores to 8064 cores Recursive panel factorization Communication-avoiding Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 41/42
42 100 heoretical peak on X5 Weak scaling on X5 heoretical peak on BG/P Strong scaling on BG/P 10 otal Flop/s Number of Cores Mapping Dense L Factorization on Multicore Supercomputer Nodes Jonathan Lifflander (IC) 42/42
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationSymmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano
Symmetric Pivoting in ScaLAPACK Craig Lucas University of Manchester Cray User Group 8 May 2006, Lugano Introduction Introduction We wanted to parallelize a serial algorithm for the pivoted Cholesky factorization
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationCache Oblivious Stencil Computations
Cache Oblivious Stencil Computations S. HUNOLD J. L. TRÄFF F. VERSACI Lectures on High Performance Computing 13 April 2015 F. Versaci (TU Wien) Cache Oblivious Stencil Computations 13 April 2015 1 / 19
More informationCommunication-avoiding LU and QR factorizations for multicore architectures
Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationTall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures LAPACK Working Note - 222 Bilel Hadri 1, Hatem Ltaief 1, Emmanuel Agullo 1, and Jack Dongarra 1,2,3 1 Department
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationHYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017
HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher
More informationc 2016 Society for Industrial and Applied Mathematics
SIAM J SCI COMPUT Vol 38, No 6, pp C748 C781 c 2016 Society for Industrial and Applied Mathematics PARALLEL MATRIX MULTIPLICATION: A SYSTEMATIC JOURNEY MARTIN D SCHATZ, ROBERT A VAN DE GEIJN, AND JACK
More informationScalable numerical algorithms for electronic structure calculations
Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster
More informationOutline. Recursive QR factorization. Hybrid Recursive QR factorization. The linear least squares problem. SMP algorithms and implementations
Outline Recursie QR factorization Hybrid Recursie QR factorization he linear least squares problem SMP algorithms and implementations Conclusions Recursion Automatic ariable blocking (e.g., QR factorization)
More informationCyclops Tensor Framework
Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r
More informationThe Design and Implementation of the Parallel. Out-of-core ScaLAPACK LU, QR and Cholesky. Factorization Routines
The Design and Implementation of the Parallel Out-of-core ScaLAPACK LU, QR and Cholesky Factorization Routines Eduardo D Azevedo Jack Dongarra Abstract This paper describes the design and implementation
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at
More informationMatrix Computations: Direct Methods II. May 5, 2014 Lecture 11
Matrix Computations: Direct Methods II May 5, 2014 ecture Summary You have seen an example of how a typical matrix operation (an important one) can be reduced to using lower level BS routines that would
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationModeling and Tuning Parallel Performance in Dense Linear Algebra
Modeling and Tuning Parallel Performance in Dense Linear Algebra Initial Experiences with the Tile QR Factorization on a Multi Core System CScADS Workshop on Automatic Tuning for Petascale Systems Snowbird,
More informationDirect and Incomplete Cholesky Factorizations with Static Supernodes
Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)
More informationDense LU factorization and its error analysis
Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationTile QR Factorization with Parallel Panel Processing for Multicore Architectures
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures Bilel Hadri, Hatem Ltaief, Emmanuel Agullo, Jack Dongarra Department of Electrical Engineering and Computer Science, University
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationParallel Multivariate SpatioTemporal Clustering of. Large Ecological Datasets on Hybrid Supercomputers
Parallel Multivariate SpatioTemporal Clustering of Large Ecological Datasets on Hybrid Supercomputers Sarat Sreepathi1, Jitendra Kumar1, Richard T. Mills2, Forrest M. Hoffman1, Vamsi Sripathi3, William
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform
CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationScalable, Robust, Fault-Tolerant Parallel QR Factorization. Camille Coti
Communication-avoiding Fault-tolerant Scalable, Robust, Fault-Tolerant Parallel Factorization Camille Coti LIPN, CNRS UMR 7030, SPC, Université Paris 13, France DCABES August 24th, 2016 1 / 21 C. Coti
More informationParallel Eigensolver Performance on High Performance Computers 1
Parallel Eigensolver Performance on High Performance Computers 1 Andrew Sunderland STFC Daresbury Laboratory, Warrington, UK Abstract Eigenvalue and eigenvector computations arise in a wide range of scientific
More informationQR Decomposition in a Multicore Environment
QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance
More informationParallel Integer Polynomial Multiplication Changbo Chen, Svyatoslav Parallel Integer Covanov, Polynomial FarnamMultiplication
Parallel Integer Polynomial Multiplication Parallel Integer Polynomial Multiplication Changbo Chen 1 Svyatoslav Covanov 2,3 Farnam Mansouri 2 Marc Moreno Maza 2 Ning Xie 2 Yuzhen Xie 2 1 Chinese Academy
More informationANALYTICAL MATHEMATICS FOR APPLICATIONS 2018 LECTURE NOTES 3
ANALYTICAL MATHEMATICS FOR APPLICATIONS 2018 LECTURE NOTES 3 ISSUED 24 FEBRUARY 2018 1 Gaussian elimination Let A be an (m n)-matrix Consider the following row operations on A (1) Swap the positions any
More informationThe Performance Evolution of the Parallel Ocean Program on the Cray X1
The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott
More informationHigh Performance Dense Linear System Solver with Resilience to Multiple Soft Errors
Procedia Computer Science Procedia Computer Science 00 (2012) 1 10 International Conference on Computational Science, ICCS 2012 High Performance Dense Linear System Solver with Resilience to Multiple Soft
More informationTowards Highly Parallel and Compute-Bound Computation of Eigenvectors of Matrices in Schur Form
Towards Highly Parallel and Compute-Bound Computation of Eigenvectors of Matrices in Schur Form Björn Adlerborn, Carl Christian Kjelgaard Mikkelsen, Lars Karlsson, Bo Kågström May 30, 2017 Abstract In
More informationPerformance of WRF using UPC
Performance of WRF using UPC Hee-Sik Kim and Jong-Gwan Do * Cray Korea ABSTRACT: The Weather Research and Forecasting (WRF) model is a next-generation mesoscale numerical weather prediction system. We
More informationDownloaded 08/28/12 to Redistribution subject to SIAM license or copyright; see
SIAM J. MATRIX ANAL. APPL. Vol. 26, No. 3, pp. 621 639 c 2005 Society for Industrial and Applied Mathematics ROW MODIFICATIONS OF A SPARSE CHOLESKY FACTORIZATION TIMOTHY A. DAVIS AND WILLIAM W. HAGER Abstract.
More informationParallel Eigensolver Performance on High Performance Computers
Parallel Eigensolver Performance on High Performance Computers Andrew Sunderland Advanced Research Computing Group STFC Daresbury Laboratory CUG 2008 Helsinki 1 Summary (Briefly) Introduce parallel diagonalization
More informationNOAA Research and Development High Performance Compu3ng Office Craig Tierney, U. of Colorado at Boulder Leslie Hart, NOAA CIO Office
A survey of performance characteris3cs of NOAA s weather and climate codes across our HPC systems NOAA Research and Development High Performance Compu3ng Office Craig Tierney, U. of Colorado at Boulder
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationSCALABLE HYBRID SPARSE LINEAR SOLVERS
The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi
More informationFault Tolerance & Reliability CDA Chapter 2 Cyclic Polynomial Codes
Fault Tolerance & Reliability CDA 5140 Chapter 2 Cyclic Polynomial Codes - cylic code: special type of parity check code such that every cyclic shift of codeword is a codeword - for example, if (c n-1,
More informationab initio Electronic Structure Calculations
ab initio Electronic Structure Calculations New scalability frontiers using the BG/L Supercomputer C. Bekas, A. Curioni and W. Andreoni IBM, Zurich Research Laboratory Rueschlikon 8803, Switzerland ab
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationPerformance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville
Performance of the fusion code GYRO on three four generations of Crays Mark Fahey mfahey@utk.edu University of Tennessee, Knoxville Contents Introduction GYRO Overview Benchmark Problem Test Platforms
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationPractical performance portability in the Parallel Ocean Program (POP)
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/227606785 Practical performance portability in the Parallel Ocean Program (POP) Article in
More informationHPC and High-end Data Science
HPC and High-end Data Science for the Power Grid Alex Pothen August 3, 2018 Outline High-end Data Science 1 3 Data Anonymization Contingency Analysis 2 4 Parallel Oscillation Monitoring 2 / 22 PMUs in
More informationA distributed packed storage for large dense parallel in-core calculations
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2007; 19:483 502 Published online 28 September 2006 in Wiley InterScience (www.interscience.wiley.com)..1119 A
More informationVector Lane Threading
Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program
More informationStrassen s Algorithm for Tensor Contraction
Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,
More informationLecture 19. Architectural Directions
Lecture 19 Architectural Directions Today s lecture Advanced Architectures NUMA Blue Gene 2010 Scott B. Baden / CSE 160 / Winter 2010 2 Final examination Announcements Thursday, March 17, in this room:
More informationGPU Computing Activities in KISTI
International Advanced Research Workshop on High Performance Computing, Grids and Clouds 2010 June 21~June 25 2010, Cetraro, Italy HPC Infrastructure and GPU Computing Activities in KISTI Hongsuk Yi hsyi@kisti.re.kr
More informationAnalysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing
Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing Prasanna Balaprakash, Leonardo A. Bautista Gomez, Slim Bouguerra, Stefan M. Wild, Franck Cappello, and Paul D. Hovland
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationEnvironment (Parallelizing Query Optimization)
Advanced d Query Optimization i i Techniques in a Parallel Computing Environment (Parallelizing Query Optimization) Wook-Shin Han*, Wooseong Kwak, Jinsoo Lee Guy M. Lohman, Volker Markl Kyungpook National
More informationMSC HPC Infrastructure Update. Alain St-Denis Canadian Meteorological Centre Meteorological Service of Canada
MSC HPC Infrastructure Update Alain St-Denis Canadian Meteorological Centre Meteorological Service of Canada Outline HPC Infrastructure Overview Supercomputer Configuration Scientific Direction 2 IT Infrastructure
More informationON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING
ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING JACK DONGARRA UNIVERSITY OF TENNESSEE OAK RIDGE NATIONAL LAB What Is LINPACK? LINPACK is a package of mathematical
More informationShortest paths with negative lengths
Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no
More informationNuclear Physics and Computing: Exascale Partnerships. Juan Meza Senior Scientist Lawrence Berkeley National Laboratory
Nuclear Physics and Computing: Exascale Partnerships Juan Meza Senior Scientist Lawrence Berkeley National Laboratory Nuclear Science and Exascale i Workshop held in DC to identify scientific challenges
More information5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)
5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h
More informationCommunication Avoiding LU Factorization using Complete Pivoting
Communication Avoiding LU Factorization using Complete Pivoting Implementation and Analysis Avinash Bhardwaj Department of Industrial Engineering and Operations Research University of California, Berkeley
More informationPerformance optimization of WEST and Qbox on Intel Knights Landing
Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University
More informationParallel Sparse Tensor Decompositions using HiCOO Format
Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline
More informationDirect solution methods for sparse matrices. p. 1/49
Direct solution methods for sparse matrices p. 1/49 p. 2/49 Direct solution methods for sparse matrices Solve Ax = b, where A(n n). (1) Factorize A = LU, L lower-triangular, U upper-triangular. (2) Solve
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationPopulation Estimation: Using High-Performance Computing in Statistical Research. Craig Finch Zia Rehman
Population Estimation: Using High-Performance Computing in Statistical Research Craig Finch Zia Rehman Statistical estimation Estimated Value Confidence Interval Actual Value Estimator: a rule for finding
More informationReview for the Midterm Exam
Review for the Midterm Exam 1 Three Questions of the Computational Science Prelim scaled speedup network topologies work stealing 2 The in-class Spring 2012 Midterm Exam pleasingly parallel computations
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training
More informationBalanced Dense Polynomial Multiplication on Multicores
Balanced Dense Polynomial Multiplication on Multicores Yuzhen Xie SuperTech Group, CSAIL MIT joint work with Marc Moreno Maza ORCCA, UWO ACA09, Montreal, June 26, 2009 Introduction Motivation: Multicore-enabling
More informationParallel Sparse LU Factorization on Second-class Message Passing Platforms
Parallel Sparse LU Factorization on Second-class Message Passing Platforms Kai Shen kshen@cs.rochester.edu Department of Computer Science, University of Rochester ABSTRACT Several message passing-based
More informationConstruction of low complexity Array based Quasi Cyclic Low density parity check (QC-LDPC) codes with low error floor
Construction of low complexity Array based Quasi Cyclic Low density parity check (QC-LDPC) codes with low error floor Pravin Salunkhe, Prof D.P Rathod Department of Electrical Engineering, Veermata Jijabai
More informationEfficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 2, FEBRUARY 1998 109 Efficient Sparse LU Factorization with Partial Pivoting on Distributed Memory Architectures Cong Fu, Xiangmin Jiao,
More informationSparse linear solvers
Sparse linear solvers Laura Grigori ALPINES INRIA and LJLL, UPMC On sabbatical at UC Berkeley March 2015 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Sparse Cholesky
More informationFAS and Solver Performance
FAS and Solver Performance Matthew Knepley Mathematics and Computer Science Division Argonne National Laboratory Fall AMS Central Section Meeting Chicago, IL Oct 05 06, 2007 M. Knepley (ANL) FAS AMS 07
More informationMemory-Access Optimization of Parallel Molecular Dynamics Simulation via Dynamic Data Reordering
Memory-Access Optimization of Parallel Molecular Dynamics Simulation via Dynamic Data Reordering Manaschai Kunaseth, Ken-ichi Nomura, Hikmet Dursun, Rajiv K. Kalia, Aiichiro Nakano, and Priya Vashishta
More informationEfficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation
Efficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation Ryoko Hayashi and Susumu Horiguchi School of Information Science, Japan Advanced Institute of Science
More informationBinding Performance and Power of Dense Linear Algebra Operations
10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique
More informationEEC 281 VLSI Digital Signal Processing Notes on the RRI-FFT
EEC 281 VLSI Digital Signal Processing Notes on the RRI-FFT Bevan Baas The attached description of the RRI-FFT is taken from a chapter describing the cached-fft algorithm and there are therefore occasional
More informationOn The Energy Complexity of Parallel Algorithms
On The Energy Complexity of Parallel Algorithms Vijay Anand Korthikanti Department of Computer Science University of Illinois at Urbana Champaign vkortho2@illinois.edu Gul Agha Department of Computer Science
More informationKey words. numerical linear algebra, direct methods, Cholesky factorization, sparse matrices, mathematical software, matrix updates.
ROW MODIFICATIONS OF A SPARSE CHOLESKY FACTORIZATION TIMOTHY A. DAVIS AND WILLIAM W. HAGER Abstract. Given a sparse, symmetric positive definite matrix C and an associated sparse Cholesky factorization
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationMaking electronic structure methods scale: Large systems and (massively) parallel computing
AB Making electronic structure methods scale: Large systems and (massively) parallel computing Ville Havu Department of Applied Physics Helsinki University of Technology - TKK Ville.Havu@tkk.fi 1 Outline
More informationApplications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices
Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.
More informationPerformance and Application of Observation Sensitivity to Global Forecasts on the KMA Cray XE6
Performance and Application of Observation Sensitivity to Global Forecasts on the KMA Cray XE6 Sangwon Joo, Yoonjae Kim, Hyuncheol Shin, Eunhee Lee, Eunjung Kim (Korea Meteorological Administration) Tae-Hun
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers
ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers Victor Yu and the ELSI team Department of Mechanical Engineering & Materials Science Duke University Kohn-Sham Density-Functional
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More information