Dynamic Scheduling within MAGMA

Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov
April 5, 2012
Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville

ICL, Knoxville, Tennessee (April 5, 2012, M. Faverge - MAGMA)

Outline
1. Introduction
2. MAGMA
3. Tile Algorithms
4. Tile Algorithms on Top of StarPU
   - From Multicore to Hybrid Architectures
   - Scheduling policy
   - Adapt the kernels to GPUs
   - Tuning: How to choose a good couple (IB, NB)?
5. Conclusion and Future Works

Hardware Trends
- Clock frequency scaling has been replaced by scaling cores per chip
- Scale # cores instead of clock speed
- Multicore - Hybrid
- Hardware issue became software issue
[Figure: transistors (in thousands), frequency (MHz) and number of cores, log scale from 1.E+00 to 1.E+07]
Figure from Kathy Yelick, "Ten Ways to Waste a Parallel Computer". Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.

Future Systems
[Figure: number of GPU-accelerated systems in the Top500]
- Most likely a hybrid design: multicore + GPU accelerators
- Today: accelerators attached
- Future: accelerators integrated
  - Intel's MIC Knights Corner
  - AMD's Fusion
  - Nvidia's Project Denver

Software generations
Software/algorithms follow hardware evolution in time:
- 70s: LINPACK, vector operations: Level-1 BLAS operations
- 80s: LAPACK, blocked, cache-friendly: Level-3 BLAS operations
- 90s: ScaLAPACK, distributed memory: PBLAS, message passing
- 00s: PLASMA, many-core friendly: DAG scheduler, block data layout, some extra kernels; MAGMA, GPU: GPU BLAS
- 10s: What about many cores AND many GPUs AND distributed memory?
[Figure label: critical path]

Current projects
DAGuE (Directed Acyclic Graph Unified Environment)
- Targets multi-GPUs + multi-cores in distributed memory
- Scheduler relying on a parametrized DAG to represent the dependencies
- DPLASMA and DSPARSE libraries
MORSE (Matrices Over Runtime Systems at Exascale)
- Targets multi-GPUs + multi-cores in distributed memory
- Interface to use PLASMA algorithms on top of an external scheduler (StarPU)

MAGMA 1.1
- Linear algebra library for GPUs
- LAPACK column-wise layout
- LAPACK-like C and Fortran interfaces
- CPU and GPU interfaces

MAGMA 1.1
MAGMA BLAS
- GPU-only kernels
- Same interface as CUBLAS
- Improves some of the CUBLAS routines
MAGMA LAPACK
- Mostly GPU computations
- Hybrid kernels: 1 GPU + CPU BLAS, multithreaded or not
- CPU and GPU interfaces
[Figure: matrix-vector multiply performance (Gflop/s) vs. matrix size for magma ssymv, cublas ssymv, magma dsymv and cublas dsymv]

Hybrid kernels in MAGMA
- BLAS-2 / Panel on the CPU
- BLAS-3 / Update on the GPU
[Figure, animated over several slides: Panel, Look ahead, Trailing matrix (A = QA)]

Tile Algorithms (PLASMA)
- Parallelism is brought to the fore
- May require the redesign of linear algebra algorithms
- Remove unnecessary synchronization points
- DAG execution where nodes represent tasks and edges define dependencies between them
- Tile data layout
- Dynamic runtime system environment

Data Layout
- LAPACK: column-major format
- PLASMA: tile format
  - Improves cache locality
  - Simplifies data transfer
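The difference between the two layouts can be sketched with a small conversion routine. This is an illustration only, not PLASMA's actual conversion code: the function name `colmajor_to_tiles`, the column-by-column ordering of the tiles, and the requirement that the matrix dimensions be multiples of the tile size are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical helper (not a PLASMA routine): copy an m-by-n column-major
 * matrix with leading dimension lda into a tile layout.
 * Assumptions: m and n are multiples of nb; tiles are stored one after the
 * other, ordered column by column; each nb-by-nb tile is itself column-major
 * and contiguous in memory. */
void colmajor_to_tiles(const double *a, size_t lda, size_t m, size_t n,
                       size_t nb, double *tiles)
{
    size_t mt = m / nb;                       /* number of tile rows */
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            size_t ti = i / nb, tj = j / nb;  /* which tile holds (i, j) */
            size_t ii = i % nb, jj = j % nb;  /* position inside that tile */
            size_t tile_start = (tj * mt + ti) * nb * nb;
            tiles[tile_start + jj * nb + ii] = a[j * lda + i];
        }
    }
}
```

With this storage each tile occupies one contiguous block of nb * nb values, so a kernel working on a tile touches a compact memory region and a tile can be moved to another memory with a single contiguous copy.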

Tile QR algorithm
- First panel factorization and corresponding updates
- DAG for a 4 × 4 tile matrix
[Figure: DAG whose nodes are GEQRT, ORMQR, TSQRT and TSMQR tasks]

QR - 32x4 tile matrix
Try to enlarge the DAG.

CAQR - 32x4 tile matrix - 16 domains
Try to enlarge the DAG = new algorithm.
How to schedule this problem efficiently?

Dynamic Scheduling
- Conceptually similar to out-of-order processor scheduling
- Dynamic runtime DAG scheduler
- Out-of-order execution flow of fine-grained tasks
- Task scheduling as soon as dependencies are satisfied
- Producer-Consumer
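The idea of releasing a task as soon as its dependencies are satisfied can be sketched with per-task dependency counters. This is a sequential toy model under stated assumptions, not the scheduler used by PLASMA or StarPU: the `task_graph` type, `add_dependency` and `run_graph` are hypothetical names, the graph is limited to `MAX_TASKS` tasks, and "running" a task only records its index.

```c
#define MAX_TASKS 16

/* Hypothetical task graph: succ[t][i] is the i-th successor of task t and
 * npred[t] counts the dependencies of t that are not yet satisfied. */
typedef struct {
    int ntasks;
    int nsucc[MAX_TASKS];
    int succ[MAX_TASKS][MAX_TASKS];
    int npred[MAX_TASKS];
} task_graph;

/* Declare that task `to` consumes something produced by task `from`. */
void add_dependency(task_graph *g, int from, int to)
{
    g->succ[from][g->nsucc[from]++] = to;
    g->npred[to]++;
}

/* Run every task, releasing each one as soon as its last dependency is
 * satisfied. The execution order is written to order[]; the return value is
 * the number of tasks executed. */
int run_graph(task_graph *g, int *order)
{
    int ready[MAX_TASKS];
    int head = 0, tail = 0, done = 0;

    /* Tasks without dependencies are ready immediately. */
    for (int t = 0; t < g->ntasks; t++)
        if (g->npred[t] == 0)
            ready[tail++] = t;

    while (head < tail) {
        int t = ready[head++];
        order[done++] = t;                /* "execute" the task */
        /* The producer finished: notify its consumers. */
        for (int i = 0; i < g->nsucc[t]; i++) {
            int s = g->succ[t][i];
            if (--g->npred[s] == 0)
                ready[tail++] = s;
        }
    }
    return done;
}
```

In a real runtime the ready queue would be shared by several workers, so independent tasks run concurrently and out of submission order; the counter update is the producer-consumer handshake.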

Code example (QR)
    for (k = 0; k < min(MT, NT); k++) {
        /* factor the diagonal tile */
        inserttask(zgeqrt, Akk, INOUT);
        /* apply the reflectors to the tiles right of the diagonal */
        for (n = k+1; n < NT; n++)
            inserttask(zunmqr, Akk, INPUT, Akn, INOUT);
        for (m = k+1; m < MT; m++) {
            /* eliminate tile (m, k) against the diagonal tile */
            inserttask(ztsqrt, Akk, INOUT, Amk, INOUT);
            /* update the trailing tiles of rows k and m */
            for (n = k+1; n < NT; n++)
                inserttask(ztsmqr, Amk, INPUT, Akn, INOUT, Amn, INOUT);
        }
    }
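To get a feel for how many tasks this loop nest submits, the sketch below replays it and counts one task per kernel call instead of inserting it. It assumes that the ztsmqr loop is nested inside the loop over m, as in the standard tile QR algorithm; `qr_task_count` and `count_tile_qr_tasks` are hypothetical names introduced for the illustration.

```c
/* Number of tasks of each kernel type submitted by the tile QR loop nest. */
typedef struct {
    int geqrt;
    int unmqr;
    int tsqrt;
    int tsmqr;
} qr_task_count;

/* Replay the loop nest for an MT-by-NT tile matrix, counting tasks instead
 * of submitting them (assumption: the tsmqr loop sits inside the m loop). */
qr_task_count count_tile_qr_tasks(int MT, int NT)
{
    qr_task_count c = {0, 0, 0, 0};
    int kmax = MT < NT ? MT : NT;

    for (int k = 0; k < kmax; k++) {
        c.geqrt++;                            /* diagonal tile */
        for (int n = k + 1; n < NT; n++)
            c.unmqr++;                        /* tiles right of the diagonal */
        for (int m = k + 1; m < MT; m++) {
            c.tsqrt++;                        /* tile below the diagonal */
            for (int n = k + 1; n < NT; n++)
                c.tsmqr++;                    /* trailing update */
        }
    }
    return c;
}
```

For a 4 × 4 tile matrix this gives 4 geqrt, 6 unmqr, 6 tsqrt and 14 tsmqr tasks, 30 in total. The tsmqr updates dominate as the matrix grows, which is why their placement matters most for performance.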

Tile algorithms for Hybrid architectures
Pros:
- No constraint about computation locality in the DAG description
- No explicit MPI communications
- Easy composition between steps of algorithms
- No need to change the algorithm to take new architectures into account
Cons:
- Lots of algorithms cannot be easily expressed as a DAG
- Tuning may be required to get good performance
- May require identical kernels on CPUs and GPUs (input/output)
- Granularity cannot change easily

Why StarPU?
Pros:
- Memory management
- Task submission system similar to QUARK
- Cost models which analyse kernel efficiency in real time
- Several scheduling strategies available
- Using GPUs is straightforward
- No specific compiler
Cons:
- No MPI support (before v1.0)
- WaR (write-after-read) dependencies require copies by the user
- NUMA support is not optimized yet
- Too many scheduling strategies

From Multicore to Hybrid Architectures
[Figure: PLASMA Architecture]
[Figure: MAGMA Architecture]

How does it work?
- One worker per unit: a CPU, or a CPU/GPU couple
- MAGMA kernels require a GPU AND a CPU
- Streams are not used
- Simple replacement of quark_insert_task by morse_insert_task
- CPU and GPU kernels must take the same inputs and return the same outputs

Impact of the scheduling policy
[Figure: Gflop/s vs. matrix order for HEFT-TMDM-PR, HEFT-TMDM, HEFT-TM-PR, HEFT-TM and GREEDY]

Name          Policy description
greedy        Greedy policy
heft-tm       HEFT based on Task duration Models ($T_{\text{data transfer}} + T_{\text{computation}}$)
heft-tm-pr    heft-tm with data PRefetch
heft-tmdp     heft-tm with remote Data Penalty ($\alpha T_{\text{data transfer}} + T_{\text{computation}}$)
heft-tmdp-pr  heft-tmdp with data PRefetch
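The cost expression used by these policies can be illustrated with a worker-selection sketch: each candidate worker gets an estimated completion time, and the task goes to the worker with the smallest one. This is a simplified model and not StarPU's implementation; the `worker_estimate` structure, the `available_at` term and the function `pick_worker` are assumptions made for the example. Setting `alpha` to 1 corresponds to the plain duration model and a larger `alpha` penalizes remote data.

```c
#include <stddef.h>

/* Hypothetical per-worker estimates for one task (all times in seconds). */
typedef struct {
    double available_at;  /* when the worker finishes its queued tasks */
    double t_transfer;    /* estimated time to bring the task's data over */
    double t_compute;     /* estimated kernel time on this worker */
} worker_estimate;

/* Return the index of the worker with the smallest estimated completion
 * time: available_at + alpha * t_transfer + t_compute.
 * Requires nworkers >= 1. */
size_t pick_worker(const worker_estimate *w, size_t nworkers, double alpha)
{
    size_t best = 0;
    double best_time = w[0].available_at + alpha * w[0].t_transfer
                     + w[0].t_compute;

    for (size_t i = 1; i < nworkers; i++) {
        double t = w[i].available_at + alpha * w[i].t_transfer
                 + w[i].t_compute;
        if (t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

With a CPU that needs 1.0 s and no transfer, and a GPU that needs 0.5 s of compute plus 0.4 s of transfer, `alpha = 1` selects the GPU, while `alpha = 3` keeps the task on the CPU where the data already resides. This is the mechanism by which a data penalty trades some kernel speed for fewer transfers.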

Impact of the data penalty
Impact of the scheduling policy on the total amount of data transfers during sgeqrf, for four matrix orders (the matrix order values and two heft-tm-pr entries are missing from the transcript):

Policy        Data transferred
heft-tm-pr    3.8 GB   57.2 GB   (missing)   (missing)
heft-tmdm-pr  1.9 GB   16.3 GB   25.4 GB     41.6 GB

Scalability
[Figure: Gflop/s vs. matrix order, scalability of sgeqrf and dgeqrf on Opteron-Tesla. Curves, in single and in double precision: 4 GPUs + 16 CPUs, 4 GPUs + 4 CPUs, 3 GPUs + 3 CPUs, 2 GPUs + 2 CPUs, 1 GPU + 1 CPU]
Annotation: + 200 Gflop/s, but 12 cores = 150 Gflop/s

Scalability

Kernel   CPU          GPU          Speedup
sgeqrt   9 Gflops     60 Gflops    6
stsqrt   12 Gflops    67 Gflops    6
sormqr   8.5 Gflops   227 Gflops   27
stsmqr   10 Gflops    285 Gflops   27

Task distribution observed on StarPU:
- sgeqrt: 20% of tasks on GPUs
- stsmqr: 92.5% of tasks on GPUs
Taking advantage of heterogeneity: only do what you are good at.

GPU kernels can be different
[Figure, animated over several slides: Gflop/s vs. matrix order for the Mix, Mix-swp and Mix-trtri variants; panels SGETRF (96, 2496) and DGETRF (64, 2048)]

Auto-Tuning
How to choose IB and NB? Auto-tuning?

How to choose IB and NB?
Run the experiments on a set of different (IB, NB) couples:
- Only 1 GPU
- Matrix size of x NB 5120 on Tesla (10240 on Fermi)
- Record the average performance of the update kernel
- Record the performance of the global factorization

Choice of (IB, NB) for sgetrf (4 Tesla C1050)
[Figure, three panels, animated over several slides: (a) average performance of the update and (b) average performance of the panel, Gflop/s vs. NB, one curve per IB (128, 96, 64 and a fourth value lost in extraction); (c) SGETRF on 4 GPUs, Gflop/s vs. matrix order, for the couples (128, 1280), (32, 1984) and (64, 1984)]

LU on 3 Fermis + 2 Intel hexacores
[Figure: Gflop/s vs. matrix order; panels SGETRF (96, 2496) and DGETRF (64, 2048); legend entries: All 3 GPUs, 2 GPUs, 1 GPU, 12 CPUs]

Summary of the one-sided factorizations
1. Cholesky: 1.3 TFlops vs. 1.1 with static scheduling
2. QR: 1. TFlops vs. 0.8 with static scheduling
3. LU: 1.1 TFlops vs. 0.9 with static scheduling
- StarPU brings in the performance of the CPUs
- The improvement due to the CPUs is higher than the CPUs' theoretical performance

Conclusion
- Fermi gives really good results in double precision
- The problem is the huge difference between CPU and GPU
- Numerical stability issues due to pairwise pivoting in LU
Provides:
- Cholesky, QR, CAQR and LU factorizations and solves
- Subset of the BLAS-3 subroutines
- Cholesky inversion
Future work:
- Move to two-sided factorizations (eigenvalue and singular value problems)
- Move to distributed memory

Useful Links
PLASMA, MAGMA, MORSE, StarPU

Thank you!


Minisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially

More information

Practical Combustion Kinetics with CUDA

Practical Combustion Kinetics with CUDA Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

Package magma. February 15, 2013

Package magma. February 15, 2013 Package magma February 15, 2013 Title Matrix Algebra on GPU and Multicore Architectures Version 0.2.2-1 Date 2010-08-27 Author Brian J Smith Maintainer Brian J Smith

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization

Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Design of Scalable Dense Linear Algebra Libraries for Multithreaded Architectures: the LU Factorization Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan 2, Robert A. van de Geijn 2, and Field
