Saving Energy in Sparse and Dense Linear Algebra Computations


Saving Energy in Sparse and Dense Linear Algebra Computations
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca
Univ. Politécnica de Valencia (Spain); Univ. Jaume I de Castellón (Spain); The Univ. of Texas at Austin (TX)
quintana@icc.uji.es
SIAM PPSC, Savannah (GA), Feb. 2012

Introduction and motivation

Motivation: reduce energy consumption!
- Energy costs over the lifetime of an HPC facility often exceed the acquisition costs
- Carbon dioxide emissions are a risk for health and the environment
- Heat reduces hardware reliability

Personal view:
- Hardware features mechanisms and modes to save energy
- Scientific applications are, in general, energy-oblivious


Motivation (cont'd)
- Scheduling of task-parallel linear algebra algorithms
  - Examples: Cholesky, QR and LU factorizations, etc.; ILUPACK
- Energy-saving tools available for multi-core processors
  - Example: Dynamic Voltage and Frequency Scaling (DVFS)
- Scheduling of tasks + DVFS → energy-aware execution on multi-core processors
- Two scheduling strategies:
  - SRA (Slack Reduction Algorithm): reduce the frequency of cores that will execute non-critical tasks to decrease idle times without sacrificing the total performance of the algorithm
  - RIA (Race-to-Idle Algorithm): execute all tasks at the highest frequency to enjoy longer inactive periods


Outline
1. LU factorization with partial pivoting
   1.1 Slack Reduction Algorithm
   1.2 Race-to-Idle Algorithm
   1.3 Simulation
   1.4 Experimental results
2. ILUPACK
   2.1 Race-to-Idle Algorithm
   2.2 Experimental results
3. Conclusions

1. LU Factorization with Partial Pivoting (LUPP)

LU factorization:
- Factor A = LU, with L ∈ R^{n×n} unit lower triangular and U ∈ R^{n×n} upper triangular
- For numerical stability, permutations are introduced to prevent operating with small pivot elements:
  PA = LU, with P ∈ R^{n×n} a permutation matrix that interchanges rows of A

Blocked algorithm for LUPP (no pivoting, for simplicity), for a matrix of s × s blocks:

    for k = 1 : s do
      A_{k:s,k} = L_{k:s,k} U_{kk}                        → LU factorization (task G)
      for j = k + 1 : s do
        A_{kj} := L_{kk}^{-1} A_{kj}                      → triangular solve (task T)
        A_{k+1:s,j} := A_{k+1:s,j} − A_{k+1:s,k} A_{kj}   → matrix-matrix product (task M)
      end for
    end for

[Figure: DAG of tasks G, T and M and their dependencies for a matrix of 5 × 5 blocks (s = 5)]
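
A minimal NumPy sketch of this blocked right-looking LU (without pivoting, as in the slide) may make the task types concrete; the function names, the block size parameter b and the sanity-check comment are illustrative assumptions, not part of the codes referenced in the talk (LAPACK, libflame).

    import numpy as np
    from scipy.linalg import solve_triangular

    def panel_lu_nopiv(A, k, kb):
        # Unblocked LU of the column panel A[k:, k:k+kb], no pivoting (task G).
        for i in range(k, k + kb):
            A[i + 1:, i] /= A[i, i]
            A[i + 1:, i + 1:k + kb] -= np.outer(A[i + 1:, i], A[i, i + 1:k + kb])

    def blocked_lu_nopiv(A, b):
        # Right-looking blocked LU without pivoting; A is overwritten by L\U.
        n = A.shape[0]
        for k in range(0, n, b):
            kb = min(b, n - k)
            panel_lu_nopiv(A, k, kb)                       # G: panel factorization
            if k + kb < n:
                L_kk = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
                # T: triangular solves  A_kj := L_kk^{-1} A_kj
                A[k:k + kb, k + kb:] = solve_triangular(
                    L_kk, A[k:k + kb, k + kb:], lower=True, unit_diagonal=True)
                # M: trailing update  A_{k+1:s,j} -= A_{k+1:s,k} A_kj
                A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
        return A

    # Illustrative use on a diagonally dominant matrix (so no pivoting is needed):
    # A = np.random.rand(1024, 1024) + 1024 * np.eye(1024)
    # LU = blocked_lu_nopiv(A.copy(), b=256)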


1.1 Slack Reduction Algorithm

Strategy: search for slacks (idle periods) in the DAG associated with the algorithm and try to minimize them, applying e.g. DVFS.

1. Search for slacks via the Critical Path Method (CPM) on the DAG of dependencies:
   - Nodes → tasks; edges → dependencies
   - ES_i / LF_i: earliest and latest times at which task T_i, with cost C_i, can start/finish without increasing the total execution time of the algorithm
   - S_i: slack (time) by which task T_i can be delayed without increasing the total execution time
   - Critical path: collection of directly connected tasks, from the initial to the final node of the graph, with total slack = 0
2. Minimize the slack of tasks with S_i > 0 by reducing their execution frequency (the SRA)
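
A compact sketch of the forward/backward passes of CPM over such a task DAG, assuming task costs and edges are given as plain Python dictionaries/lists (the toy task names in the comment are made up):

    from collections import defaultdict
    from graphlib import TopologicalSorter

    def cpm(cost, deps):
        # cost: dict task -> execution time C_i
        # deps: list of edges (u, v) meaning task v depends on task u
        # Returns earliest start ES, latest finish LF and slack S per task.
        succ, pred = defaultdict(list), defaultdict(list)
        for u, v in deps:
            succ[u].append(v)
            pred[v].append(u)
        order = list(TopologicalSorter({t: pred[t] for t in cost}).static_order())

        ES = {t: 0.0 for t in cost}          # forward pass: earliest start times
        for t in order:
            for s in succ[t]:
                ES[s] = max(ES[s], ES[t] + cost[t])
        makespan = max(ES[t] + cost[t] for t in cost)

        LF = {t: makespan for t in cost}     # backward pass: latest finish times
        for t in reversed(order):
            for s in succ[t]:
                LF[t] = min(LF[t], LF[s] - cost[s])

        S = {t: LF[t] - cost[t] - ES[t] for t in cost}
        return ES, LF, S

    # Toy usage (made-up names/costs, not the LUPP DAG):
    # ES, LF, S = cpm({"G11": 3, "T21": 1, "M21": 2, "G22": 3},
    #                 [("G11", "T21"), ("T21", "M21"), ("M21", "G22")])
    # Tasks with S[t] == 0 form the critical path.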


Application of CPM to the DAG of the LUPP of a 5 × 5 blocked matrix:

[Figure/table: the DAG annotated, for each task G_kk, T_ij, M_ij, with its cost C, earliest start ES, latest finish LF and slack S]

- The duration (cost C) of tasks of type G and M depends on the iteration!
- Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time.
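
As a hedged illustration of the last point, the sketch below estimates a task's duration from an approximate, assumed flop count for its type and a measured time-per-flop; the exact flop constants depend on the panel shape and counting convention and are not taken from the slides.

    def task_duration(task_type, k, s, b, time_per_flop):
        # task_type: "G" (panel LU), "T" (triangular solve), "M" (matrix-matrix product)
        # k: iteration (1-based), s: number of block rows/columns, b: block size
        # time_per_flop: dict mapping task type -> measured seconds per flop
        # The flop formulas are order-of-magnitude assumptions for a right-looking blocked LU.
        if task_type == "G":            # LU of the (s - k + 1)b x b column panel
            m = (s - k + 1) * b
            flops = m * b * b - (b ** 3) / 3.0
        elif task_type == "T":          # triangular solve on a b x b block
            flops = b ** 3
        else:                           # "M": rank-b update of a b x b block
            flops = 2.0 * b ** 3
        return flops * time_per_flop[task_type]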

Slack Reduction Algorithm:
1. Frequency assignment: set initial frequencies
2. Critical subpath extraction
3. Slack reduction

Step 1 — Frequency assignment:

[Figure: LUPP DAG with all tasks initially annotated with the highest frequency]

- Discrete range of frequencies: {2.00, 1.50, 1.20, 1.00, 0.80} GHz
- The duration of tasks of type G and M depends on the iteration
- Initially, all tasks are set to run at the highest frequency: 2.00 GHz

Step 2 — Critical subpath extraction: identify critical subpaths

[Figure: LUPP DAG with the critical path and critical subpaths highlighted]

Procedure:
1. Identify and extract the critical (sub)path
2. Eliminate the graph nodes/edges that belong to it
3. Repeat the process until the graph is empty

Results: critical path CP_0, plus 3 critical subpaths (from longest to shortest): CP_1 > CP_2 > CP_3

Step 3 — Slack reduction: determine the execution frequency for each task

[Figure: LUPP DAG annotated with the frequency selected for each task]

Procedure:
1. Reduce the frequency of the tasks of the longest unprocessed critical subpath
2. Check the lowest frequency reduction ratio for each task of that subpath
3. Repeat the process until all subpaths have been processed

Results: the tasks of CP_1, CP_2 and CP_3 are assigned to run at 1.5 GHz!
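
A possible sketch of this slack-reduction step, assuming task durations scale inversely with frequency and that the slack available to each subpath is known from CPM; the function and variable names are assumptions, not the authors' implementation.

    def slack_reduction(subpaths, cost, slack, freqs, f_max):
        # subpaths: critical subpaths (lists of tasks), longest first
        # cost:     dict task -> duration at f_max
        # slack:    dict subpath index -> total slack available to that subpath
        # freqs:    available discrete frequencies in descending order,
        #           e.g. [2.00, 1.50, 1.20, 1.00, 0.80] GHz
        assignment = {}
        for i, path in enumerate(subpaths):
            base = sum(cost[t] for t in path)         # subpath duration at f_max
            chosen = f_max
            for f in freqs:                           # descending frequencies
                extra = base * (f_max / f) - base     # slowdown w.r.t. f_max
                if extra <= slack[i]:
                    chosen = f                        # lowest frequency that still fits
            for t in path:
                assignment[t] = chosen
        return assignment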

1.2 Race-to-Idle Algorithm

Race-to-Idle: complete the execution as soon as possible by running the tasks of the algorithm at the highest frequency, so as to enjoy longer inactive periods.
- Tasks are executed at the highest frequency
- During idle periods, the CPU frequency is reduced to the lowest possible

Why? Current processors are quite efficient at saving power when idle:
- The power of an idle core is much lower than the power during working periods
- The DAG requires no processing, unlike with the SRA

1.3 Simulation

Use of a simulator to evaluate the performance of the two strategies.

Input parameters:
- DAG capturing the tasks and dependencies of a blocked algorithm, plus the frequencies recommended by the Slack Reduction Algorithm and the Race-to-Idle Algorithm
- A simple description of the target architecture:
  - Number of sockets (physical processors) and number of cores per socket
  - Discrete range of frequencies and associated voltages
  - Frequency changes at core/socket level
  - Collection of real power measurements for each combination of frequency and idle/busy state per core
  - Cost (overhead) required to perform frequency changes

Static priority list scheduler:
- The duration of tasks at each available frequency is known in advance
- Tasks that lie on the critical path are prioritized
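
The following is a rough sketch of such a static priority-list scheduler, assuming per-task durations (at the assigned frequency) and priorities favouring critical-path tasks are given; it is not the simulator used in the talk.

    import heapq

    def list_schedule(cost, deps, priority, n_cores):
        # cost[t]: duration of task t at its assigned frequency
        # deps:    edges (u, v) meaning v depends on u
        # priority[t]: larger for critical-path tasks (dispatched first)
        succ = {t: [] for t in cost}
        npred = {t: 0 for t in cost}
        for u, v in deps:
            succ[u].append(v)
            npred[v] += 1
        ready = [(-priority[t], t) for t in cost if npred[t] == 0]
        heapq.heapify(ready)
        running = []                          # (finish_time, core, task)
        free_cores, now = list(range(n_cores)), 0.0
        makespan = 0.0
        while ready or running:
            while ready and free_cores:       # dispatch ready tasks to idle cores
                _, t = heapq.heappop(ready)
                c = free_cores.pop()
                heapq.heappush(running, (now + cost[t], c, t))
            now, c, t = heapq.heappop(running)  # advance to the next completion
            makespan = now
            free_cores.append(c)
            for s in succ[t]:                 # release successors that become ready
                npred[s] -= 1
                if npred[s] == 0:
                    heapq.heappush(ready, (-priority[s], s))
        return makespan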

Setup — AMD Opteron 6128 (8 cores):
- Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time
- Frequency changes at core level, with f ∈ {2.00, 1.50, 1.20, 1.00, 0.80} GHz

Blocked algorithm for LUPP:
- Simulation independent of the actual implementation (LAPACK, libflame, etc.)
- Matrix size n from 768 to 5,632; block size b = 256

Metrics: compare the simulated time and energy consumption of the original algorithm with those of the modified SRA/RIA algorithms
- Simulated time: T_SRA/RIA vs. T_original; impact of SRA/RIA on time: IT = T_SRA/RIA / T_original × 100
- Simulated energy: E_SRA/RIA = Σ_{i=1}^{n} W_i · T_i, with W_i and T_i the power and duration of each simulated task/state; E_original is obtained analogously, with all tasks executed at f_max; impact of SRA/RIA on energy: IE = E_SRA/RIA / E_original × 100
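
A small sketch of how the simulated energy and the impact metrics could be accumulated, assuming the simulator emits per-interval (core, state, frequency, duration) records and a measured power table; the names are illustrative.

    def simulated_energy(schedule, power):
        # schedule: list of (core, state, freq_GHz, seconds) intervals
        # power:    dict (state, freq_GHz) -> Watts, measured on the real machine
        return sum(power[(state, f)] * t for _, state, f, t in schedule)

    def impact(modified, original):
        # Impact of SRA/RIA on time or energy, as a percentage of the original value.
        return modified / original * 100.0

    # Example (made-up variable names):
    # IT = impact(T_ria, T_original); IE = impact(E_ria, E_original)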

Impact of SRA/RIA on simulated time/energy for LUPP:

[Figure: % impact on time and % impact on energy vs. matrix size n, for RIA and SRA]

- SRA: time is compromised, increasing the consumption for the largest problem sizes; the increase in execution time is due to SRA being oblivious to the actual resources
- RIA: time is not compromised and consumption is reduced for large problem sizes

Impact of SRA/RIA on simulated time/application energy for LUPP (only power/energy due to the workload, i.e. the application):

[Figure: % impact on time and % impact on application energy vs. matrix size n, for RIA and SRA]

- SRA: time is compromised, increasing the consumption for the largest problem sizes; the increase in execution time is due to the SRA being oblivious to the actual resources
- RIA: time is not compromised and consumption is reduced for large problem sizes

1.4 Experimental Results (LUPP)

Integrate RIA into a runtime for the task-parallel execution of dense linear algebra algorithms: libflame + SuperMatrix

[Figure: runtime architecture — a symbolic analysis builds the queue of pending tasks + dependencies (DAG); the dispatcher moves tasks with no pending dependencies to the queue of ready tasks, from which worker threads 1..p, mapped to cores 1..p, execute them]

Two energy-aware techniques:
- RIA1: reduce the operating frequency when there are no ready tasks (Linux governor)
- RIA2: remove polling when there are no ready tasks (while ensuring a quick recovery)

Applicable to any runtime: SuperMatrix, SMPSs, QUARK, etc.!
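
A hedged sketch of how RIA1/RIA2 might look inside a worker runtime on Linux: RIA1 lowers a core's frequency through the standard cpufreq sysfs interface (the 'userspace' governor and root privileges are assumed), and RIA2 replaces busy-polling with blocking on a condition variable. This is not the actual SuperMatrix code; the constants and class names are assumptions.

    import threading

    IDLE_KHZ, BUSY_KHZ = 800_000, 2_000_000      # assumed frequency steps (kHz)

    def set_core_freq(core, khz):
        # RIA1: change a core's frequency via the Linux cpufreq sysfs interface
        # (requires the 'userspace' governor and root privileges).
        path = f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_setspeed"
        with open(path, "w") as f:
            f.write(str(khz))

    class ReadyQueue:
        # RIA2: workers block on a condition variable instead of busy-polling.
        def __init__(self):
            self._tasks, self._cv = [], threading.Condition()

        def push(self, task):
            with self._cv:
                self._tasks.append(task)
                self._cv.notify()                # quick recovery: wake one idle worker

        def pop(self, core):
            with self._cv:
                while not self._tasks:
                    set_core_freq(core, IDLE_KHZ)   # no ready tasks: run slow...
                    self._cv.wait()                 # ...and sleep instead of polling
                set_core_freq(core, BUSY_KHZ)       # work available: race to idle
                return self._tasks.pop(0)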

"Doing nothing well" — David E. Culler

AMD Opteron 6128, 1 core: power for different thread activities.

[Figure: power (Watts) over time (s) for MKL dgemm at 2.00 GHz, polling at 2.00 GHz, polling at 800 MHz, and blocking at 800 MHz]

Setup — AMD Opteron 6128 (8 cores):
- Frequency changes at core level, with f_max = 2.00 GHz and f_min = 800 MHz

Blocked algorithm for LUPP:
- Routine FLA_LU_piv from libflame v5.0
- Matrix size n from 2,048 to 12,288; block size b = 256

Metrics: compare the actual time and energy consumption of the original runtime with those of the modified RIA1/RIA2 runtime
- Actual time: T_RIA vs. T_original; impact of RIA on time: IT = T_RIA / T_original × 100
- Actual energy: E_RIA and E_original measured using an internal powermeter sampling at f = 25 Hz; impact of RIA on energy: IE = E_RIA / E_original × 100
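
For reference, energy can be recovered from the 25 Hz power trace with a simple rectangle rule (a sketch; the variable names are assumptions):

    def energy_from_samples(samples_watts, rate_hz=25.0):
        # Approximate energy (Joules) from a power trace sampled at a fixed rate.
        return sum(samples_watts) / rate_hz

    # Impact of RIA on energy, in percent (illustrative names):
    # IE = energy_from_samples(trace_ria) / energy_from_samples(trace_original) * 100.0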

Impact of RIA1/RIA2 on actual time/energy for LUPP:

[Figure: % impact on time and % impact on energy vs. matrix size n, for RIA1, RIA2 and RIA1+RIA2]

- Small impact on execution time
- Consistent savings of around 5% for RIA1+RIA2
- Not much hope for larger savings, because there is no opportunity for them: the cores are busy most of the time

Impact of RIA1/RIA2 on actual time/application energy for LUPP (only power/energy due to the workload, i.e. the application):

[Figure: % impact on time and % impact on application energy vs. matrix size n, for RIA1, RIA2 and RIA1+RIA2]

- Small impact on execution time
- Consistent savings of around 7–8% for RIA1+RIA2
- Not much hope for larger savings, because there is no opportunity for them: the cores are busy most of the time

2. ILUPACK

Sequential version (M. Bollhöfer):
- Iterative solver for large-scale sparse linear systems
- Multilevel ILU preconditioners for general and complex symmetric / Hermitian positive definite systems
- Based on inverse-based ILUs: incomplete LU decompositions that control the growth of the inverse triangular factors

Multi-threaded version (J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí):
- Real symmetric positive definite systems
- Construction of the preconditioner and PCG solver
- Algebraic parallelization based on a task tree
- Leverages task-parallelism in the tree
- Dynamic scheduling via a tailored run-time (OpenMP)

[Figure: partitioning of the sparse matrix A, its associated graph, and the resulting binary task tree used by the multi-threaded ILUPACK]

Integrate RIA into the multi-threaded version of ILUPACK.

[Figure: runtime architecture — the problem data A is partitioned with METIS/SCOTCH into a queue of pending tasks (binary tree); tasks with no pending dependencies move to the queues of ready tasks, executed by worker threads 1..p mapped to cores 1..p]

Same energy-aware techniques:
- RIA1+RIA2: reduce the operating frequency when there are no ready tasks (Linux governor) and remove polling when there are no ready tasks (while ensuring a quick recovery)

Setup — AMD Opteron 6128 (8 cores); linear system associated with a Laplacian equation (n ≈ 16M).

Impact of RIA1+RIA2 on actual power / application power for ILUPACK (preconditioner computation only):

[Figure: % impact on power and % impact on application power over the execution (per sample), for RIA1+RIA2]

3. Conclusions

Goal: combine scheduling + DVFS to save energy during the execution of linear algebra algorithms on multi-threaded architectures.

For (dense) LUPP:
- Slack Reduction Algorithm:
  - The DAG requires pre-processing
  - Currently does not take into account the number of resources
  - Increases the execution time as the matrix size grows, and thereby also the energy consumption
- Race-to-Idle Algorithm:
  - The algorithm is applied on-the-fly; no pre-processing is needed
  - Maintains the execution time in all cases
  - Reduces the application energy consumption (by around 7–8%)
- As the ratio between the problem size and the number of resources grows, the opportunities to save energy decrease!

For (sparse) ILUPACK:
- Race-to-Idle Algorithm:
  - The DAG requires no processing; the algorithm is applied on-the-fly
  - Maintains the execution time in all cases
  - Reduces the power dissipated by the application by up to 40%
- Significant opportunities to save energy!

General: we need architectures that know how to do nothing better!

More information:
- Improving power-efficiency of dense linear algebra algorithms on multi-core processors via slack control. P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí. Workshop on Optimization Issues in Energy Efficient Distributed Systems (OPTIM 2011), Istanbul (Turkey), July 2011.
- DVFS-control techniques for dense linear algebra operations on multi-core processors. P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. 2nd Int. Conf. on Energy-Aware High Performance Computing (EnaHPC 2011), Hamburg (Germany), Sept. 2011.
- Saving energy in the LU factorization with partial pivoting on multi-core processors. P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. 20th Euromicro Conf. on Parallel, Distributed and Network-based Processing (PDP 2012), Garching (Germany), Feb. 2012.

