Saving Energy in Sparse and Dense Linear Algebra Computations


Saving Energy in Sparse and Dense Linear Algebra Computations
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca
Univ. Politécnica de Valencia, Spain / Univ. Jaume I de Castellón, Spain / The Univ. of Texas at Austin, TX
quintana@icc.uji.es
SIAM PPSC, Savannah (GA), Feb. 2012

Introduction and motivation

Motivation: reduce energy consumption!
- Costs over the lifetime of an HPC facility often exceed the acquisition costs.
- Carbon dioxide emissions are a risk for health and the environment.
- Heat reduces hardware reliability.

Personal view:
- Hardware features mechanisms and modes to save energy.
- Scientific applications are, in general, energy-oblivious.

Motivation (cont'd)

Scheduling of task-parallel linear algebra algorithms:
- Examples: Cholesky, QR and LU factorizations, etc.; ILUPACK.

Energy-saving tools available for multi-core processors:
- Example: Dynamic Voltage and Frequency Scaling (DVFS).

Scheduling tasks + DVFS yields energy-aware execution on multi-core processors. Two scheduling strategies:
- SRA (Slack Reduction Algorithm): reduce the frequency of the cores that will execute non-critical tasks, decreasing idle times without sacrificing the total performance of the algorithm.
- RIA (Race-to-Idle Algorithm): execute all tasks at the highest frequency so as to enjoy longer inactive periods.

Outline
1. LU factorization with partial pivoting
   1.1 Slack Reduction Algorithm
   1.2 Race-to-Idle Algorithm
   1.3 Simulation
   1.4 Experimental Results
2. ILUPACK
   2.1 Race-to-Idle Algorithm
   2.2 Experimental Results
3. Conclusions

1. LU Factorization with Partial Pivoting (LUPP)

LU factorization: factor A = LU, with L, U in R^{n x n} unit lower triangular and upper triangular matrices, respectively.

For numerical stability, permutations are introduced to avoid operating with small pivot elements:
PA = LU, with P in R^{n x n} a permutation matrix that interchanges rows of A.
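
The factorization can be checked numerically; as a quick illustration (my own addition, using SciPy rather than anything from the slides):

```python
import numpy as np
from scipy.linalg import lu

# Random nonsingular test matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# scipy.linalg.lu returns P, L, U with A = P @ L @ U,
# i.e. P.T @ A = L @ U, matching the slides' PA = LU.
P, L, U = lu(A)
assert np.allclose(P.T @ A, L @ U)
```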

Blocked algorithm for LUPP (no pivoting, for simplicity):

for k = 1 : s do
  A_{k:s,k} = L_{k:s,k} U_{kk}                       (LU factorization)
  for j = k+1 : s do
    A_{kj} := L_{kk}^{-1} A_{kj}                     (triangular solve)
    A_{k+1:s,j} := A_{k+1:s,j} - A_{k+1:s,k} A_{kj}  (matrix-matrix product)
  end for
end for

[Figure: DAG for a matrix of 5x5 blocks (s = 5), with nodes G_kk (panel factorizations), T_jk (triangular solves) and M_jk (matrix-matrix updates), and edges encoding the data dependencies among them.]
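
The loop maps directly onto code. Below is a minimal NumPy sketch of the right-looking blocked LU without pivoting; the helper names and the tiny demonstration block size are my own, so treat it as an illustration of the algorithm, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_nopiv_inplace(A):
    """Unblocked LU without pivoting, in place: A <- L\\U (L unit lower)."""
    m, n = A.shape
    for k in range(min(m, n)):
        A[k+1:, k] /= A[k, k]                       # column of L
        if k + 1 < n:
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])

def blocked_lu_nopiv(A, b=2):
    """Right-looking blocked LU without pivoting, mirroring the
    G (panel), T (solve) and M (update) tasks of the DAG above."""
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        lu_nopiv_inplace(A[k:, k:e])                # G: factorize panel
        for j in range(e, n, b):
            f = min(j + b, n)
            # T: A_kj <- L_kk^{-1} A_kj
            A[k:e, j:f] = solve_triangular(A[k:e, k:e], A[k:e, j:f],
                                           lower=True, unit_diagonal=True)
            # M: trailing update with the L panel below the diagonal block
            A[e:, j:f] -= A[e:, k:e] @ A[k:e, j:f]
    return A

# Diagonally dominant test matrix, so skipping pivoting is safe here:
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 6)) + 6 * np.eye(6)
F = blocked_lu_nopiv(X.copy(), b=2)
L = np.tril(F, -1) + np.eye(6)
U = np.triu(F)
assert np.allclose(L @ U, X)
```

Each iteration produces one G task plus one T and one M task per trailing block column, which is exactly the node set of the DAG above.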

1.1 Slack Reduction Algorithm

Strategy: search for slacks (idle periods) in the DAG associated with the algorithm, and try to minimize them by applying, e.g., DVFS.

1. Search for slacks via the Critical Path Method (CPM) on the DAG of dependencies:
   - Nodes are tasks; edges are dependencies.
   - ES_i / LF_i: earliest and latest times task T_i, with cost C_i, can start/finish without increasing the total execution time of the algorithm.
   - S_i: slack (time) task T_i can be delayed without increasing the total execution time.
   - Critical path: collection of directly connected tasks, from the initial to the final node of the graph, with total slack = 0.
2. Minimize the slack of the tasks with S_i > 0 by reducing their execution frequency via the SRA.
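
To make ES, LF and S concrete, here is a small self-contained sketch of the CPM computation (my own code, assuming task costs and dependency edges are given):

```python
from graphlib import TopologicalSorter

def cpm(cost, deps):
    """Critical Path Method on a task DAG.
    cost: {task: duration}; deps: iterable of (pred, succ) edges.
    Returns {task: (ES, LF, slack)}; slack-0 tasks form the critical path."""
    preds = {t: [] for t in cost}
    succs = {t: [] for t in cost}
    for u, v in deps:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())
    es = {}
    for t in order:                       # forward pass: earliest start
        es[t] = max((es[p] + cost[p] for p in preds[t]), default=0.0)
    makespan = max(es[t] + cost[t] for t in cost)
    lf = {}
    for t in reversed(order):             # backward pass: latest finish
        lf[t] = min((lf[s] - cost[s] for s in succs[t]), default=makespan)
    return {t: (es[t], lf[t], lf[t] - cost[t] - es[t]) for t in cost}
```

Applied to the 5x5 LUPP DAG with the costs of the table that follows, the critical-path tasks (e.g., G11 and T21) come out with slack 0.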

Application of CPM to the DAG of the LUPP of a 5x5 blocked matrix:

Task   C        ES       LF       S
G11    109.011  0.000    109.011  0.000
T21    30.419   109.011  139.430  0.000
T41    30.419   109.011  287.374  147.944
T51    30.419   109.011  326.306  186.877
T31    30.419   109.011  225.082  85.652
...

The duration (cost C) of tasks of type G and M depends on the iteration! Evaluate the time of one flop for each type of task and, from its theoretical cost, approximate the execution time.

Slack Reduction Algorithm, three stages: (1) frequency assignment, (2) critical subpath extraction, (3) slack reduction.

Stage 1, frequency assignment: set the initial frequencies.
- Discrete range of frequencies: {2.00, 1.50, 1.20, 1.00, 0.80} GHz.
- The duration of tasks of type G and M depends on the iteration.
- Initially, all tasks are set to run at the highest frequency, 2.00 GHz.

[Figure: the LUPP DAG with all tasks initially assigned the highest frequency.]

Stage 2, critical subpath extraction: identify the critical subpaths.
1. Identify and extract the critical (sub)path(s).
2. Eliminate the graph nodes/edges that belong to it.
3. Repeat the process until the graph is empty.

Results: the critical path CP_0, plus three critical subpaths, from longest to shortest: CP_1 > CP_2 > CP_3.

[Figure: the LUPP DAG with the critical path and critical subpaths highlighted.]

Stage 3, slack reduction: determine the execution frequency for each task.
1. Reduce the frequency of the tasks of the longest unprocessed subpath.
2. Check the lowest frequency-reduction ratio for each task of that subpath.
3. Repeat the process until all subpaths are processed.

Results: the tasks of CP_1, CP_2 and CP_3 are assigned to run at 1.50 GHz!

[Figure: the LUPP DAG with the tasks of the critical subpaths annotated at 1.50 GHz.]
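
Putting the three stages together, here is a simplified greedy sketch of the idea (reusing the cpm helper above; sorting individual tasks by decreasing slack is my proxy for walking whole critical subpaths, and the cost model "duration scales as f_max/f" is an assumption):

```python
FREQS = [2.00, 1.50, 1.20, 1.00, 0.80]   # GHz, highest first (assumed set)

def makespan(cost, deps):
    info = cpm(cost, deps)                # cpm from the previous sketch
    return max(info[t][0] + cost[t] for t in cost)

def slack_reduction(cost, deps):
    """Give every non-critical task the lowest frequency that does not
    stretch the DAG's makespan. A sketch, not the authors' algorithm."""
    cost = dict(cost)                     # work on a copy
    freq = {t: FREQS[0] for t in cost}    # stage 1: all tasks at f_max
    base = makespan(cost, deps)
    info = cpm(cost, deps)                # stage 2: slacks at f_max
    for t in sorted(cost, key=lambda t: -info[t][2]):
        if info[t][2] <= 0:
            continue                      # critical path: keep f_max
        for f in FREQS[1:]:               # stage 3: try slower clocks
            trial = dict(cost)
            trial[t] = cost[t] * FREQS[0] / f
            if makespan(trial, deps) <= base:
                freq[t], cost[t] = f, trial[t]
            else:
                break
    return freq
```

The makespan check on every trial guarantees that, whatever order the tasks are visited in, the total execution time is never stretched.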

1.2 Race-to-Idle Algorithm

Race-to-Idle: complete the execution as soon as possible by running the tasks of the algorithm at the highest frequency, so as to enjoy longer inactive periods.
- Tasks are executed at the highest frequency.
- During idle periods, the CPU frequency is reduced to the lowest possible.

Why? Current processors are quite efficient at saving power when idle: the power of an idle core is much lower than the power during working periods. Moreover, unlike the SRA, the DAG requires no processing.
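
On Linux, the mechanism behind this policy can be the standard cpufreq sysfs interface with the userspace governor; a minimal sketch under those assumptions (requires root, and the frequency values are illustrative):

```python
CPUFREQ = "/sys/devices/system/cpu/cpu{core}/cpufreq/{attr}"

def set_speed(core, khz):
    """Pin a core's clock via the Linux 'userspace' cpufreq governor."""
    with open(CPUFREQ.format(core=core, attr="scaling_governor"), "w") as f:
        f.write("userspace")
    with open(CPUFREQ.format(core=core, attr="scaling_setspeed"), "w") as f:
        f.write(str(khz))

def worker(core, ready_tasks, f_max_khz=2_000_000, f_min_khz=800_000):
    """Race-to-idle: run ready tasks at f_max, drop to f_min when idle."""
    for task in ready_tasks:              # e.g. a blocking queue iterator
        set_speed(core, f_max_khz)
        task()
        set_speed(core, f_min_khz)        # idle until the next task arrives
```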

1.3 Simulation

A simulator is used to evaluate the performance of the two strategies.

Input parameters:
- The DAG capturing the tasks and dependencies of a blocked algorithm, and the frequencies recommended by the Slack Reduction Algorithm and the Race-to-Idle Algorithm.
- A simple description of the target architecture:
  - Number of sockets (physical processors) and number of cores per socket.
  - Discrete range of frequencies and associated voltages.
  - Frequency changes at core/socket level.
  - Collection of real power measurements for each combination of frequency and idle/busy state per core.
  - Cost (overhead) of performing frequency changes.

Static priority-list scheduler:
- The duration of the tasks at each available frequency is known in advance.
- Tasks that lie on the critical path are prioritized.
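
The architecture description amounts to a handful of parameters; a minimal sketch of how such a configuration might be encoded (field names and defaults are my own):

```python
from dataclasses import dataclass, field

@dataclass
class Architecture:
    """Target-machine description consumed by the simulator (a sketch)."""
    sockets: int = 1
    cores_per_socket: int = 8
    freqs_ghz: tuple = (2.00, 1.50, 1.20, 1.00, 0.80)
    per_core_dvfs: bool = True            # frequency changes at core level
    # measured power (W) per (frequency_ghz, "busy" or "idle") combination
    power_w: dict = field(default_factory=dict)
    dvfs_overhead_s: float = 0.0          # cost of one frequency change

opteron_6128 = Architecture(sockets=1, cores_per_socket=8)
```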

Simulation setup: AMD Opteron 6128 (8 cores).
- Evaluate the time of one flop for each type of task and, from its theoretical cost, approximate the execution time.
- Frequency changes at core level, with f in {2.00, 1.50, 1.20, 1.00, 0.80} GHz.

Blocked algorithm for LUPP:
- The simulation is independent of the actual implementation (LAPACK, libflame, etc.).
- Matrix size n from 768 to 5,632; block size b = 256.

Metrics: compare the simulated time and energy consumption of the original algorithm with those of the modified SRA/RIA algorithms.
- Impact on time: IT = T_{SRA/RIA} / T_{original} x 100.
- Simulated energy: E = sum_{i=1}^{n} W_i T_i, accumulating the power W_i drawn at each frequency and idle/busy state over the corresponding interval T_i; E_{original} is the analogous sum with all tasks running at f_max.
- Impact on energy: IE = E_{SRA/RIA} / E_{original} x 100.

Impact of SRA/RIA on simulated time and energy for LUPP (only the power/energy due to the workload, i.e., the application, is accounted for):

[Figure: percentage impact of SRA and RIA on simulated time (left) and simulated application energy (right), for matrix sizes n = 768 to 5,632.]

- SRA: time is compromised, which increases the consumption for the largest problem sizes. The increase in execution time is due to the SRA being oblivious to the actual number of resources.
- RIA: time is not compromised, and the consumption is reduced for large problem sizes.

1.4 Experimental Results for LUPP

Integrate RIA into a runtime for the task-parallel execution of dense linear algebra algorithms: libflame + SuperMatrix.

[Figure: runtime architecture. A symbolic analysis stage builds the queue of pending tasks and dependencies (the DAG); a dispatch stage moves tasks whose dependencies are satisfied to the queue of ready tasks, from which worker threads 1..p, each mapped to a core, draw work.]

Two energy-aware techniques:
- RIA1: reduce the operating frequency when there are no ready tasks (via the Linux governor).
- RIA2: remove polling when there are no ready tasks (while ensuring a quick recovery).

Applicable to any runtime: SuperMatrix, SMPSs, QUARK, etc.!
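
RIA2 amounts to blocking instead of spinning. A minimal sketch of the idea with a condition variable (my own illustration of the technique, not SuperMatrix code):

```python
import threading
from collections import deque

class ReadyQueue:
    """Ready-task queue where idle workers block instead of polling (RIA2)."""

    def __init__(self):
        self._tasks = deque()
        self._cv = threading.Condition()

    def push(self, task):
        with self._cv:
            self._tasks.append(task)
            self._cv.notify()             # quick recovery: wake one worker

    def pop(self):
        with self._cv:
            # Sleeping here lets the OS idle the core; a polling runtime
            # would spin on the queue at full power instead.
            while not self._tasks:
                self._cv.wait()
            return self._tasks.popleft()
```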

"Doing nothing well" (David E. Culler). AMD Opteron 6128, 1 core:

[Figure: power (Watts) over time for different thread activities: MKL dgemm at 2.00 GHz, blocking at 800 MHz, polling at 2.00 GHz, and polling at 800 MHz.]

Setup: AMD Opteron 6128 (8 cores), frequency changes at core level, with f_max = 2.00 GHz and f_min = 800 MHz.

Blocked algorithm for LUPP: routine FLA_LU_piv from libflame v5.0.
- Matrix size n from 2,048 to 12,288; block size b = 256.

Metrics: compare the actual time and energy consumption of the original runtime with those of the modified RIA1/RIA2 runtime.
- Impact of RIA on time: IT = T_RIA / T_original x 100.
- Actual energies E_RIA and E_original measured using an internal powermeter with a sampling frequency of 25 Hz.
- Impact of RIA on energy: IE = E_RIA / E_original x 100.

Impact of RIA1/RIA2 on actual time and energy for LUPP:

[Figure: percentage impact of RIA1, RIA2 and RIA1+RIA2 on actual time (left) and on energy (right) vs. matrix size n, for both total energy and application energy (the power/energy due to the workload only).]

- Small impact on execution time.
- Consistent savings for RIA1+RIA2: around 5% of the total energy, and around 7-8% of the application energy.
- There is not much hope for larger savings, because there is no opportunity for them: the cores are busy most of the time.

2. ILUPACK

Sequential version (M. Bollhöfer):
- Iterative solver for large-scale sparse linear systems.
- Multilevel ILU preconditioners for general, complex symmetric and Hermitian positive definite systems.
- Based on inverse-based ILUs: incomplete LU decompositions that control the growth of the inverse triangular factors.

Multi-threaded version (J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí):
- Real symmetric positive definite systems.
- Construction of the preconditioner and PCG solver.
- Algebraic parallelization based on a task tree; leverages the task-parallelism in the tree.
- Dynamic scheduling via a tailored run-time (OpenMP).
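
For context, the preconditioned CG iteration that the multi-threaded version parallelizes looks like the textbook sketch below (my own code, not ILUPACK's API), with the multilevel ILU hidden behind an apply_prec callback:

```python
import numpy as np

def pcg(A, b, apply_prec, tol=1e-8, maxit=1000):
    """Preconditioned Conjugate Gradient for SPD A (a textbook sketch).
    apply_prec(r) should return M^{-1} r, e.g. a multilevel ILU solve."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```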

[Figure: nested-dissection partitioning of the matrix A and its permuted form PA, the induced graph GA, and the resulting binary task tree TT, whose leaves (1,1)-(1,4) can be processed in parallel before their ancestors (2,1), (2,2) and the root (3,1).]

Integrate RIA into the multi-threaded version of ILUPACK.

[Figure: runtime architecture. The problem data (matrix A) is partitioned with METIS/SCOTCH into a queue of pending tasks (a binary tree); queues of ready tasks (no pending dependencies) feed worker threads 1..p, each mapped to a core.]

Same energy-aware techniques, RIA1+RIA2: reduce the operating frequency when there are no ready tasks (Linux governor), and remove polling when there are no ready tasks (while ensuring a quick recovery).

Setup: AMD Opteron 6128 (8 cores); linear system associated with a Laplacian equation (n ~ 16M).

Impact of RIA1+RIA2 on actual power and application power for ILUPACK (preconditioner computation only):

[Figure: total power (top) and application power (bottom) over the execution, original runtime vs. RIA1+RIA2.]

3. Conclusions

Goal: combine scheduling + DVFS to save energy during the execution of linear algebra algorithms on multi-threaded architectures.

For (dense) LUPP:
- Slack Reduction Algorithm:
  - The DAG requires pre-processing.
  - Currently does not take the number of resources into account.
  - Increases the execution time as the matrix size grows, and with it the energy consumption.
- Race-to-Idle Algorithm:
  - Applied on the fly; no pre-processing needed.
  - Maintains the execution time in all cases.
  - Reduces the application energy consumption (by around 7-8%).
- As the ratio between the problem size and the number of resources grows, the opportunities to save energy decrease!

For (sparse) ILUPACK:
- Race-to-Idle Algorithm:
  - The DAG requires no processing; the algorithm is applied on the fly.
  - Maintains the execution time in all cases.
  - Reduces the power dissipated by the application by up to 40%.
- Significant opportunities to save energy!

In general: we need architectures that know how to do nothing better!

More information:
- P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí. "Improving power-efficiency of dense linear algebra algorithms on multi-core processors via slack control." Workshop on Optimization Issues in Energy Efficient Distributed Systems (OPTIM 2011), Istanbul (Turkey), July 2011.
- P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. "DVFS-control techniques for dense linear algebra operations on multi-core processors." 2nd Int. Conf. on Energy-Aware High Performance Computing (EnA-HPC 2011), Hamburg (Germany), Sept. 2011.
- P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí. "Saving energy in the LU factorization with partial pivoting on multi-core processors." 20th Euromicro Conf. on Parallel, Distributed and Network-based Processing (PDP 2012), Garching (Germany), Feb. 2012.
