Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Special Session on Energy-aware Systems. Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors. Pedro Alonso (1), Manuel F. Dolz (2), Francisco D. Igual (2), Rafael Mayo (2), Enrique S. Quintana-Ortí (2). February 15-17, 2012, Garching bei München (Germany)

Motivation
High-performance computing: optimization of the algorithms applied to solve complex problems.
Technological advances improve performance: a higher number of cores per socket (processor) and large numbers of processors and cores, but also high energy consumption.
This motivates methods, algorithms and techniques to reduce the energy consumption of high-performance computing.

Outline
1 Introduction
2 The LU factorization: the right-looking algorithm and its parallelization
3 Energy-aware techniques: EA1 (reduce operation frequency when there are no ready tasks) and EA2 (remove polling when there are no ready tasks)
4 Experimental evaluation: environment setup and results
5 Conclusions

Introduction
Scheduling the tasks of dense linear algebra algorithms; examples: Cholesky, QR and LU factorizations.
Energy-saving tools are available for multi-core processors; example: Dynamic Voltage and Frequency Scaling (DVFS).
Current strategies combine task scheduling with DVFS for power-aware scheduling on multi-core processors:
Slack Reduction: reduce the frequency of the cores that will execute non-critical tasks to decrease idle times without sacrificing the total performance of the algorithm.
Race-to-Idle: execute all tasks at the highest frequency to enjoy longer inactive periods.
Both aim at energy savings.

Dense linear algebra operations: LU factorization with partial pivoting
PA = LU, where A ∈ R^(n×n) is a nonsingular matrix, P ∈ R^(n×n) is a permutation matrix, and L, U ∈ R^(n×n) are unit lower triangular and upper triangular matrices, respectively.
We consider a partitioning of the matrix A into blocks of size b × b.
For numerical stability, permutations are introduced to prevent operating with small pivot elements.

The right-looking algorithm: LU factorization using FLAME notation
n(·) stands for the number of columns of its argument; trilu(·) denotes the matrix consisting of the elements in the lower triangular part, with the diagonal replaced by ones; pivoting is omitted for simplicity.
Some cost details: the blocked algorithm performs 2n^3/3 + O(n^2) flops, most of them cast in terms of gemm operations.

Algorithm: A := LUP_blk(A)
Partition A -> [ A_TL, A_TR ; A_BL, A_BR ], where A_TL is 0 x 0
while n(A_TL) < n(A) do
  Determine block size b
  Repartition [ A_TL, A_TR ; A_BL, A_BR ] -> [ A_00, A_01, A_02 ; A_10, A_11, A_12 ; A_20, A_21, A_22 ], where A_11 is b x b
    [ A_11 ; A_21 ] := LUP_unb( [ A_11 ; A_21 ] )   (panel factorization)
    A_12 := trilu(A_11)^(-1) A_12                   (trsm)
    A_22 := A_22 - A_21 A_12                        (gemm)
  Continue with [ A_TL, A_TR ; A_BL, A_BR ] <- [ A_00, A_01, A_02 ; A_10, A_11, A_12 ; A_20, A_21, A_22 ]
endwhile
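The FLAME algorithm above maps almost directly onto standard LAPACK/BLAS kernels. Below is a minimal sketch of the blocked right-looking LU with partial pivoting written against the generic LAPACKE/CBLAS interfaces; it is not the libflame/SuperMatrix code used in this work, it assumes column-major storage, and the function name lu_blocked is only illustrative.

```c
#include <cblas.h>
#include <lapacke.h>

/* Blocked right-looking LU with partial pivoting, PA = LU (column-major).
 * A is n x n with leading dimension lda, ipiv has n entries, b is the block size. */
static void lu_blocked(int n, double *A, int lda, lapack_int *ipiv, int b)
{
    for (int k = 0; k < n; k += b) {
        int kb = (n - k < b) ? n - k : b;   /* current block size        */
        int m  = n - k;                     /* rows in the current panel */

        /* Panel factorization: [A11; A21] := LUP_unb([A11; A21]). */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, kb, &A[k + k * lda], lda, &ipiv[k]);

        /* Make the panel pivot indices global and apply the row swaps
         * to the columns at the left and at the right of the panel. */
        for (int i = k; i < k + kb; i++) ipiv[i] += k;
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, k,
                       A, lda, k + 1, k + kb, ipiv, 1);                    /* left  */
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, n - k - kb,
                       &A[(k + kb) * lda], lda, k + 1, k + kb, ipiv, 1);   /* right */

        if (k + kb < n) {
            /* A12 := trilu(A11)^(-1) A12   (trsm, unit lower triangular). */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasUnit,
                        kb, n - k - kb, 1.0,
                        &A[k + k * lda], lda,
                        &A[k + (k + kb) * lda], lda);

            /* A22 := A22 - A21 A12   (gemm, rank-b update). */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, n - k - kb, kb, -1.0,
                        &A[(k + kb) + k * lda], lda,
                        &A[k + (k + kb) * lda], lda, 1.0,
                        &A[(k + kb) + (k + kb) * lda], lda);
        }
    }
}
```

The panel factorization plays the role of LUP_unb, while the trsm and gemm calls correspond to the updates of A_12 and A_22; in the runtime-based version discussed next, each of these calls on a b × b block becomes a task in the DAG.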

Parallelization
Parallelization is not trivial at the code level! We use the SuperMatrix runtime to execute libflame routines:
Operations are not executed in the order in which they appear in the code (control flow): SuperMatrix schedules the execution respecting data dependencies (data-flow parallelism).
SuperMatrix proceeds in two stages:
1 A symbolic execution produces a DAG containing the dependencies.
2 The DAG dictates the feasible orderings in which the tasks can be executed.
Figure: DAG for a matrix consisting of 5 × 5 blocks.

SuperMatrix runtime
The symbolic analysis of the algorithm builds a queue of pending tasks plus their dependencies (the DAG); a dispatcher maintains a queue of ready tasks (tasks with no pending dependencies), which are executed by worker threads, one per core (worker thread 1 on core 1, ..., worker thread p on core p).
Basic scheduling:
1 Initially, only one task is in the ready queue.
2 A thread acquires a task from the ready queue and runs the corresponding job.
3 Upon completion, it checks the tasks that were in the pending queue, moving them to the ready queue if their dependencies are now satisfied.
Problem! Idle threads (one per core) continuously check the ready list for work: busy-wait or polling, and therefore energy consumption!
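To make the busy-wait problem concrete, here is a hypothetical sketch of such a polling worker loop. It is not the actual SuperMatrix source; ready_queue_try_pop(), task_execute(), release_dependents() and all_tasks_done() are placeholder names standing in for the runtime's internal operations.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct task task_t;

extern bool all_tasks_done(void);
extern bool ready_queue_try_pop(task_t **t);   /* non-blocking pop from ready queue   */
extern void task_execute(task_t *t);           /* run the kernel (getrf, trsm, gemm)  */
extern int  release_dependents(task_t *t);     /* move now-ready tasks to ready queue */

/* Busy-wait worker: intended as a pthread start routine, one thread per core. */
static void *worker(void *arg)
{
    (void)arg;
    while (!all_tasks_done()) {
        task_t *t;
        if (ready_queue_try_pop(&t)) {
            task_execute(t);           /* run the job                        */
            release_dependents(t);     /* update dependencies in the DAG     */
        }
        /* else: spin and ask again immediately. The busy-wait keeps the core
         * active even when there is no work, which is exactly the source of
         * wasted energy pointed out above. */
    }
    return NULL;
}
```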

Energy-Aware Techniques in SuperMatrix
Modern Linux distributions leverage DVFS (e.g., via the cpufreq-utils package) to increase/reduce the operation frequency of the cores. Governors control the operation frequency:
performance: the frequency is always fixed to the highest value.
ondemand: the frequency is controlled by the cpufreq kernel module, which sets the highest frequency when the core is loaded and the lowest when the core is idle.
userspace: the frequency is controlled by the user.
Dependencies between tasks produce idle periods, and idle periods can be exploited to save energy. We consider a Race-to-Idle approach. Why? Current processors are quite efficient at saving power when idle: the power of an idle core is much smaller than the power during working periods.
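As an illustration of how a runtime (or a user) can drive these governors programmatically, the following sketch writes to the Linux cpufreq sysfs files that cpufreq-utils also manipulates; the paths and their availability depend on the kernel configuration, and error handling is kept minimal.

```c
#include <stdio.h>

/* Select a governor ("performance", "ondemand", "userspace") for one core. */
static int set_governor(int core, const char *governor)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", core);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s\n", governor);
    return fclose(f);
}

/* With the userspace governor active, pin one core to a target frequency in kHz,
 * e.g. 2000000 for 2.0 GHz or 800000 for 800 MHz on the AMD 6128 used here. */
static int set_frequency_khz(int core, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}
```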

EA1: Reduce operation frequency when there are no ready tasks
Objective: implement a more efficient Race-to-Idle technique.
Is there a ready task for a polling thread in the ready list?
No: the runtime immediately sets the operation frequency of the associated core to the lowest possible value.
Yes: the frequency is raised back to the highest value and the task is executed by the thread.
The Linux governor is set to userspace: the frequency is controlled by SuperMatrix.
Experiments on an 8-core AMD 6128: the AMD PowerNow! DVFS technology allows the frequency to be controlled at core level.
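A sketch of EA1 on top of the polling loop shown earlier, reusing the placeholder declarations from that sketch and the set_frequency_khz() helper above; again, this is illustrative rather than the actual SuperMatrix implementation.

```c
#define FREQ_HIGH_KHZ 2000000L   /* 2.0 GHz on the AMD 6128 */
#define FREQ_LOW_KHZ   800000L   /* 800 MHz                 */

/* EA1 worker: the governor is assumed to be "userspace", so the runtime itself
 * raises/lowers the frequency of the core this thread is bound to. */
static void *worker_ea1(void *arg)
{
    int core = *(int *)arg;                          /* core this thread is bound to */
    while (!all_tasks_done()) {
        task_t *t;
        if (ready_queue_try_pop(&t)) {
            set_frequency_khz(core, FREQ_HIGH_KHZ);  /* work at full speed           */
            task_execute(t);
            release_dependents(t);
        } else {
            set_frequency_khz(core, FREQ_LOW_KHZ);   /* no ready task: keep polling, */
        }                                            /* but at the lowest frequency  */
    }
    return NULL;
}
```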

EA2: Remove polling when there are no ready tasks
Objective: replace busy-waits by idle-waits.
Is there a ready task for a polling thread in the ready list?
No: the thread blocks/suspends itself on a POSIX semaphore by calling sem_wait().
Yes:
1 The thread acquires and executes the task.
2 It updates the dependencies, moving k tasks from the pending queue to the ready queue.
3 The thread activates k-1 sleeping threads to attend the ready tasks by calling sem_post() k-1 times.
The Linux governor is set to ondemand: highest frequency when working, lowest when idle.
POSIX semaphores allow suspension states for non-working threads: idle-wait, low energy consumption.
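A sketch of EA2, replacing the busy-wait with an idle-wait on a POSIX semaphore. It reuses the placeholder declarations from the earlier polling sketch and omits shutdown handling, so it should be read as an illustration of the mechanism rather than as the actual runtime code.

```c
#include <semaphore.h>

static sem_t ready_sem;   /* counts tasks available in the ready queue;       */
                          /* initialized elsewhere with sem_init(&ready_sem, 0, 1) */

static void *worker_ea2(void *arg)
{
    (void)arg;
    while (!all_tasks_done()) {
        task_t *t;
        if (!ready_queue_try_pop(&t)) {     /* nothing ready for this thread...   */
            sem_wait(&ready_sem);           /* ...idle-wait instead of polling    */
            continue;                       /* woken up: try to pop again         */
        }
        task_execute(t);
        int k = release_dependents(t);      /* k tasks moved pending -> ready     */
        for (int i = 0; i < k - 1; i++)     /* keep one task for this thread and  */
            sem_post(&ready_sem);           /* wake up to k-1 sleeping threads    */
    }
    return NULL;
}
```

With the threads blocked in sem_wait() instead of spinning, the ondemand governor can drop the frequency of the idle cores on its own, which is why EA2 does not need the userspace governor.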

Power for different thread activities (figure: power in Watts over a 30 s trace):
MKL dgemm at 2.0 GHz: 97-100 W
Polling at 2.0 GHz: 93-97 W
Polling at 800 MHz: 83-88 W
Blocking at 800 MHz: 49-55 W

Environment setup:
8-core AMD 6128 processor (2.0 GHz) with 24 GBytes of RAM.
Discrete collection of frequencies: {2.0, 1.5, 1.2, 1.0, 0.8} GHz.
Linux Ubuntu 10.04.
BLAS and LAPACK implementations from MKL 10.2.4.
SuperMatrix runtime in libflame v5.0-r5587.
Two energy-saving techniques: EA1 + EA2.
Benchmark algorithm: LU with partial pivoting, routine FLASH_LU_piv (blocked right-looking variant of the LU factorization).
Block size: b = 512.
Matrix sizes range from 2,048 to 12,288.

Environment setup, power measurements:
Internal DC powermeter (ASIC) directly attached to the lines connecting the power supply unit and the motherboard (chipset + processors).
Sampling frequency of 25 Hz, i.e., 25 samples/sec.
Figure: measurement setup, with the internal power meter placed between the power supply unit and the motherboard, and the measurement software running on a separate computer connected via USB.

Experiments. Evaluation:
1 We evaluate the existence and length of idle periods during the computation of the LU factorization.
2 We measure the actual gains that can be attained by our energy-aware approaches, comparing the original runtime with three energy-aware versions of SuperMatrix:
Original runtime (performance governor).
EA1: reduces the operation frequency during polling periods (userspace governor).
EA2: integrates semaphores to block idle threads and avoid polling (ondemand governor).
EA1+EA2 (userspace governor).
Each experiment was repeated 30 times and average values of energy consumption are reported; the maximum coefficient of variation was lower than 0.8 %.

Thread activity during the execution of the LU factorization with partial pivoting (figure: percentage of the total time during which 1 to 8 threads run concurrently, for matrix sizes from 2,048 to 12,288):
For small problem sizes, a single thread is active during 53 % of the time and all threads are performing work during only 9 % of the time.
For large problem sizes, a single thread is active during 26 % of the time and all threads are performing work during 74 % of the time.

Impact on absolute and relative execution time of the LU factorization with partial pivoting using the energy-aware techniques (figures: absolute time in seconds, and time relative to the original runtime, for SuperMatrix and the EA1, EA2 and EA1+EA2 variants; matrix sizes from 2,048 to 12,288):
These techniques introduce a minimal overhead in execution time.
The slightly longer execution time is due to the period required to wake up blocked threads.
The increase is about 2.5 % for the smallest problem sizes; for the other sizes there is no difference.

Impact on absolute and relative energy consumption of the LU factorization with partial pivoting using the energy-aware techniques (figures: absolute energy in Wh, and energy relative to the original runtime, for SuperMatrix and the EA1, EA2 and EA1+EA2 variants; matrix sizes from 2,048 to 12,288):
For the smallest problem sizes, the number of tasks is relatively low compared with the number of threads, which produces longer idle periods during the execution of the algorithm and, therefore, energy savings.
For the largest problem sizes, EA1 potentially leads to savings of about 2 %, and EA2 of about 5 %.
The combination of both techniques produces energy savings similar to those obtained with EA2 alone.

Impact on absolute and relative application energy consumption of the LU factorization with partial pivoting using the energy-aware techniques (figures: absolute application energy in Wh, and application energy relative to the original runtime, for SuperMatrix and the EA1, EA2 and EA1+EA2 variants; matrix sizes from 2,048 to 12,288):
New metric: application consumption. To obtain the application energy, the idle power (75 W) is subtracted from each result: E = (P - 75) · t, with P the average power and t the execution time.
Considering only the application energy, the gains with EA2 and EA1+EA2 range from 22 % for the smallest problem sizes to 7 % for the largest ones.
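As a small worked illustration of this metric (the 95 W and 30 s below are made-up numbers, used only to show the unit conversion from joules to watt-hours):

```c
#include <stdio.h>

/* Application energy as defined above: subtract the idle power of the platform
 * (75 W here) from the measured average power, multiply by the execution time,
 * and convert from W*s (joules) to Wh. */
static double application_energy_wh(double avg_power_w, double time_s)
{
    const double idle_power_w = 75.0;
    return (avg_power_w - idle_power_w) * time_s / 3600.0;
}

int main(void)
{
    /* Example: 95 W average power over a 30 s factorization. */
    printf("Application energy: %.3f Wh\n", application_energy_wh(95.0, 30.0));
    return 0;
}
```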

Some conclusions:
Idle periods appear when executing algorithms with data dependencies.
Race-to-Idle: current processors are quite good at saving power when idle, so it is generally better to run as fast as possible to produce longer idle periods, and then optimize those idle periods for energy savings.
The LU factorization with partial pivoting is just one example of an algorithm with idle periods!
We presented two techniques to leverage the inactive periods.
Results: a reduction of the energy consumption of 5 %, or 22 % if we only consider the application consumption, with a negligible increase in execution time.
Basic idea to save energy: avoid active polling (busy-wait) in the runtime. "Do nothing well!" (David E. Culler)

Thanks for your attention! Questions?