Minimizing Energy Consumption of MPI Programs in Realistic Environment


Minimizing Energy Consumption of MPI Programs in Realistic Environment

Amina Guermouche, Nicolas Triquenaux, Benoît Pradelle and William Jalby
Université de Versailles Saint-Quentin-en-Yvelines

arXiv:1502.06733v2 [cs.DC] 25 Feb 2015

Abstract

Dynamic voltage and frequency scaling proves to be an efficient way of reducing the energy consumption of servers. Energy savings are typically achieved by setting a well-chosen frequency during some program phases. However, determining suitable program phases and their associated optimal frequencies is a complex problem. Moreover, hardware is constrained by non-negligible frequency transition latencies. Thus, various heuristics were proposed to determine and apply frequencies, but evaluating their efficiency remains an issue. In this paper, we translate the energy minimization problem into a mixed integer program that specifically models realistic hardware limitations. The problem solution then estimates the minimal energy consumption and the associated frequency schedule. The paper provides two different formulations and a discussion of the feasibility of each of them on realistic applications.

1 Introduction

For a very long time, computing performance was the only metric considered when launching a program. Scientists and users only cared about the time it took for a program to finish. Though this is still often true, the priority of many hardware architects and system administrators has shifted to caring more and more about energy consumption. Solutions reducing the energy envelope have been put forth. Among the different existing techniques, Dynamic Voltage and Frequency Scaling (DVFS) proved to be an efficient way to reduce processor energy consumption. The processor frequency is adapted according to its workload: when the frequency is lowered without increasing the execution time, the power consumption and energy are reduced.
With parallel applications in general, and more precisely with MPI applications, reducing the frequency on one processor may have a dramatic impact on the execution time of the application: reducing the processor frequency may delay a message sending, and maybe its reception. This may lead to cascading delays increasing the execution time. To save energy while respecting the application deadline, two main solutions exist: online tools and offline scheduling. The former try to provide the frequency schedule during the execution whereas the latter provide it after an offline study. Both require the application task graph (either through a previous execution or by focusing on iterative applications). Many online tools [?, ?] identify the critical path, the longest path through the graph, and focus on processors that do not execute these tasks. Typically, when waiting for a message, the processor frequency is set to the minimal frequency until the message arrives [?]. Although online tools allow some energy savings, they provide suboptimal savings because of a lack of application knowledge. On the other hand, offline scheduling algorithms [?, ?] provide the best execution frequency for each task. However, none of the existing algorithms consider the characteristics of most current multi-core architectures: (i) cores within the same processor share the same frequency [?] and (ii) switching frequency requires some time [?]. This paper presents two models based on linear programming which find the execution frequencies of each task while taking into account the multi-core architecture constraints and characteristics (section 3) previously described. Moreover, we allow the execution time to be increased if this leads to more energy

savings. The user provides a maximum performance degradation that she can tolerate. The presented models provide the optimal frequency schedule which minimizes the energy consumption. However, when considering large applications and large machines, no current solver can provide a result, not even parallel ones. The reason behind this issue is discussed in section 3.

2 Context and execution model

We consider MPI applications running on a multi-node platform. The targeted architectures have the following characteristics: (i) the latency of frequency switching is not negligible and (ii) cores within the same processor share the same frequency. A process, running on every core, executes a set of tasks. A task, denoted T_i, is defined as the computations between two communications. The application execution is represented as a task graph where tasks are vertices and edges are messages between the tasks. Figure 1 is an example of the task graph of an application running on two processes. One process executes tasks T1 and T2 while the other one executes tasks T3 and T4.

[Figure 1: Task graph, showing tasks T1, T2 on one process and T3, T4 on the other]

Before going into more details on the execution model, let us provide an example of the problem we want to solve. Consider the example provided in Figure 2. The application is executed on 3 cores, 2 in the same processor and one in another processor. Tasks T1, T2, T3 and T4 are executed on processor 0 while tasks T5 and T6 are executed on processor 1. In order to minimize the energy consumption through DVFS, we make the same assumption as [?]: tasks may have several phases and each phase can be executed at a specific frequency. Typically, in Figure 2, task T1 is divided into 3 phases. The first one is executed at frequency f1, the second one at frequency f2 and the last one at frequency f3. As stressed before, setting a frequency takes some time. In other words, when a frequency is requested, it is not set immediately. Thus, in Figure 2, when frequency f2 is requested, it is set some time after.
One needs to be careful of such situations since a frequency may be set after the task for which it was requested is over. Moreover, cores within the same processor run at the same frequency. Hence, in Figure 2, when f1 is first set on processor 0, all the tasks being executed at this time (T1 and T3) are executed at frequency f1. T5 is not affected since it is on another processor. To provide the best frequency to execute each task portion, we need to consider all parallel tasks which are executed at the same time on the processor.

[Figure 2: Frequency switch latency, showing tasks T1-T4 on processor 0 and T5, T6 on processor 1, with requested frequencies f1, f2, f3 applied after a delay. Note that only the latency of the first request is represented.]

Our model requires the task graph to be provided (through profiling or a complete execution of the application). Thus, we consider deterministic applications: for the same parameters and the same input data, the same task graph is generated. In order to guarantee that edges are the same over all possible executions, one has to make sure that the communications between the processes are the same. Non-deterministic communications in MPI are either receptions from an unknown source (by using MPI_ANY_SOURCE in the reception call) or non-deterministic completion events (MPI_Waitany for instance). Any application with such events is considered non-deterministic and thus falls out of the scope of the proposed solution.

[Figure 3: Slack time, showing the slack between the end of T3 and the reception of the message from T1]

Tasks within a core are totally ordered. If a task T_i ends with a send event, then the following task T_j starts exactly at the end of T_i. In Figure 1, task T2 starts exactly after T1 ends. On the other hand, when a task is created by a message reception (T4 in Figure 1), it cannot start before all the tasks it depends on finish (T1 and T3) and it has to wait for the message to be received. If the message arrives after the end of the task which is supposed to receive it, the time between the end of the task and the reception is known as slack time. In Figure 3, task T1 sends a message to T3 but T3 ends before receiving the message, creating the slack represented by dotted lines. A task energy consumption E_i is defined as the product of its execution time exec_i and its power consumption P_i. Since the application is composed of several tasks, its energy consumption can be expressed as the sum of the energy consumption of all the tasks. Thus, the goal translates into providing the set of frequencies at which to execute each task. Hence, one can calculate the application energy consumption as:

E = Σ_i E_i = Σ_i (exec_i × P_i)    (1)

Minimizing the energy consumption of the application is equivalent to minimizing E in equation (1). For each task T_i, both exec_i and P_i depend on the frequency of the different phases of the task.
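To make equation (1) concrete, here is a minimal Python sketch of the energy sum; the task list and its numbers are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of equation (1): application energy as the sum of
# per-task energies E_i = exec_i * P_i. The tasks and their execution
# times/powers below are made-up example values.
tasks = [
    {"name": "T1", "exec": 2.0, "power": 40.0},  # seconds, watts
    {"name": "T2", "exec": 1.5, "power": 55.0},
    {"name": "T3", "exec": 3.0, "power": 35.0},
]

def application_energy(tasks):
    """E = sum_i exec_i * P_i, in joules."""
    return sum(t["exec"] * t["power"] for t in tasks)

print(application_energy(tasks))  # 2*40 + 1.5*55 + 3*35 = 267.5 J
```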
In addition, tasks are not independent since, when executed in parallel on the same processor, they share the same frequency. Moreover, the overall execution time of the application depends on all the exec_i and the slack time. To minimize the energy consumption while still controlling the overall execution time, we express the problem as a linear program.

3 Building the linear program

The following paragraphs describe how the energy minimization problem translates into a linear program. We first describe the precedence constraints between the tasks, then we describe two formulations which consider the architecture constraints. Finally, we discuss the feasibility of the described solutions.

3.1 Precedence constraints

Let T_i be a task defined by its start time bt_i and its end time et_i. The beginning of tasks is bounded by the precedence relation between them. As already stressed, a task cannot start before its direct predecessors complete their execution. As explained in section 2, if T_i sends a message, its child task T_j starts exactly when T_i ends, since the end of the communication means the beginning of the next task. This translates to:

bt_j = et_i
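The precedence rule bt_j = et_i can be sketched as a forward pass over the totally ordered tasks of one core; the chain of tasks and their durations below are illustrative assumptions.

```python
# Sketch of the precedence rule: on one core, a task following a send starts
# exactly when its predecessor ends (bt_j = et_i). We compute (bt, et) pairs
# forward over a hypothetical chain of tasks; durations are made-up values.
durations = {"T1": 2.0, "T2": 1.0}  # illustrative per-task durations (seconds)
order = ["T1", "T2"]                # tasks totally ordered on the core

def start_end_times(order, durations, t0=0.0):
    times, clock = {}, t0
    for t in order:
        bt = clock                 # bt_j = et_i of the previous task
        et = bt + durations[t]
        times[t] = (bt, et)
        clock = et
    return times

print(start_end_times(order, durations))  # {'T1': (0.0, 2.0), 'T2': (2.0, 3.0)}
```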

Table 1: Task variables

  bt_i       Beginning of a task T_i
  et_i       End of a task T_i
  bts_i      Beginning of a slack task Ts_i
  ets_i      End of a slack task Ts_i
  exec_i^f   The execution time of a task T_i if executed completely at frequency f
  tt_i^f     The time during which the task T_i is executed at frequency f
  δ_i^f      The fraction of time a task T_i spends at frequency f
  M_ij       Message transmission time from task T_j to task T_i

On the other hand, when T_i ends with a message reception from T_k, one has to make sure that its successor task T_j starts after both tasks end. Moreover, as pointed out in section 2, when a task receives a message, some slack may be introduced before the reception. Slack is handled the same way tasks are: it has a start and an end time and it can be executed at different frequencies depending on the tasks on the other cores. In Figure 3, the slack after T3 may be executed at different frequencies depending on whether it is executed in parallel with T1 or T2. To ease the presentation, we assume that each task T_i receiving a message (from a task T_k) is followed by a slack task, denoted Ts_i. The beginning of Ts_i, denoted bts_i, is exactly equal to the end of T_i:

bts_i = et_i    (2)

whereas its end time, denoted ets_i, is at least equal to the arrival time of the message from T_k. Let M_ki denote the transmission time from T_k to T_i. Thus:

ets_i ≥ et_k + M_ki    (3)

Note that a task may receive messages from different processes (after a collective communication for example) and equation (3) has to be valid for all of them. Finally, since T_j, the successor task of T_i, has to start after both T_i and T_k finish, one just needs to make sure that:

bt_j = ets_i

In order to compute the end time of a task T_i (et_i), one has to evaluate the execution time of T_i. As explained above, a task may be executed at different frequencies. Let exec_i^f be the execution time of T_i if executed completely at frequency f. Every frequency can be used to run a fraction δ_i^f of the total execution of the task. Let tt_i^f be the time T_i spends at frequency f. It can be expressed as: tt_i^f = δ_i^f × exec_i^f.
Thus, the end time of a task is:

et_i = bt_i + Σ_f tt_i^f

Note that one has to make sure that a task is completely executed:

Σ_f δ_i^f = 1    (4)

Finally, since the power consumption depends on the frequency, let P_i^f be the power consumption of the task T_i when executed at frequency f. Using this formulation, the objective function of the linear program becomes:

min Σ_{T_i} Σ_f (tt_i^f × P_i^f)    (5)
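The per-task quantities above can be sketched in a few lines of Python; the frequencies, execution times, powers and fractions below are assumptions made up for the example, not data from the paper.

```python
# Illustrative sketch of the per-task variables around equations (4) and (5).
# Given fractions delta_i^f that sum to 1 (equation (4)), the time at each
# frequency is tt_i^f = delta_i^f * exec_i^f, the end time is
# et_i = bt_i + sum_f tt_i^f, and the objective (5) sums tt_i^f * P_i^f.
exec_t = {"T1": {"f1": 4.0, "f2": 2.0}}   # exec_i^f: full-task time per frequency
power  = {"T1": {"f1": 20.0, "f2": 45.0}} # P_i^f: power draw per frequency
delta  = {"T1": {"f1": 0.5, "f2": 0.5}}   # delta_i^f: fraction run per frequency

def phase_times(task):
    """tt_i^f = delta_i^f * exec_i^f, after checking equation (4)."""
    assert abs(sum(delta[task].values()) - 1.0) < 1e-9
    return {f: delta[task][f] * exec_t[task][f] for f in delta[task]}

def end_time(task, bt):
    """et_i = bt_i + sum_f tt_i^f."""
    return bt + sum(phase_times(task).values())

def objective():
    """Objective (5): sum over tasks and frequencies of tt_i^f * P_i^f."""
    return sum(tt * power[t][f]
               for t in delta for f, tt in phase_times(t).items())

print(end_time("T1", 0.0))  # 3.0
print(objective())          # 85.0
```

In an actual LP, tt_i^f (or δ_i^f) would be decision variables rather than fixed numbers; the sketch only shows how the quantities relate.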

One can just use tt_i^f in the objective function as it is expressed in equation (5), and the solver would provide the values of tt_i^f for all tasks at all frequencies. This solution was presented in [?]. That solution can be used on different architectures than the ones we target in this work. As a matter of fact, nothing constrains parallel tasks on one processor to run at the same frequency, and the threshold for switching frequency is not considered either. Moreover, no constraint on the execution time is expressed. The following paragraphs first describe how the performance is handled, then they introduce additional constraints that handle the architecture constraints and the execution time.

3.2 Execution time constraints

The performance of an application is a major concern, whether the energy consumption is considered or not. In this paragraph we provide constraints which consider the execution time of the application. In MPI, all programs end with MPI_Finalize, which is similar to a global barrier. Let last_task_i be the last task on core i (the MPI_Finalize task). Since the application ends with a global communication, every task last_task_i is followed by a slack task last_slack_task_i. The difference between the global communication slack and the other slack tasks lies in the end time: the end time of all slack tasks of a global communication is the same (all processes leave the barrier at the same time). Thus, for every couple of cores (i, j):

ets_last_slack_task_i = ets_last_slack_task_j    (6)

Let total_Time be the application execution time: it is equal to the end time of the last slack task.

total_Time = ets_last_slack_task_i    (7)

However, in some cases, increasing the execution time of an application could benefit the energy consumption. In order to allow this performance loss to a specified extent, the user limits the degradation to a factor x of the maximal performance. Let exec_Time be the execution time when all tasks run at the maximal frequency, and x the maximum performance loss percentage allowed by the user.
The following constraint allows performance loss with respect to x:

total_Time ≤ exec_Time + (exec_Time × x) / 100

The next sections describe two different formulations. In the first formulation, the solver is provided with all possible task configurations and chooses the one minimizing energy consumption. In the second formulation, the solver provides the exact time of every frequency switch on each processor.

3.3 Architecture constraints: the workload approach

In order to provide the optimal frequency schedule, the linear program is provided with all possible task configurations, i.e., all possible sets of parallel tasks, known as workloads. Then the solver provides the execution frequency of each workload.

3.3.1 Shared frequency constraint

We need to express that tasks executed at the same time on the same processor run at the same frequency. Hence, we first need to identify tasks executed in parallel on the same processor. Depending on the frequency being used, the set of parallel tasks may change. Figure 4 is an example of two different executions running at the maximal and minimal frequency. Only processes that belong to the same processor are represented. In Figure 4a, when the processor runs at f_max, the set of couples of tasks which are parallel is {(T1,T3), (T1,Ts3), (Ts1,Ts3), (T2,T4)} (represented by red dotted lines). When the frequency is set to f_min (Figure 4b), the slack after T3 is completely covered and the set of parallel tasks becomes {(T1,T3), (Ts1,T3), (T2,T4)}.

Table 2: Workload formulation variables

  bw_i      Beginning of a workload W_i
  ew_i      End of a workload W_i
  tw_i^f    The time a workload W_i is executed at frequency f
  dw_i      The duration of a workload
  tw̄_i^f    A binary variable used to say whether a workload is executed at frequency f or not

In order to provide all possible configurations, we define the processor workloads. A workload, denoted W_i, is a tuple of potentially parallel tasks. In Figure 4, W1 = (T1,T3), W2 = (Ts1,T3) and W3 = (T1,Ts3) represent a subset of the possible workloads. Note that no two workloads have the same set of tasks. In other words, once a task in a workload is over, a new workload begins. On the other hand, a task can belong to several workloads (like T1 in Figure 4a).

[Figure 4: Workloads, (a) at f_max and (b) at f_min]

Recall that our goal is to calculate the fraction of time each task should spend at each frequency (tt_i^f) in order to minimize the energy consumption of the application according to the objective function (5). Since tasks may be executed at several frequencies, so may a workload. In Figure 5, the workload W1 = (T1,T3) is executed at frequency f1 then at frequency f2. Thus, since T1 belongs to both W1 = (T1,T3) and W2 = (T1,Ts3), the execution time of T1 at frequency f1 (tt_1^f1) can be calculated using the fraction of time W1 and W2 spend at frequency f1. In other words, the execution time of a task can be calculated from the execution times of the workloads it belongs to. Let tw_i^f be the fraction of time the workload W_i spends at frequency f. Thus:

tt_i^f = Σ_{W_j : T_i ∈ W_j} tw_j^f    (8)

[Figure 5: Workloads and tasks execution, showing workloads W1, W2, W3, W4 over tasks T1, T2, T4 and Ts3 at frequencies f1 and f2]

Using the execution time of a workload at a specific frequency (tw_i^f), one can calculate the duration of a workload, dw_i, as:

dw_i = Σ_f tw_i^f

3.3.2 Handling frequency switch delay

Recall that one of the problems when considering DVFS is the time required to actually set a new frequency. Thus, before setting a frequency, one has to make sure that the duration of the workload is long enough to tolerate the frequency change, since changing frequency takes some time. In other words, if the frequency f is set in a workload W_i, tw_i^f must be larger than a user-defined threshold, denoted Th:

∀ W_i, ∀ f : tw_i^f ≥ Th × tw̄_i^f    (9)

tw̄_i^f is a binary variable used to guarantee that definition (9) remains true when tw_i^f = 0:

tw̄_i^f = 0 if tw_i^f = 0; 1 otherwise    (10)

The expression of definition (10) as a mixed binary programming formulation is given in the appendix.

3.3.3 Valid workload filtering

The linear program is provided with all possible workloads, and it then provides the different tw_j^f for each workload. However, not all workloads can be present in one execution. In Figure 4, W1 = (T1,Ts3) and W2 = (Ts1,T3) are both possible workloads, but they cannot be in the same execution: if W1 is being executed, it means that T3 is over (since Ts3 is after T3), so W2 cannot appear later since Ts1 and T3 are never parallel. Thus, in order to prevent W1 and W2 from both existing in one execution, we just need to check whether the tasks of the workload can be parallel or not. Two tasks are not parallel if one ends before the beginning of the other. Since we consider workloads, we focus only on the beginning and end times of the workload itself. Let bw_j and ew_j be the start time and the end time of the workload W_j = (T_1, ..., T_i, ..., T_n). They are such that:

bw_j ≥ bt_i    (11)
ew_j ≤ et_i    (12)

Note that although the beginning and the end of the workload are not exactly defined, this definition makes sure that the beginning or the end of a task starts a new workload. Moreover, the complete execution of each task is guaranteed thanks to equations (4) and (8). Figure 6 is an example of a workload that cannot exist. Let us assume the execution represented in Figure 6, and let us focus on the workload W1 = (T1,Ts3).
Let us also assume that, with other frequencies, a possible workload is W2 = (T3,Ts1). As explained above, W1 and W2 cannot both exist in the same execution because of the precedence constraints. It is obvious from the example that T3 and Ts1 are not parallel; let us see how this translates to workloads. Since W2 has to start after both T3 and Ts1 begin, it starts after Ts1 begins (since bts_1 ≥ bt_3 in Figure 6). In the same way, it ends before et_3. But since et_3 ≤ bts_1 (as shown in Figure 6), the duration of W2 would have to be negative, which is not possible. Thus, we identify workloads which cannot be in the execution as workloads which end before they begin. The duration of a workload is such that:

dw_i = 0 if ew_i < bw_i; ew_i − bw_i otherwise    (13)

In the appendix (section 6), we prove that if two workloads cannot be in the same execution (because of the precedence constraints), then the duration of at least one of them is 0 (paragraph 6.4.2).
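The filtering idea of constraints (11)-(13) can be sketched directly: take the latest task start as the workload start, the earliest task end as its end, and clamp a negative span to a zero duration. The task intervals below are illustrative assumptions.

```python
# Sketch of valid-workload filtering: bw_j is at least every task's start
# (constraint (11)), ew_j is at most every task's end (constraint (12)), and
# equation (13) gives impossible workloads a zero duration. The (bt, et)
# intervals are made-up example values: T3 ends before Ts1 begins.
intervals = {"T3": (0.0, 2.0), "Ts1": (3.0, 4.0), "T1": (0.0, 3.5)}

def workload_duration(tasks):
    bw = max(intervals[t][0] for t in tasks)  # tightest bound from (11)
    ew = min(intervals[t][1] for t in tasks)  # tightest bound from (12)
    return 0.0 if ew < bw else ew - bw        # equation (13)

print(workload_duration(["T3", "Ts1"]))  # 0.0: never parallel, filtered out
print(workload_duration(["T1", "Ts1"]))  # 0.5: the tasks overlap
```

In the actual mixed program the bt/et values are themselves variables, so this clamping is encoded with a binary variable rather than an `if`, as the appendix reference explains.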

[Figure 6: Negative workload duration for impossible workloads. The annotations show that bw_2 ≥ bts_1 and bw_2 ≥ bt_3, so the workload must start at the later of the two, while ew_2 ≤ ets_1 and ew_2 ≤ et_3, so it must end at the earlier of the two.]

3.3.4 Discussion

The appendix (section 6) provides a detailed formulation of the energy minimization problem using workloads. The formulation uses two binary variables: one to express the threshold constraint and one to calculate the duration of the workload. With these two variables, the formulation is not linear anymore, which requires more time to solve (especially when the number of workloads is large). Moreover, we tried providing all possible workloads of one of the NAS parallel benchmarks on class C on 16 processes (IS.C.16) on a machine equipped with 16 GB of memory. The application task graph is composed of 630 tasks. The generated data (i.e. the set of workloads) could not fit in the memory of the machine. Thus, even with no binary variables, providing all possible workloads is not possible when considering real applications. In the following section, we provide another formulation which requires only the task graph.

3.4 Architecture constraints: the frequency switch approach

As explained earlier, our goal is to minimize the energy consumption of a parallel application using DVFS. In order to do so, we express the problem as a linear program. We consider that the program is represented as a task graph and that each task can have several phases. The difficulty of the formulation is to provide, for each task, the frequency of each of its phases (tt_i^f), since one has to make sure that parallel tasks run at the same frequency. In this section, we provide another formulation which considers the times at which a new frequency is set on the whole processor, instead of considering tasks independently and then forcing parallel tasks to run at the same frequency.

3.4.1 Frequency switch overhead

Let c_jp^f be the time at which the frequency f is set on the processor p, j being the sequence number of the frequency switch.
Figure 7 represents the execution of four tasks on two cores of the same processor p. In the example, we assume that there are only 3 possible frequencies. The different c_jp^f are numbered such that the minimum frequency f1 corresponds to the switching times c_1p^f1, c_4p^f1, ..., the frequency f2 corresponds to the frequency changes c_2p^f2, c_5p^f2, ..., and so on. A frequency f1 is thus applied during a time which can be calculated as c_(j+1)p^f2 − c_jp^f1.

Table 3: Frequency switch formulation variables

  c_jp^f    Time of the j-th frequency switch on processor p; the frequency f is the one set
  d_ij^f    The amount of time the frequency f is set for the task T_i for the frequency switch j
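The d_ij^f variable of Table 3 is essentially the overlap between a task's span and a frequency interval (made precise by equation (15) in section 3.4.3). A small sketch, with made-up times:

```python
# Sketch of the overlap behind d_ij^f: the time task T_i spends inside the
# frequency interval [c_j, c_{j+1}) is the intersection of [bt_i, et_i] with
# that interval, and zero when they do not intersect. All times are
# illustrative example values.
def d_ij(bt, et, c_j, c_j1):
    if et <= c_j or bt >= c_j1:          # task ends before or starts after
        return 0.0
    return min(et, c_j1) - max(bt, c_j)  # overlapping portion

print(d_ij(0.0, 5.0, 1.0, 3.0))  # 2.0: task covers the whole interval
print(d_ij(0.0, 5.0, 6.0, 8.0))  # 0.0: task ends before the switch
```

Summing these overlaps over all switches j that set frequency f yields tt_i^f, as section 3.4.3 develops.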

[Figure 7: Frequency switches example, showing switch times c_11^f1, c_21^f2, c_31^f3 = c_41^f1 and c_51^f2 over tasks T1, T2, T3, T4 and Ts3 on processor p]

Note that some frequencies may not be set if their duration is zero. In Figure 7, frequency f3 is not set since c_31^f3 = c_41^f1.

3.4.2 Handling frequency switch delay

As explained earlier, changing frequency takes some time. Thus, for a change to be applied, its duration has to be longer than the user-defined threshold Th. Let ζ_jp^f be a binary variable such that:

ζ_jp^f = 0 if c_(j+1)p^f − c_jp^f = 0; 1 otherwise    (14)

The threshold condition can be expressed as:

c_(j+1)p^f − c_jp^f ≥ Th × ζ_jp^f

We detail how equation (14) is translated into mixed binary programming constraints in the appendix.

3.4.3 Shared frequency constraints

Once the threshold condition is satisfied, one can calculate the time a task spends at each frequency, i.e. tt_i^f, according to the c_jp^f. In Figure 7, initially, tasks T1 and T3 run in parallel at frequency f1. The time T3 spends at frequency f1 is c_21^f2 − c_11^f1, whereas T1 is executed twice at f1: it spends (c_21^f2 − c_11^f1) + (et_1 − c_41^f1) at frequency f1. Let d_ij^f be the time the task T_i spends at frequency f after the frequency switch j. Back to Figure 7, d_11^f1 = c_21^f2 − c_11^f1 and d_14^f1 = et_1 − c_41^f1, so tt_1^f1 becomes tt_1^f1 = d_11^f1 + d_14^f1. The above translates to:

tt_i^f = Σ_j d_ij^f

Note that a task is not impacted by a frequency change if it ends before the change or begins after the next change. In other words, d_ij^f1 = 0 if et_i ≤ c_jp^f1 or bt_i ≥ c_(j+1)p^f2. Otherwise, d_ij^f1 can be calculated as min(et_i, c_(j+1)p^f2) − max(bt_i, c_jp^f1):

d_ij^f = 0 if et_i ≤ c_jp^f or bt_i ≥ c_(j+1)p^f; min(et_i, c_(j+1)p^f) − max(bt_i, c_jp^f) otherwise    (15)

3.5 Discussion

The appendix (section 6) provides the complete formulation of the problem using the frequency switch time variables. In addition to the binary variable used to satisfy the frequency switch overhead constraint, five additional binary variables are used for each task and for each frequency switch. Thus, for n tasks and m frequency

switches considered, 5 × n × m binary variables are required. Mixed integer programming is NP-hard [?]; thus, with such a number of binary variables, no solution can be provided. When comparing the workload approach and the frequency switch approach, one can notice that the former needs fewer binary variables and should be able to provide results. However, because all possible workloads have to be provided to the solver, it is just as complex because of the memory required. Thus, if a very large memory is available, the workload solution is the one to be used; and if new, faster binary resolution techniques become available, the frequency switch solution should be used. Several heuristics can be considered in order to reduce the time to solve the problem. First, one can consider iterative applications, solve the problem for only one iteration and then apply the result to the remaining ones. However, this solution strongly depends on the number of tasks per iteration. We tried this solution on some kernels (NAS Parallel Benchmarks [?]) and the solver could not provide any result after several hours. The most promising heuristic is to consider the tasks at the processor level instead of the core level. Then, the only architecture constraint which needs to be considered is the frequency switch overhead. This study is part of our current work and will be discussed in further studies.

4 Related Work

DVFS scheduling has been widely used to improve processor energy consumption during application execution. We focus on studies assuming a set of dependent tasks represented as a directed acyclic graph (DAG). A lot of studies tackle the task mapping problem while minimizing energy consumption, either with respect to task deadlines [?] or by trying to minimize the deadline as well [?]. When considering an already mapped task graph, studies provide the execution speed of each task depending on the frequency model: continuous [?] or discrete [?]. Some studies also provide a set of frequencies to execute a task [?] (executing a task at multiple frequencies is known as VDD-hopping).
In [?], the authors present a complexity study of the energy minimization problem depending on the frequency model (continuous frequencies, discrete frequencies with and without VDD-Hopping). Finally, studies like [?] and [?] consider the frequency transition overhead. Although these studies should provide an optimal frequency schedule, they do not consider the constraints of most current architectures, and more specifically the frequency shared among all cores of the same processor.

Regarding linear programming formulations that minimize application energy consumption, many formulations have been proposed in the past. For a single processor, [?] provides an integer linear programming formulation with negligible frequency switching overhead. The same problem, but considering the frequency transition overhead, was addressed in [?]. The authors also provide a linear-time heuristic algorithm which produces near-optimal solutions. The work presented in [?] is the closest to the work presented in this paper. In [?], the authors present a linear programming formulation of the energy minimization problem where tasks can be executed at several frequencies. Both slack energy and processor energy consumption are considered in the minimization, and a loose deadline is assumed. In a similar way, [?] provides a scheduling algorithm and an integer linear programming formulation of the energy minimization problem on heterogeneous systems with a fixed deadline. The formulation is very close to the one described in [?], but the authors also consider communication energy consumption. However, they do not consider slack time and its power consumption when solving the problem. In [?], the authors use an integer linear programming formulation of the problem where only tasks with slack time are slowed down, whereas other tasks are run at maximal frequency. The program is used to compute the best frequency at which to execute each task.

Although previous studies provide different solutions and formulations for DVFS scheduling, few of them consider current architecture constraints.
While some previous studies consider the frequency transition overhead [?, ?], none of them considers the fact that cores within the same processor run at the same frequency. This paper describes a mixed linear programming formulation that guarantees that parallel tasks on the same processor run at the same frequency. Moreover, it shows that it is possible to relax the deadline if doing so leads to energy savings.

5 Conclusion

The goal of this paper was to provide a study on how the energy minimization problem of a parallel execution of an MPI-like program can be addressed and formulated when considering most current architecture constraints. In order to do so, we used linear programming formulations. Two different formulations were described. Their goal is to minimize the energy consumption with respect to a user-defined deadline by providing the optimal frequency schedule. Both solutions use a number of binary variables which is proportional to the number of tasks. Used as they are, these formulations should provide an optimal solution, but they are costly in terms of memory and resolution time, despite the use of fast parallel solvers like Gurobi [?]. We are currently working on introducing heuristics that relax the architecture constraints by building tasks at the processor level instead of the core level. Using such heuristics seems to drastically reduce the time needed to solve the problem.

6 Appendix

This appendix summarizes the set of constraints of both formulations described in paragraphs 3.3 and 3.4. We start by describing how each non-linear constraint which appears in Sections 3.3 and 3.4 is expressed. For a more complete description and explanation, the reader can refer to [?].

6.1 Expressing non-linear constraints

Section 3 presents different non-continuous variables (definitions (10), (13), (14) and (15)). In this section, we briefly explain how this kind of expression translates into inequalities using binary variables.

1. If-then statement with 0-1 variables: Expressing conditions like:

$$\bar{x} = \begin{cases} 0 & \text{if } x = 0 \\ 1 & \text{otherwise} \end{cases}$$

(for instance, definition (10)) requires the use of a large constant $M$ such that:

$$x \leq M \cdot \bar{x} \qquad (16)$$
$$\epsilon \cdot \bar{x} \leq x \qquad (17)$$

Thus, when $x = 0$, (17) forces $\bar{x}$ to be equal to 0, and when $x \neq 0$, (16) is used to set the value of $\bar{x}$ to 1. Note that equation (9), which guarantees that $tw^f \geq Th \cdot \overline{tw}^f$, makes (17) useless (since $Th > \epsilon$). Thus, (17) is never used in the set of constraints.

2.
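The effect of the pair (16)-(17) can be checked numerically. The sketch below (an illustration, with $M$ and $\epsilon$ chosen arbitrarily rather than taken from the paper) enumerates both values of the indicator and keeps those satisfying both inequalities:

```python
# Numerical check of the big-M linearization (16)-(17): for a value x,
# the only indicator xbar in {0, 1} satisfying both x <= M*xbar and
# eps*xbar <= x is 1 when x > 0 and 0 when x == 0.
# M and EPS are illustrative constants, not taken from the paper.

M, EPS = 1e4, 1e-6

def feasible_indicators(x):
    return [xbar for xbar in (0, 1)
            if x <= M * xbar and EPS * xbar <= x]
```

For $x = 0$ only $\bar{x} = 0$ survives, and for any $x \geq \epsilon$ only $\bar{x} = 1$ does, which is exactly the intended if-then behaviour.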
If-then statement with real variables: Expressing formulas like:

$$z = \begin{cases} 0 & \text{if } y - x \leq 0 \\ y - x & \text{otherwise} \end{cases}$$

(definition (13) for instance) is similar to the previous formulation in the sense that it requires the use of a big constant $M$. A binary variable $bin$ is used such that when $y - x \leq 0$, $bin = 0$:

$$y - x \leq M \cdot bin \qquad (18)$$
$$x - y \leq M \cdot (1 - bin) \qquad (19)$$

Thus, when $y \leq x$, (18) is always valid regardless of the value of $bin$; hence (19) forces $bin$ to be equal to 0. Similarly, when $y > x$, equation (18) forces $bin$ to 1.

Once $bin$ is defined, $z$ can be expressed as:

$$y - x \leq z \leq M \cdot bin \qquad (20)$$
$$y - x + z \leq 2 \cdot (y - x) + M \cdot (1 - bin) \qquad (21)$$

Thus, when $y \leq x$, $bin = 0$ (from (19)) and (20) forces $z$ to be 0 (since all variables are positive), while (21) is always valid. Similarly, when $y > x$, $bin = 1$ (from (18)) and (20) and (21) become:

$$y - x \leq z \leq M$$
$$z \leq y - x$$

Thus $y - x \leq z \leq y - x$, which makes $z = y - x$.

3. Maximums: Maximums can be expressed by reformulating the definition as:

$$z = \max(x, y) = x + \begin{cases} 0 & \text{if } x \geq y \\ y - x & \text{otherwise} \end{cases}$$

Let $w$ be such that:

$$w = \begin{cases} 0 & \text{if } x \geq y \\ y - x & \text{otherwise} \end{cases}$$

We can express $w$ by using (20) and (21).

4. Minimums: Expressing minimums is based on the same idea as expressing maximums:

$$z = \min(x, y) = x - \begin{cases} 0 & \text{if } x \leq y \\ x - y & \text{otherwise} \end{cases}$$

We do not detail how minimums are expressed, since it is done the same way as maximums.

5. Expressing several conditions: In definitions like (15), several conditions can force the value of a variable:

$$w = \begin{cases} 0 & \text{if } x - y \geq 0 \text{ or } z - u \geq 0 \\ \ldots & \text{otherwise} \end{cases}$$

Translating such definitions into inequalities requires the use of one binary variable for each condition and one binary variable to express the or. Let $bin_1$ and $bin_2$ be such that:

$$bin_1 = \begin{cases} 1 & \text{if } x - y \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad bin_2 = \begin{cases} 1 & \text{if } z - u \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

These two definitions can be expressed using (16) and (17). Finally, $bin_3$ is a binary variable which is equal to 1 if $bin_1$ or $bin_2$ is equal to 1, and 0 otherwise:

$$bin_3 = \begin{cases} 1 & \text{if } bin_1 + bin_2 \geq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$

Since $bin_1$, $bin_2$ and $bin_3$ are binary variables, (22) can easily be expressed as:

$$bin_1 \leq bin_3 \qquad (23)$$
$$bin_2 \leq bin_3 \qquad (24)$$
$$bin_3 \leq bin_1 + bin_2 \qquad (25)$$

Thus, when $bin_1$ and $bin_2$ are 0, (25) forces $bin_3$ to be 0, whereas when $bin_1$ or $bin_2$ is equal to 1, (23) and (24) force $bin_3$ to be equal to 1.
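The claim that (18)-(21) pin $z$ to exactly $\max(0, y - x)$ can be checked by brute force. The sketch below is illustrative (the function name and the value of $M$ are not from the paper): it enumerates both values of $bin$, discards those violating (18)-(19), and intersects the bounds of (20)-(21):

```python
# Brute-force check of (18)-(21): for any (y, x), every feasible value
# of bin leaves a single feasible point for z, namely max(0, y - x).
# M is an illustrative constant, not taken from the paper.

M = 1e6

def forced_z(y, x):
    pinned = set()
    for b in (0, 1):
        if y - x > M * b or x - y > M * (1 - b):   # (18), (19)
            continue
        lo = max(0.0, y - x)                       # left part of (20), z >= 0
        hi = min(M * b, (y - x) + M * (1 - b))     # right part of (20), (21)
        if lo <= hi:
            pinned.add((lo, hi))
    # Every feasible bin leaves the single point z = max(0, y - x).
    assert pinned == {(max(0.0, y - x),) * 2}
    return max(0.0, y - x)
```

This is why no explicit equality constraint is needed: the chained bounds collapse the feasible interval to one point in both branches.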

6.2 Objective function

Minimizing the energy consumption of a program described as a set of tasks is the objective function of the linear programming formulations described above. For a task $T_i$ with power consumption $P^f_i$ at frequency $f$, executed at frequency $f$ during $tt^f_i$, the energy consumption of the whole program over its whole execution time is minimized as:

$$\min \left( \sum_{T_i} \sum_f tt^f_i \cdot P^f_i \right)$$

6.3 Task constraints

Let $T_i$, $T_{i+1}$, $T_{i+2}$, $T_j$ be four tasks such that $T_i$, $T_{i+1}$, $T_{i+2}$ are consecutive and on the same processor. $T_i$ ends with a message sending creating $T_{i+1}$, which ends with a reception from $T_j$ which generates $T_{i+2}$, as shown in Figure 8.

Figure 8: Task configuration

$$et_i = bt_i + \sum_f tt^f_i$$
$$\sum_f \delta^f_i = 1$$
$$tt^f_i = \delta^f_i \cdot exect^f_i$$
$$bt_{i+1} = et_i$$
$$bts_{i+1} = et_{i+1}$$
$$ets_{i+1} \geq et_j$$
$$ets_{i+1} \geq bts_{i+1}$$
$$bt_{i+2} = ets_{i+1}$$

6.4 Workload approach

6.4.1 Additional variables

$\gamma_i$: A binary variable used to say whether a workload duration is 0 or not

$M$: A large constant

$$bw_i \geq bt_j, \quad ew_i \leq et_j \quad \forall T_j \in W_i$$
$$dw_i = \sum_f tw^f_i$$
$$tt^f_j = \sum_{W_i \mid T_j \in W_i} tw^f_i$$
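The objective of Section 6.2 is a plain double sum and can be sketched directly. The helper below is illustrative (its name and the nested-list layout of the data are not from the paper):

```python
# Sketch of the objective of Section 6.2: total energy is the sum over
# tasks and frequencies of (time spent at f) * (power drawn at f).
# The data layout is illustrative, not taken from the paper.

def total_energy(tt, power):
    """tt[i][f]: time task i spends at frequency index f;
    power[i][f]: power task i draws at frequency index f."""
    return sum(t * p
               for tt_i, p_i in zip(tt, power)
               for t, p in zip(tt_i, p_i))
```

In the linear program this quantity is linear because the $P^f_i$ are constants; only the $tt^f_i$ are decision variables.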

Using (16), (17) and (9), we express definition (10) as:

$$Th \cdot \overline{tw}^f \leq tw^f \leq M \cdot \overline{tw}^f$$

Using (16), (17), (20) and (21), with $\gamma_i$ as the binary variable, we express definition (13) as:

$$ew_i - bw_i \leq M \cdot \gamma_i$$
$$bw_i - ew_i \leq M \cdot (1 - \gamma_i), \quad \gamma_i \in \{0, 1\}$$
$$ew_i - bw_i \leq dw_i \leq M \cdot \gamma_i$$
$$ew_i - bw_i + dw_i \leq 2 \cdot (ew_i - bw_i) + M \cdot (1 - \gamma_i)$$

6.4.2 Proof of workload duration

We want to prove that if two workloads $W$ and $W'$ are possible but violate the precedence constraint between the tasks, then the duration of at least one of them is zero. We provide the proof for workloads with a cardinality equal to 2, since the proof remains the same for larger workloads. Let $W = (T_i, T_{j'})$ and $W' = (T_{i'}, T_j)$ such that $T_i$ precedes $T_{i'}$ and $T_j$ precedes $T_{j'}$. We want to prove that $dw = 0$ or $dw' = 0$.

Lemma 1. Let $W = (T_i, T_{j'})$ and $W' = (T_{i'}, T_j)$. If $bt_{i'} \geq et_i$ and $bt_{j'} \geq et_j$, then $dw = 0$ or $dw' = 0$.

Proof. Let us prove Lemma 1 by contradiction. Let us assume that $dw \neq 0$ and $dw' \neq 0$. From definition (10):

$$dw \neq 0 \Rightarrow ew > bw \qquad \text{and} \qquad dw' \neq 0 \Rightarrow ew' > bw'$$

From constraints (11) and (12):

$$bw \geq bt_i, \quad bw \geq bt_{j'}, \quad ew \leq et_i, \quad ew \leq et_{j'} \qquad (26)$$

and

$$bw' \geq bt_{i'}, \quad bw' \geq bt_j, \quad ew' \leq et_{i'}, \quad ew' \leq et_j$$

But $bt_{i'} \geq et_i$ and $bt_{j'} \geq et_j$, thus:

$$bw \geq bt_{j'} \geq et_j \geq ew' \qquad (27)$$
$$bw' \geq bt_{i'} \geq et_i \geq ew \qquad (28)$$

If we consider (27), (28) and the assumption $ew' > bw'$:

$$bw \geq ew' > bw' \geq ew$$

Thus $bw \geq ew$, which by definition (10) implies that $dw = 0$, which leads to a contradiction.

6.5 Frequency switch approach

Note that we do not detail how the threshold condition is handled, since it is done the same way as for the workloads.
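Lemma 1 can also be illustrated numerically: with $T_i$ preceding $T_{i'}$ and $T_j$ preceding $T_{j'}$, the two cross workloads can never both have a positive overlap. The sketch below generates random instances respecting the precedence hypotheses (the generator and interval encoding are illustrative, not from the paper):

```python
# Numerical illustration of Lemma 1: if T_i precedes T_i' and T_j
# precedes T_j', the cross workloads W = (T_i, T_j') and W' = (T_i', T_j)
# cannot both have a positive duration. Random instances are
# illustrative, not taken from the paper.
import random

random.seed(0)

def overlap(a, b):
    """Duration of the intersection of intervals a = (bt, et) and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def check_once():
    ti = (0.0, random.uniform(1, 5))
    ti2 = (ti[1] + random.uniform(0, 2), ti[1] + random.uniform(2, 6))  # T_i'
    tj = (random.uniform(0, 3), random.uniform(3, 6))
    tj2 = (tj[1] + random.uniform(0, 2), tj[1] + random.uniform(2, 6))  # T_j'
    dw = overlap(ti, tj2)   # W  = (T_i, T_j')
    dwp = overlap(ti2, tj)  # W' = (T_i', T_j)
    return dw == 0.0 or dwp == 0.0
```

The argument mirrors the proof: both overlaps positive would force $et_i > et_j$ and $et_j > et_i$ simultaneously.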

6.5.1 Additional variables

$\zeta^f_{jp}$: A binary variable used to say whether the time between frequency switches $j$ and $j+1$ on processor $p$ is non-zero

$y^f_{ij}$: The maximum between $bt_i$ and $c^f_{jp}$

$w^f_{ij}$: A variable used to express $y^f_{ij}$. It is equal to 0 if $bt_i$ is the maximum, and $c^f_{jp} - bt_i$ otherwise

$\alpha^f_{ij}$: A binary variable used to verify whether $bt_i \leq c^f_{jp}$

$z^f_{ij}$: The minimum between $et_i$ and $c^f_{(j+1)p}$

$g^f_{ij}$: A variable used to express $z^f_{ij}$. It is equal to 0 if $et_i$ is the minimum, and $et_i - c^f_{(j+1)p}$ otherwise

$\beta^f_{ij}$: A binary variable used to verify whether $et_i \geq c^f_{(j+1)p}$

$\psi^f_{ij}$: A binary variable used to check whether $bt_i - c^f_{(j+1)p} \geq 0$

$\phi^f_{ij}$: A binary variable used to check whether $et_i - c^f_{jp} \leq 0$

$\rho^f_{ij}$: A binary variable used to check whether $\psi^f_{ij}$ or $\phi^f_{ij}$ is true

$M$: A large constant

6.5.2 Constraints

$$c^f_{(j+1)p} \geq c^f_{jp}$$
$$c^f_{(j+1)p} - c^f_{jp} \geq Th \cdot \zeta^f_{jp}$$
$$c^f_{(j+1)p} - c^f_{jp} \leq M \cdot \zeta^f_{jp}$$
$$tt^f_i = \sum_j d^f_{ij}$$

Expressing definition (15) as inequalities requires the use of (20) and (21) for the maximum and the minimum, such that:

$$y^f_{ij} = \max(bt_i, c^f_{jp}) = bt_i + w^f_{ij} \quad \text{with} \quad w^f_{ij} = \begin{cases} 0 & \text{if } bt_i \text{ is the maximum} \\ c^f_{jp} - bt_i & \text{otherwise} \end{cases}$$

$$z^f_{ij} = \min(et_i, c^f_{(j+1)p}) = et_i - g^f_{ij} \quad \text{with} \quad g^f_{ij} = \begin{cases} 0 & \text{if } et_i \text{ is the minimum} \\ et_i - c^f_{(j+1)p} & \text{otherwise} \end{cases}$$

Let $\alpha^f_{ij}$ be the binary variable used for the maximum and $\beta^f_{ij}$ the one used for the minimum. By replacing the corresponding variables in (20) and (21), we obtain the following inequalities for the maximum:

$$c^f_{jp} - bt_i \leq M \cdot \alpha^f_{ij}$$
$$bt_i - c^f_{jp} \leq M \cdot (1 - \alpha^f_{ij}), \quad \alpha^f_{ij} \in \{0, 1\}$$
$$c^f_{jp} - bt_i \leq w^f_{ij} \leq M \cdot \alpha^f_{ij}$$
$$c^f_{jp} - bt_i + w^f_{ij} \leq 2 \cdot (c^f_{jp} - bt_i) + M \cdot (1 - \alpha^f_{ij})$$

and the following for the minimum:

$$et_i - c^f_{(j+1)p} \leq M \cdot \beta^f_{ij}$$
$$c^f_{(j+1)p} - et_i \leq M \cdot (1 - \beta^f_{ij}), \quad \beta^f_{ij} \in \{0, 1\}$$
$$et_i - c^f_{(j+1)p} \leq g^f_{ij} \leq M \cdot \beta^f_{ij}$$
$$et_i - c^f_{(j+1)p} + g^f_{ij} \leq 2 \cdot (et_i - c^f_{(j+1)p}) + M \cdot (1 - \beta^f_{ij})$$

Finally, using (23), (24) and (25) with the binary variables $\psi^f_{ij}$, $\phi^f_{ij}$ and $\rho^f_{ij}$ as $bin_1$, $bin_2$ and $bin_3$ respectively, and using (20) and (21), $d^f_{ij}$ can be expressed as:

$$\phi^f_{ij} \leq \rho^f_{ij}$$
$$\psi^f_{ij} \leq \rho^f_{ij}$$
$$\rho^f_{ij} \leq \phi^f_{ij} + \psi^f_{ij}$$
$$z^f_{ij} - y^f_{ij} \leq d^f_{ij} \leq M \cdot (1 - \rho^f_{ij})$$
$$z^f_{ij} - y^f_{ij} + d^f_{ij} \leq 2 \cdot (z^f_{ij} - y^f_{ij}) + M \cdot \rho^f_{ij}$$
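The combined effect of these constraints can be checked against definition (15) directly. The sketch below is illustrative (names and constants are not from the paper): it sets the indicators as their definitions dictate, intersects the bounds on $d^f_{ij}$, and verifies that a single value survives.

```python
# Brute-force check that the constraints above pin d to the value of
# definition (15): d = 0 if the task misses the interval, otherwise
# min(et, c_next) - max(bt, c). M is an illustrative constant.

M = 1e6

def d_from_constraints(bt, et, c, c_next):
    # Indicators, set as the definitions in 6.5.1 dictate.
    psi = 1 if bt - c_next >= 0 else 0   # begins after next switch
    phi = 1 if c - et >= 0 else 0        # ends before this switch
    rho = 1 if (psi or phi) else 0       # (23)-(25)
    y = max(bt, c)                       # y = bt + w
    z = min(et, c_next)                  # z = et - g
    # Feasible interval for d, from z - y <= d <= M*(1 - rho) and
    # d <= (z - y) + M*rho, with d >= 0:
    lo = max(0.0, z - y)
    hi = min(M * (1 - rho), (z - y) + M * rho)
    assert lo == hi                      # the constraints pin d
    return lo
```

When $\rho = 1$ the upper bound collapses to 0, and when $\rho = 0$ both bounds equal $z^f_{ij} - y^f_{ij}$, matching the two branches of (15).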