Generalizing Gustafson's Law for Heterogeneous Processors

André A. Cesta, Geraldo M. Silva
Eldorado Research Institute
Brasilia - Campinas - Porto Alegre, Brazil
{andre.cesta,geraldo.magela}@eldorado.org.br

Abstract. In this article we show how Gustafson's Law can be generalized and extended from a function of the number of processors p to a function of intrinsic microprocessor variables such as clock speed c and hardware threads t. This allows performance or benchmark results that follow Gustafson's Law to be modeled for heterogeneous processors. Possible applications of this theoretical work include all previous applications of Gustafson's Law, plus some new ones: hardware upgrade or migration scenario analysis; optimal hardware selection for particular workloads; and sizing or recommendations for heterogeneous machines.

1. Introduction

In [Cesta et al. 2011] and [Cesta et al. 2012], the authors showed how computer processing scalability laws such as Amdahl's Law [Amdahl 1967] or Gunther's Universal Scalability Law (USL) [Gunther 2010] could be generalized to the case of heterogeneous processors by transforming them from functions of the number of processors into functions of intrinsic processor characteristics. This paper demonstrates how Gustafson's Law [Gustafson 1988], another well-known scalability model, can also be generalized to the heterogeneous processor scenario. In this generalization, Gustafson's Law is likewise transformed from a function of the number of processors p, as in Eq. 1 below, to a function of processor-intrinsic variables such as hardware threads t and clock speed c. Our work here is theoretical in nature.

S(p) = p - α·(p - 1)    (1)
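To make Eq. 1 concrete, and to anticipate the contrast with Amdahl's Law drawn below, the following minimal Python sketch (ours, for illustration) evaluates both laws for a growing processor count; the Amdahl formula used is the standard fixed-size law, which this paper cites but does not restate.

    # Gustafson's Law (Eq. 1) versus Amdahl's fixed-size law, both with
    # serial fraction alpha measured in the one-processor baseline.

    def gustafson(p: int, alpha: float) -> float:
        return p - alpha * (p - 1)                  # grows without bound

    def amdahl(p: int, alpha: float) -> float:
        return 1.0 / (alpha + (1.0 - alpha) / p)    # capped at 1/alpha

    if __name__ == "__main__":
        for p in (1, 10, 100, 1000):
            print(f"p={p:4d}  Gustafson={gustafson(p, 0.11):8.2f}  "
                  f"Amdahl={amdahl(p, 0.11):5.2f}")

With α = 0.11, Gustafson's speedup keeps growing (roughly 0.89·p for large p), while Amdahl's saturates near 1/0.11 ≈ 9.1; Eq. 2 below formalizes the former behavior.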

lim p→∞ S(p) = ∞, for α ∈ (0, 1)    (2)

Notice in Eq. 2 that, for α strictly between 0 and 1, Gustafson's Law from Eq. 1 scales to infinity as more and more processors are added. This is in contrast to Amdahl's Law, which does not scale to infinity. Amdahl's Law, USL and Gustafson's Law are all models capable of predicting throughput or scalability when homogeneous hardware processor counts change, given a certain workload type. Depending on the workload type, and on how the software scales, certain laws may explain the scalability phenomena better than others. Typically, in CPU-bound situations, one has to assess whether the scalability of a given workload is better modeled by Gustafson's Law or by Amdahl's Law. While Gustafson's Law shows that some computations involving arbitrarily large data sets can be efficiently parallelized and scaled, Amdahl's Law and USL provide a counterpoint, describing limits on the speed-up that parallelization provides.

The speedup S(p), or scalability, according to Gustafson is given by:

S(p) = pt(p)/pt(1) = (ts + p·tp)/(ts + tp)    (3)

that is, the ratio of the program size with p processors to the program size with one processor. Notice that if ts is relatively small, doubling the number of processors p from one to two allows close to double the number of parallel tasks tp to fit in the same time frame, keeping the same time ts for all serial tasks (a short numeric sketch of this effect closes this section).

In the Methods section, we generalize Gustafson's Law to the heterogeneous processor case. In the Discussion section, we submit the demonstrated equation to a sensitivity analysis involving variables such as threads and clock speed, and issue some cautionary notes about applicability. Finally, in the Conclusion and Future Work section, we summarize the results and list what we believe would fit as future work in this line of research.
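As promised above, here is a minimal numeric sketch (ours, for illustration) of Eq. 3, using hypothetical task counts ts and tp for the one-processor baseline:

    # Eq. 3: scaled program size with p processors, relative to p = 1.
    # ts = serial task count, tp = parallelizable task count at p = 1.

    def scaled_speedup(p: int, ts: float, tp: float) -> float:
        return (ts + p * tp) / (ts + tp)

    if __name__ == "__main__":
        ts, tp = 5.0, 95.0       # ts relatively small: 5% serial work
        for p in (1, 2, 4):
            print(f"p={p}: S={scaled_speedup(p, ts, tp):5.2f}")
        # Doubling p from 1 to 2 gives S = 1.95: nearly double the
        # parallel tasks fit in the same time frame when ts is small.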

2. Methods

We define the total program size in tasks pt as:

pt = ts + tp    (4)

Read pt as program tasks, ts as the number of serial tasks, and tp as the number of parallelizable tasks. In the original demonstration of Gustafson's Law [Gustafson 1988], the program size function pt, when scaled to more processors p, is:

pt(p) = ts + p·tp    (5)

Defining α, the fraction of serial tasks in a one-processor situation, as α = ts/(ts + tp), and β, the fraction of parallel tasks in a one-processor situation, as β = tp/(ts + tp), with α + β = 1, we have Gustafson's Law as a function of the number of processors and the serial fraction α:

S(p) = (ts + p·tp)/(ts + tp) = ts/(ts + tp) + p·tp/(ts + tp) = α + p·(1 - α) = α + p - p·α = p - α·(p - 1)    (6)

We now evolve, or generalize, this formulation to a heterogeneous machine scenario. Considering hardware threads t in the role of the processors p, and considering the effect of an increased clock c in scaling all processor tasks, we can reformulate Eq. 5 as:

pt(t, c) = c·ts + c·t·tp    (7)

This means that if we double the clock speed, the number of processor tasks executed doubles as well; the assumption is that the processor can perform approximately twice as many serial tasks and parallel tasks. Some additional assumptions are required for the model above, and for our model:

1. even parallelism, the capacity of the software to allocate the parallelizable part of a workload evenly to all processors [Sun and Ni 1993];
2. a scalable multiprocessor;
3. homogeneous multi-processors in a machine;
4. the nominal clock (not a turbo-boosting clock) being used in the equation.

The speed-up by Gustafson for a heterogeneous machine then becomes a multivariate model based on threads t and clock speed c:

S(t, c) = pt(t, c)/pt(t = 1, c = 1) = (c·ts + c·t·tp)/(ts + tp)    (8)

Calling ts/(ts + tp) = α and tp/(ts + tp) = 1 - α, and continuing the development from Eq. 8, we have:

S(t, c) = c·α + c·t·(1 - α) = c·(α + t - t·α) = c·(α·(1 - t) + t) = c·(t - α·(t - 1))    (9)
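As a quick check of Eq. 9, the following minimal Python sketch (ours, for illustration) evaluates the heterogeneous speedup and confirms that, with the clock held at the baseline (c = 1), it reduces to Gustafson's Law of Eq. 6 with threads t playing the role of p:

    # Eq. 9: S(t, c) = c * (t - alpha*(t - 1)); c is the clock relative
    # to the baseline machine, t the hardware thread count.

    def hetero_speedup(t: float, c: float, alpha: float) -> float:
        return c * (t - alpha * (t - 1))

    if __name__ == "__main__":
        alpha = 0.11
        print(hetero_speedup(4, 1.0, alpha))   # 3.67, same as Eq. 6 with p = 4
        print(hetero_speedup(4, 2.0, alpha))   # 7.34, same machine, clock doubled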

The above demonstration is based on the assumption that increased clock speed will speed up every program time component, which is more likely to be true for processor-intensive programs and benchmarks. Taking this into consideration, we can decompose the program time pt with the following algebra [Cesta et al. 2012]:

pt(1) = tsi + tpi + tse + tpe    (10)

The program tasks pt are broken down into: 1. tse, the program parts or tasks that are serial and perform external resource access; 2. tsi, the serial program tasks that perform internal work with no external resource access; 3. tpe, the parallelizable tasks related to external resource access; 4. tpi, the parallelizable tasks that perform internal work with no external resource access.

The program tasks pt with variable hardware threads t, clock speed c, and external resource access speed es, scaled according to Gustafson, can be expressed as:

pt(t, c, es) = c·tsi + c·t·tpi + es·tse + es·t·tpe    (11)

Unlike the variables c and t, which map directly to nominal values for processor clock and threads, the variable es will typically be influenced by hard-to-model interactions between various factors. Our tse and tpe program time components will also not map exactly to all the time the program spends relying on an external resource: because tse and tpe are defined in terms of es in the equation, they capture just a fraction of the total time spent relying on an external resource.

Continuing the demonstration from Eq. 11, we can redefine the speed-up by Gustafson for a heterogeneous machine, previously presented in Eq. 9, as a new formulation that, unlike Eq. 9, does not assume that clock speed will speed up every program task:

S(t, c, es) = pt(t, c, es)/pt(1, 1, 1) = (c·tsi + c·t·tpi + es·tse + es·t·tpe)/(tsi + tpi + tse + tpe)    (12)

One could already use Eq. 12 as a model and fit four unknowns: tsi, tpi, tse and tpe. In order to simplify equation fitting, we reduce the number of unknowns from four to three: the overall serial fraction α = (tsi + tse)/pt(1), as before; the internal fraction i = (tsi + tpi)/pt(1); and the parallelizable share of the internal tasks, πi = tpi/(tsi + tpi). The remaining algebraic expressions will not be shown, on account of the paper scope. After those simplifications we obtain our final model and generalization of Gustafson's Law to heterogeneous machines:

S(t, c, es) = i·(c - es)·(1 + πi·(t - 1)) + es·(t - α·(t - 1))    (13)

3. Discussion

The theoretical models presented here have not yet been numerically validated, as was done for the analogous Amdahl's Law models in [Cesta et al. 2011] and [Cesta et al. 2012]. We expect percent errors similar to those verified for our Amdahl's Law generalization to be obtained with the Gustafson's Law variant presented in this article. Below we present a sensitivity analysis for Eq. 13, giving the general shape of the performance response surface for a system following Gustafson's Law when its thread count and clock speed vary.
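As an illustrative companion to the sensitivity analysis, the following Python sketch (ours, not part of the original study) tabulates the Eq. 13 surface over a grid of thread counts and clock speeds, using the same configuration as Figure 1 (i = 0.9, πi = 0.88, α = 0.11, es = 1); the numbers are only as good as the closed form of Eq. 13 given above.

    # Eq. 13 response surface, Figure 1 configuration:
    # i = 0.9, pi_i = 0.88, alpha = 0.11, es = 1.

    def speedup(t: float, c: float, es: float = 1.0, i: float = 0.9,
                pi_i: float = 0.88, alpha: float = 0.11) -> float:
        return i * (c - es) * (1 + pi_i * (t - 1)) + es * (t - alpha * (t - 1))

    if __name__ == "__main__":
        print("        c=1.0   c=2.0   c=3.0")
        for t in (1, 2, 4, 8):
            row = "  ".join(f"{speedup(t, c):6.2f}" for c in (1.0, 2.0, 3.0))
            print(f"t={t}:  {row}")
        # For this configuration, doubling the clock, S(1, 2) = 1.90,
        # edges out doubling the threads, S(2, 1) = 1.89.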

[Figure 1 shows a 3D response surface with Threads (0 to 8) and Clock speed (0.0 to 3.0 GHz) on the horizontal axes.]

Figure 1. Sensitivity analysis of Gustafson's speedup (Eq. 13) as a function of threads and clock speed, for a configuration of i = 0.9, πi = 0.88, α = 0.11, and the es variable set to 1.

It can be verified on the response surface from Figure 1 that doubling clock speed is better than doubling threads for our configuration. Additionally, according to queueing theory, a system with half the service time (i.e., one that is twice as fast) also performs better with respect to response times and throughput than a system with twice as many queue servers: at equal load, for example, an M/M/1 queue whose single server works at rate 2μ yields a lower mean response time than an M/M/2 queue with two servers of rate μ each. This can be easily verified through simulation.

4. Conclusion and future work

In this article we have demonstrated that Gustafson's Law can be generalized from a function of the number of processors p to a function of hardware threads t and clock speed c. We have also performed a sensitivity analysis to better understand the response surface of Gustafson's scalability as a function of threads and clock speed. Future work in this area should focus on: 1. validating our models empirically; 2. extending the models with more variables, such as the amount of processor cache.

References

Amdahl, G. M. (1967) "Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities", in Proceedings of AFIPS, Atlantic City, NJ, AFIPS Press, pp. 483-485.

Cesta, A., Takara, A. and Moscheto, D. (2011) "Leveraging Diverse Regression Approaches and Heterogeneous Machine Data in the Modeling of Computer Systems Performance", in Proceedings of MSV, Las Vegas, Nevada, USA, pp. 201-207.

Cesta, A., Silva, G. and Storch, M. (2012) "Performance Prediction for Processors and External Resources", in Proceedings of CMG, Las Vegas, Nevada, USA.

Gunther, N. (2010) Guerrilla Capacity Planning: A Tactical Approach to Planning for Highly Scalable Applications and Services, Springer.

Gustafson, J. L. (1988) "Reevaluating Amdahl's Law", Communications of the ACM, vol. 31, no. 5, pp. 532-533.

Sun, X. and Ni, L. (1993) "Scalable Problems and Memory-Bounded Speedup", Journal of Parallel and Distributed Computing.