Parallel Performance Theory - 1

Parallel Performance Theory - 1
Parallel Computing CIS 410/510
Department of Computer and Information Science

Outline
- Performance scalability
- Analytical performance measures
- Amdahl's Law and the Gustafson-Barsis Law

What is Performance?
- In computing, performance is defined by two factors:
  - Computational requirements (what needs to be done)
  - Computing resources (what it costs to do it)
- Computational problems translate into requirements
- Computing resources interplay and trade off:
    Performance ~ 1 / Resources for solution
  where the resources are ultimately hardware, time, energy, and money

Why do we care about Performance?
- Performance itself is a measure of how well the computational requirements can be satisfied
- We evaluate performance to understand the relationships between requirements and resources
  - Decide how to change solutions to target objectives
- Performance measures reflect decisions about how and how well solutions are able to satisfy the computational requirements

  "The most constant difficulty in contriving the engine has arisen from the desire to reduce the time in which the calculations were executed to the shortest which is possible." (Charles Babbage, 1791-1871)

What is Parallel Performance?
- Here we are concerned with performance issues when using a parallel computing environment
  - Performance with respect to parallel computation
- Performance is the raison d'être for parallelism
  - Parallel performance versus sequential performance
  - If the performance is not better, parallelism is not necessary
- Parallel processing includes the techniques and technologies necessary to compute in parallel
  - Hardware, networks, operating systems, parallel libraries, languages, compilers, algorithms, tools, ...
- Parallelism must deliver performance
  - How? How well?

Performance Expectation (Loss)
- If each processor is rated at k MFLOPS and there are p processors, should we see k*p MFLOPS performance?
- If it takes 100 seconds on 1 processor, shouldn't it take 10 seconds on 10 processors?
- Several causes affect performance
  - Each must be understood separately
  - But they interact with each other in complex ways
  - The solution to one problem may create another
  - One problem may mask another
- Scaling (system, problem size) can change conditions
- Need to understand the performance space

Embarrassingly Parallel Computations
- An embarrassingly parallel computation is one that can be obviously divided into completely independent parts that can be executed simultaneously
  - In a truly embarrassingly parallel computation there is no interaction between separate processes
  - In a nearly embarrassingly parallel computation results must be distributed and collected/combined in some way
- Embarrassingly parallel computations have the potential to achieve maximal speedup on parallel platforms
  - If it takes T time sequentially, there is the potential to achieve T/P time running in parallel with P processors
  - What would cause this not to be the case always?
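As a concrete illustration (editor-added, not from the slides), here is a minimal Python sketch of a nearly embarrassingly parallel computation: independent work items are farmed out to a process pool, and only the final collection step requires any coordination. The work function and input sizes are made up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def work(x):
    # Independent work item: no communication with any other task.
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [100_000] * 32                        # 32 identical, independent tasks
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(work, inputs))     # distribute, then collect
    print(sum(results))                            # the only "combine" step
```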

Scalability
- A program can scale up to use many processors
  - What does that mean?
- How do you evaluate scalability?
- How do you evaluate scalability goodness?
- Comparative evaluation
  - If we double the number of processors, what should we expect?
  - Is scalability linear?
- Use a parallel efficiency measure
  - Is efficiency retained as the problem size increases?
- Apply performance metrics

Performance and Scalability
- Evaluation
  - Sequential runtime (T_seq) is a function of problem size and architecture
  - Parallel runtime (T_par) is a function of problem size, parallel architecture, and the number of processors used in the execution
  - Parallel performance is affected by algorithm + architecture
- Scalability
  - The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem

Performance Metrics and Formulas
- T_1 is the execution time on a single processor
- T_p is the execution time on a p-processor system
- S(p) (S_p) is the speedup:      S(p) = T_1 / T_p
- E(p) (E_p) is the efficiency:   Efficiency = S_p / p
- Cost(p) (C_p) is the cost:      Cost = p * T_p
- A parallel algorithm is cost-optimal when
  parallel cost = sequential time (C_p = T_1, E_p = 100%)
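A small Python sketch of these definitions (editor-added; the timing numbers are hypothetical) helps keep the formulas straight:

```python
def speedup(t1, tp):
    """S(p) = T_1 / T_p"""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p"""
    return speedup(t1, tp) / p

def cost(tp, p):
    """C(p) = p * T_p"""
    return p * tp

# Hypothetical measurements: 100 s sequentially, 14 s on 8 processors.
t1, tp, p = 100.0, 14.0, 8
print(speedup(t1, tp))        # ~7.14
print(efficiency(t1, tp, p))  # ~0.89
print(cost(tp, p))            # 112 > T_1, so not cost-optimal
```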

Amdahl's Law (Fixed-Size Speedup)
- Let f be the fraction of a program that is sequential
  - 1-f is the fraction that can be parallelized
- Let T_1 be the execution time on 1 processor
- Let T_p be the execution time on p processors
- S_p is the speedup:
    S_p = T_1 / T_p
        = T_1 / (f*T_1 + (1-f)*T_1/p)
        = 1 / (f + (1-f)/p)
- As p → ∞:  S_p → 1 / f
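An editor-added Python sketch of Amdahl's Law; the serial fraction f = 0.1 is just an illustrative value:

```python
def amdahl_speedup(f, p):
    """Fixed-size speedup with serial fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.10  # assume 10% of the program is inherently sequential
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(f, p), 2))
# Speedup approaches 1/f = 10, no matter how many processors are added.
```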

Amdahl's Law and Scalability
- Scalability
  - The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
- When does Amdahl's Law apply?
  - When the problem size is fixed
  - Strong scaling (as p → ∞, S_p → 1/f)
  - The speedup bound is determined by the degree of sequential execution time in the computation, not the number of processors!!!
  - Uhh, this is not good... Why?
  - Perfect efficiency is hard to achieve
- See the original paper by Amdahl on the webpage

Gustafson-Barsis Law (Scaled Speedup)
- Often interested in larger problems when scaling
  - How big a problem can be run (HPC Linpack)
  - Constrain the problem size by the parallel time
- Assume the parallel time is kept constant
  - T_p = C = (f_seq + f_par) * C
  - f_seq is the fraction of T_p spent in sequential execution
  - f_par is the fraction of T_p spent in parallel execution
- What is the execution time on one processor?
  - Let C = 1; then T_s = f_seq + p*(1 - f_seq) = 1 + (p-1)*f_par
- What is the speedup in this case?
  - S_p = T_s / T_p = T_s / 1 = f_seq + p*(1 - f_seq) = 1 + (p-1)*f_par
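A corresponding editor-added Python sketch of the scaled speedup; the parallel fraction is again an illustrative value:

```python
def gustafson_speedup(f_par, p):
    """Scaled speedup when the parallel fraction of T_p is f_par."""
    return 1.0 + (p - 1) * f_par

f_par = 0.90  # assume 90% of the (constant) parallel runtime is parallel work
for p in (1, 10, 100, 1000):
    print(p, gustafson_speedup(f_par, p))
# Speedup keeps growing with p because the problem size grows with p.
```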

Gustafson-Barsis Law and Scalability
- Scalability
  - The ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
- When does Gustafson's Law apply?
  - When the problem size can increase as the number of processors increases
  - Weak scaling (S_p = 1 + (p-1)*f_par)
  - The speedup function includes the number of processors!!!
  - Can maintain or increase parallel efficiency as the problem scales
- See the original paper by Gustafson on the webpage

Amdahl versus Gustafson-Barsis (figure)

Amdahl versus Gustafson-Barsis (figure)

DAG Model of Computation
- Think of a program as a directed acyclic graph (DAG) of tasks
  - A task cannot execute until all of its inputs are available
  - These come from the outputs of earlier executing tasks
  - The DAG shows the task dependencies explicitly
- Think of the hardware as consisting of workers (processors)
- Consider a greedy scheduler of the DAG tasks onto workers
  - No worker is idle while there are still tasks to execute

Work-Span Model
- T_P = time to run with P workers
- T_1 = work
  - Time for serial execution
  - Execution of all tasks by 1 worker
  - Sum of all work
- T_inf = span
  - Time along the critical path
- Critical path
  - The sequence of task executions (path) through the DAG that takes the longest time to execute
  - Assumes an infinite number of workers is available

Work-Span Example
- Let each task take 1 unit of time
- The DAG at the right has 7 tasks
- T_1 = 7
  - All tasks have to be executed
  - Tasks are executed in a serial order
  - Can they execute in any order?
- T_inf = 5
  - Time along the critical path
  - In this case, it is the longest path length of any task order that maintains the necessary dependencies
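The slide's figure is not reproduced here, so the following editor-added Python sketch uses a hypothetical 7-task DAG (unit-time tasks with a critical path of length 5) chosen to match the slide's numbers. It computes work as the sum of task times and span as the longest dependency chain.

```python
from functools import lru_cache

# task -> list of tasks it depends on (hypothetical DAG, 7 unit-time tasks)
deps = {
    "A": [], "B": ["A"], "C": ["A"], "D": ["B"],
    "E": ["A"], "F": ["D"], "G": ["F"],
}
time = {t: 1 for t in deps}          # each task takes 1 unit of time

work = sum(time.values())            # T_1: total work

@lru_cache(maxsize=None)
def finish(task):
    """Earliest finish time of a task assuming unlimited workers."""
    return time[task] + max((finish(d) for d in deps[task]), default=0)

span = max(finish(t) for t in deps)  # T_inf: critical-path length

print(work, span)                    # 7 5
```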

Lower/Upper Bound on Greedy Scheduling
- Suppose we only have P workers
- We can write a work-span formula to derive a lower bound on T_P:
    max(T_1 / P, T_inf) ≤ T_P
- T_inf is the best possible execution time
- Brent's Lemma derives an upper bound
  - Captures the additional cost of executing the tasks not on the critical path
  - Assumes this can be done without overhead
    T_P ≤ (T_1 - T_inf) / P + T_inf
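An editor-added Python helper that evaluates both bounds; the values plugged in are simply the T_1 = 7, T_inf = 5 numbers from the running example:

```python
def greedy_bounds(t1, tinf, p):
    """Work-span lower bound and Brent's Lemma upper bound on T_P."""
    lower = max(t1 / p, tinf)
    upper = (t1 - tinf) / p + tinf
    return lower, upper

print(greedy_bounds(7, 5, 2))   # (5.0, 6.0) -- matches the 2-processor slide
```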

Consider Brent's Lemma for 2 Processors
- T_1 = 7
- T_inf = 5
- T_2 ≤ (T_1 - T_inf) / P + T_inf
       = (7 - 5) / 2 + 5
       = 6

Amdahl was an optimist! (figure)

Estimating Running Time
- Scalability requires that T_inf be dominated by T_1
    T_P ≈ T_1 / P + T_inf    if T_inf << T_1
- Increasing the work hurts parallel execution proportionately
- The span impacts scalability, even for finite P
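A short editor-added illustration of this estimate, with made-up work and span values, showing how the span caps scaling as P grows:

```python
def estimated_tp(t1, tinf, p):
    """Approximate running time T_P ~= T_1/P + T_inf (valid when T_inf << T_1)."""
    return t1 / p + tinf

t1, tinf = 10_000.0, 10.0    # hypothetical work and span (arbitrary time units)
for p in (1, 10, 100, 1000):
    print(p, estimated_tp(t1, tinf, p))
# Beyond roughly T_1/T_inf = 1000 workers, the span term dominates and speedup flattens.
```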

Parallel Slack
- Sufficient parallelism implies linear speedup: if the parallel slack T_1 / (P * T_inf) is large, then T_P ≈ T_1 / P

Asymptotic Complexity
- The time complexity of an algorithm summarizes how its execution time grows with input size
- Space complexity summarizes how its memory requirements grow with input size
- The standard work-span model considers only computation, not communication or memory
- Asymptotic complexity is a strong indicator of performance on large-enough problem sizes and reveals an algorithm's fundamental limits

Definitions for Asymptotic Notation
- Let T(N) denote the execution time of an algorithm on an input of size N
- Big O notation
  - T(N) is a member of O(f(N)) means that T(N) ≤ c*f(N) for some positive constant c
- Big Omega notation
  - T(N) is a member of Ω(f(N)) means that T(N) ≥ c*f(N) for some positive constant c
- Big Theta notation
  - T(N) is a member of Θ(f(N)) means that c_1*f(N) ≤ T(N) ≤ c_2*f(N) for positive constants c_1 and c_2
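For reference, the same three definitions in their standard quantified textbook form (an editor-added restatement, not taken from the slides; the bounds need only hold for all sufficiently large N):

```latex
\begin{align*}
T(N) \in O(f(N))      &\iff \exists\, c > 0,\ N_0 :\ T(N) \le c\, f(N) \text{ for all } N \ge N_0 \\
T(N) \in \Omega(f(N)) &\iff \exists\, c > 0,\ N_0 :\ T(N) \ge c\, f(N) \text{ for all } N \ge N_0 \\
T(N) \in \Theta(f(N)) &\iff \exists\, c_1, c_2 > 0,\ N_0 :\ c_1 f(N) \le T(N) \le c_2 f(N) \text{ for all } N \ge N_0
\end{align*}
```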