Analysis of execution time for a parallel algorithm to determine if it is worth the effort to code and debug in parallel


Performance Analysis

Introduction
- Analysis of execution time for a parallel algorithm to determine if it is worth the effort to code and debug in parallel
- Understanding barriers to high performance and predicting improvement
- Goal: to figure out whether a program merits parallelization

Speedup and efficiency

Speedup
- Expectation is for the parallel code to run faster than its sequential counterpart
- Ratio of sequential execution time to parallel execution time
- Execution time components:
  - Inherently sequential computations: σ(n)
  - Potentially parallel computations: ϕ(n)
  - Communication operations and other repeated computations: κ(n,p)
- Speedup ψ(n,p): solving a problem of size n on p processors
- On a sequential computer:
  - Only one computation at a time
  - No interprocess communication or other overheads
  - Time required: σ(n) + ϕ(n)
- On a parallel computer:
  - Cannot do anything about the inherently sequential computation
  - In the best case, the parallel computation is equally divided among the p PEs
  - Also need to add time for interprocess communication among PEs
  - Time required: σ(n) + ϕ(n)/p + κ(n,p)
- Parallel execution time will be larger if the computation cannot be perfectly divided among processors, leading to smaller speedup
- Actual speedup is limited by

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))

- Adding processors reduces computation time but increases communication time
- At some point, the increase in communication time is larger than the decrease in computation time
- Elbow-out point: the speedup begins to decline with additional PEs
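As a sanity check, the bound above can be encoded directly. This is only a sketch: the cost functions sigma, phi, and kappa below are hypothetical stand-ins chosen for illustration, not values from these notes.

```python
# Hypothetical cost model (illustrative values only):
# sigma(n): inherently sequential time, phi(n): parallelizable time,
# kappa(n, p): communication and other parallel overhead.
def sigma(n):
    return 100.0

def phi(n):
    return float(n)

def kappa(n, p):
    return 10.0 * (p - 1)

def speedup_bound(n, p):
    """psi(n,p) <= (sigma + phi) / (sigma + phi/p + kappa)."""
    return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p))

# The elbow-out effect: speedup first rises and then falls as p grows,
# because kappa(n, p) eventually outweighs the shrinking phi(n)/p term.
```

With these made-up costs the bound peaks somewhere in the tens of processors and then declines, which is exactly the elbow-out behaviour described above.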

[Figure: computation time decreases and communication time increases as processors are added]

Efficiency
- Measure of processor utilization
- Ratio of speedup to number of processors used
- Efficiency ε(n,p) for a problem of size n on p processors is given by

    ε(n,p) = ψ(n,p)/p
    ε(n,p) ≤ (σ(n) + ϕ(n)) / (p(σ(n) + ϕ(n)/p + κ(n,p)))
           = (σ(n) + ϕ(n)) / (pσ(n) + ϕ(n) + pκ(n,p))

- It is easy to see that 0 ≤ ε(n,p) ≤ 1

Amdahl's law
- Consider the expression for speedup ψ(n,p) above
- Since κ(n,p) > 0,

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p)) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)

- This gives us the maximum possible speedup in the absence of communication overhead
- Let f denote the inherently sequential fraction of the computation:

    f = σ(n) / (σ(n) + ϕ(n))

Then, we have

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
           = (σ(n)/f) / (σ(n) + σ(n)(1/f − 1)/p)
           = (1/f) / (1 + (1/f − 1)/p)
           = 1 / (f + (1 − f)/p)

Definition 1: Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

    ψ ≤ 1 / (f + (1 − f)/p)

- Based on the assumption that we are trying to solve a problem of fixed size as quickly as possible
- Provides an upper bound on the speedup achievable with a given number of processors trying to solve the problem in parallel
- Also useful to determine the asymptotic speedup as the number of processors increases

Example 1
- Determine whether it is worthwhile to develop a parallel version of a program to solve a particular problem
- 95% of the program's execution occurs inside a loop that can be executed in parallel
- Maximum speedup expected from a parallel version of the program executing on 8 CPUs:

    ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

- Expect a speedup of 5.9 or less

Example 2
- 20% of a program's execution time is spent within inherently sequential code
- Limit to the speedup achievable by a parallel version of the program:

    ψ = lim_{p→∞} 1 / (0.2 + (1 − 0.2)/p) = 1/0.2 = 5

Example 3
- Parallel version of a sequential program with time complexity Θ(n²), n being the size of the dataset
- Time needed to input the dataset and output the result (sequential portion of the code): (18000 + n) μs
- The computational portion can be executed in parallel in time (n²/100) μs
- Maximum speedup on a problem of size n = 10,000:

    ψ ≤ (28000 + 1000000) / (28000 + 1000000/p)
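The three examples reduce to a one-line transcription of Amdahl's law; a minimal sketch (the helper name amdahl is mine):

```python
def amdahl(f, p):
    """Amdahl's law: maximum speedup with inherently sequential
    fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# Example 1: f = 0.05 on 8 CPUs -> about 5.9
# Example 2: f = 0.20, p -> infinity -> approaches 1/0.2 = 5 from below
```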

Limitations of Amdahl's law
- Ignores communication overhead κ(n,p)
- Assume that the parallel version of the program in Example 3 has ⌈log n⌉ communication points
- Communication time at each of these points: 10000⌈log p⌉ + (n/10) μs
- For a problem with n = 10000, the total communication time is 14(10000⌈log p⌉ + 1000) μs, since ⌈log 10000⌉ = 14
- Now, the speedup is given by

    ψ ≤ (28000 + 1000000) / (42000 + 1000000/p + 140000⌈log p⌉)

- Ignoring communication time results in overestimation of the speedup

The Amdahl effect
- Typically, κ(n,p) has lower complexity than ϕ(n)
- In the hypothetical case of Example 3: κ(n,p) = Θ(n log n + log n · log p), while ϕ(n) = Θ(n²)
- As n increases, ϕ(n)/p dominates κ(n,p)
- For a fixed p, as n increases, the speedup increases

Review of Amdahl's law
- Treats problem size as a constant
- Shows how execution time decreases as the number of processors increases

Another perspective
- We often use faster computers to solve larger problem instances
- Treat time as a constant and allow the problem size to increase with the number of CPUs

Gustafson-Barsis's law
- Assumption in Amdahl's law: minimizing execution time is the focus of parallel computing
- Assumes constant problem size and demonstrates how increasing PEs can reduce time
- A user may instead want to use parallelism to improve the accuracy of the result
- Treat time as a constant and let the problem size increase with the number of processors
- Solve a problem to higher accuracy within the available time by increasing the number of PEs
- The sequential fraction of a computation may decrease as the problem size increases (the Amdahl effect)
- Increasing the number of PEs allows an increase in problem size, decreasing the inherently sequential fraction of the computation and increasing the quotient between serial and parallel execution times
- Consider the speedup expression again (with κ(n,p) ≈ 0):

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
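The overestimation from ignoring κ(n,p) can be made concrete by evaluating both Example 3 bounds side by side. A sketch using the numbers above; math.ceil(math.log2(p)) stands in for ⌈log p⌉:

```python
import math

SEQ_TIME = 28000 + 1000000  # sequential time for n = 10,000, in microseconds

def amdahl_bound(p):
    """Example 3 bound ignoring communication."""
    return SEQ_TIME / (28000 + 1000000 / p)

def comm_aware_bound(p):
    """Example 3 bound including the 14 communication points."""
    denom = 42000 + 1000000 / p + 140000 * math.ceil(math.log2(p))
    return SEQ_TIME / denom
```

At p = 16, for instance, the Amdahl-only bound is above 11 while the communication-aware bound is below 2, so the overhead term dominates the prediction.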

Let s be the fraction of time spent in the parallel computation performing inherently sequential operations:

    s = σ(n) / (σ(n) + ϕ(n)/p)

The fraction of time spent performing pure parallel computations is then

    1 − s = (ϕ(n)/p) / (σ(n) + ϕ(n)/p)

This leads to

    σ(n) = (σ(n) + ϕ(n)/p) s
    ϕ(n) = (σ(n) + ϕ(n)/p)(1 − s) p

Now, the speedup is given by

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
           = ((σ(n) + ϕ(n)/p)s + (σ(n) + ϕ(n)/p)(1 − s)p) / (σ(n) + ϕ(n)/p)
           = s + (1 − s)p
           = p + (1 − p)s

Definition 2: Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup ψ achievable by this program is

    ψ ≤ p + (1 − p)s

Predicts scaled speedup
- Amdahl's law looks at the serial computation and predicts how much faster it will be on multiple processors
- It does not scale the availability of computing power as the number of PEs increases
- Gustafson-Barsis's law begins with the parallel computation and estimates the speedup compared to a single processor
- In many cases, assuming a single processor is only p times slower than p processors is overly optimistic:
  - Solve a problem on a parallel computer with 16 PEs, each with 1 GB of local memory
  - Let the dataset occupy 15 GB; the aggregate memory of the parallel computer is barely large enough to hold the dataset and multiple copies of the program
  - On a single-processor machine, the entire dataset may not fit in primary memory
  - If the working set size of the process exceeded 1 GB, it may begin to thrash, taking more than 16 times as long to execute the parallel portion of the code as the group of 16 PEs
- Effectively, in Gustafson-Barsis's law, the speedup is the time required by the parallel program divided into the time that would be required to solve the same problem on a single CPU, if it had sufficient memory
- The speedup predicted by Gustafson-Barsis's law is called scaled speedup
- Using the parallel computation as the starting point, rather than the sequential computation, it allows the problem size to be an increasing function of the number of PEs
- A driving metaphor

Amdahl's law: Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph; no matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city. Since it has already taken you one hour and you have a distance of 60 miles total, going infinitely fast you would only achieve 60 mph.

Gustafson's law: Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long and how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour, and so on.

Example 1
- An application running on 10 processors spends 3% of its time in serial code
- Scaled speedup of this application:

    ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73

Example 2
- Desired scaled speedup of 7 on 8 processors
- Maximum fraction of the program's parallel execution time that can be spent in serial code:

    7 = 8 + (1 − 8)s  ⟹  s = 1/7 ≈ 0.14

Example 3
- Vicki plans to justify her purchase of a $30 million Gadzooks supercomputer by demonstrating that its 16,384 processors can achieve a scaled speedup of 15,000 on a problem of great importance to her employer. What is the maximum fraction of the parallel execution time that can be devoted to inherently sequential operations if her application is to achieve this goal?

    15000 ≤ 16384 + (1 − 16384)s  ⟹  s ≤ 1384/16383 ≈ 0.084

Application in research
- Amdahl's law assumes that the computing requirements will stay the same, given increased processing power
- Analysis of the same data will take less time with more computing power
- Gustafson's law argues that more computing power will cause the data to be more carefully and fully analyzed, at a finer scale (pixel by pixel or unit by unit) rather than at a larger scale
- Example: simulating the impact of a nuclear detonation on every building, car, and their contents

Karp-Flatt metric
- Amdahl's law and Gustafson-Barsis's law ignore the communication overhead κ(n,p)
- This results in overestimation of speedup or scaled speedup
- Use an experimentally determined serial fraction to get performance insight
- Execution time of a parallel program on p processors:

    T(n,p) = σ(n) + ϕ(n)/p + κ(n,p)
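Stepping back for a moment, the three scaled-speedup examples above reduce to one formula and its inversion; a minimal sketch (both function names are mine):

```python
def scaled_speedup(p, s):
    """Gustafson-Barsis: psi <= p + (1 - p) * s."""
    return p + (1 - p) * s

def max_serial_fraction(p, psi):
    """Largest serial fraction s that still allows scaled speedup psi."""
    return (p - psi) / (p - 1)

# Example 1: 3% serial code on 10 processors -> 9.73
# Example 2: scaled speedup 7 on 8 processors -> s <= 1/7, about 0.14
# Example 3: scaled speedup 15,000 on 16,384 processors -> s <= 1384/16383
```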

- No communication overhead for the serial program:

    T(n,1) = σ(n) + ϕ(n)

- The experimentally determined serial fraction e of a parallel computation is the total amount of idle and overhead time, scaled by two factors: the number of PEs (less one) and the sequential execution time:

    e = ((p − 1)σ(n) + pκ(n,p)) / ((p − 1) T(n,1))

- Now, we can rewrite the parallel execution time as

    T(n,p) = T(n,1)e + T(n,1)(1 − e)/p

- Use ψ as a shorthand for ψ(n,p):

    ψ = T(n,1)/T(n,p)  ⟹  T(n,1) = T(n,p)ψ

- Now, compute the time for p processors as

    T(n,p) = T(n,p)ψe + T(n,p)ψ(1 − e)/p
    1 = ψe + ψ(1 − e)/p
    1/ψ = e + (1 − e)/p = e + 1/p − e/p = e(1 − 1/p) + 1/p
    e = (1/ψ − 1/p) / (1 − 1/p)

Definition 3: Given a parallel computation exhibiting speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined to be

    e = (1/ψ − 1/p) / (1 − 1/p)

Useful metric for two reasons:
1. Takes into account parallel overhead (κ(n,p))
2. Detects other sources of overhead or inefficiency ignored in the speedup model:
   - Process startup time
   - Process synchronization
   - Imbalanced workload
   - Architectural overhead

Helps to determine whether the efficiency decrease in a parallel computation is due to limited opportunities for parallelism or to an increase in algorithmic/architectural overhead.
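The metric itself is one line of code, and the execution-time identity above can be spot-checked numerically. A sketch with hypothetical values of σ(n), ϕ(n), and κ(n,p) chosen by me for illustration:

```python
def karp_flatt(psi, p):
    """Experimentally determined serial fraction e = (1/psi - 1/p)/(1 - 1/p)."""
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p)

def check_model(sigma_n, phi_n, kappa_np, p):
    """Verify T(n,p) = T(n,1)*e + T(n,1)*(1 - e)/p for one set of costs."""
    t1 = sigma_n + phi_n                      # T(n,1)
    tp = sigma_n + phi_n / p + kappa_np       # T(n,p)
    e = ((p - 1) * sigma_n + p * kappa_np) / ((p - 1) * t1)
    return tp, t1 * e + t1 * (1 - e) / p      # the two should agree
```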

Example 1
- Consider the following speedup table for p processors:

    p   2     3     4     5     6     7     8
    ψ   1.82  2.50  3.08  3.57  4.00  4.38  4.71

- What is the primary reason for a speedup of only 4.71 on 8 CPUs?
- Compute e:

    e   0.1   0.1   0.1   0.1   0.1   0.1   0.1

- Since e is constant, the primary reason is a large serial fraction in the code: limited opportunity for parallelism

Example 2
- Consider the following speedup table for p processors:

    p   2     3     4     5     6     7     8
    ψ   1.87  2.61  3.23  3.73  4.14  4.46  4.71

- What is the primary reason for a speedup of only 4.71 on 8 CPUs?
- Compute e:

    e   0.070  0.075  0.080  0.085  0.090  0.095  0.100

- Since e is steadily increasing, the primary reason is parallel overhead: time spent in process startup, communication, or synchronization, or an architectural constraint

Isoefficiency metric
- Parallel system: a parallel program executing on a parallel computer
- Scalability: a measure of a parallel system's ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Amdahl effect: speedup and efficiency are typically an increasing function of problem size, because communication complexity is lower than computational complexity
- The problem size can be increased to maintain the same level of efficiency when processors are added
- Isoefficiency: a way to measure scalability

Isoefficiency derivation steps
- Begin with the speedup formula
- Compute the total amount of overhead
- Assume efficiency remains constant
- Determine the relation between sequential execution time and overhead

Deriving the isoefficiency relation

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))
           = p(σ(n) + ϕ(n)) / (pσ(n) + ϕ(n) + pκ(n,p))
           = p(σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + (p − 1)σ(n) + pκ(n,p))
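Returning to the two speedup tables above, the diagnosis can be automated: compute e for every column and look at its trend. A sketch; the speedup values are copied from the two examples:

```python
def karp_flatt(psi, p):
    return (1 / psi - 1 / p) / (1 - 1 / p)

# Speedups from Example 1 and Example 2, for p = 2..8.
table1 = [1.82, 2.50, 3.08, 3.57, 4.00, 4.38, 4.71]
table2 = [1.87, 2.61, 3.23, 3.73, 4.14, 4.46, 4.71]

e1 = [karp_flatt(psi, p) for p, psi in enumerate(table1, start=2)]
e2 = [karp_flatt(psi, p) for p, psi in enumerate(table2, start=2)]

# e1 is essentially flat near 0.1 (large serial fraction);
# e2 rises steadily (parallel overhead grows with p).
```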

Let T0(n,p) be the total time spent by all processors doing work not done by the sequential algorithm; it includes
- Time spent by processors executing inherently sequential code
- Time spent by all processors performing communication and redundant computations

    T0(n,p) = (p − 1)σ(n) + pκ(n,p)

Substituting T0(n,p) in the speedup equation:

    ψ(n,p) ≤ p(σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + T0(n,p))

Efficiency equals speedup divided by p:

    ε(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + T0(n,p))
           = 1 / (1 + T0(n,p)/(σ(n) + ϕ(n)))

Substituting T(n,1) = σ(n) + ϕ(n), we have

    ε(n,p) ≤ 1 / (1 + T0(n,p)/T(n,1))
    T0(n,p)/T(n,1) ≤ (1 − ε(n,p)) / ε(n,p)
    T(n,1) ≥ (ε(n,p) / (1 − ε(n,p))) T0(n,p)

For a constant level of efficiency, the fraction ε(n,p)/(1 − ε(n,p)) is a constant C. This yields the isoefficiency relation:

    T(n,1) ≥ C T0(n,p)

Definition 4: Suppose a parallel system exhibits efficiency ε(n,p). Define C = ε(n,p)/(1 − ε(n,p)) and T0(n,p) = (p − 1)σ(n) + pκ(n,p). In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

    T(n,1) ≥ C T0(n,p)

Isoefficiency relation
- Used to determine the range of processors for which a given level of efficiency can be maintained
- Parallel overhead increases with the number of processors
- Efficiency can be maintained by increasing the size of the problem being solved

Scalability function
- All analyses assume that the maximum problem size is limited by the amount of primary memory available
- Treat space as the limiting factor in the analysis
- Assume that the parallel system has isoefficiency relation n ≥ f(p)
- M(n) denotes the amount of memory needed to store a problem of size n
- The relation M(n) ≥ M(f(p)) indicates how the amount of memory used must increase as a function of p to maintain a constant level of efficiency
- The total amount of memory available is a linear function of the number of processors

- The amount of memory per processor must increase as M(f(p))/p to maintain the same level of efficiency
- Scalability function: M(f(p))/p
- The complexity of the scalability function determines the range of processors for which a constant level of efficiency can be maintained
- If M(f(p))/p = Θ(1), the memory requirement per processor is constant and the system is perfectly scalable
- If M(f(p))/p = Θ(p), the memory requirement per processor increases linearly with the number of processors:
  - While memory is available, it is possible to maintain the same level of ε by increasing the problem size
  - But the memory used per processor increases linearly with p; at some point it may reach the memory capacity of the system
  - This limits efficiency once the number of processors increases beyond a certain level
- Similar arguments apply for the cases M(f(p))/p = Θ(log p) and M(f(p))/p = Θ(p log p)
- In general, the lower the complexity of M(f(p))/p, the higher the scalability of the parallel system

Example: Reduction
- Computational complexity of the sequential reduction algorithm: Θ(n)
- Time complexity of the reduction step in the parallel algorithm: Θ(log p)
- Since every PE participates in this step, T0(n,p) = Θ(p log p)
- Isoefficiency relation for the reduction algorithm (with a constant C):

    n ≥ Cp log p

- Since the sequential algorithm reduces n values, M(n) = n, leading to

    M(Cp log p)/p = Cp log p / p = C log p

- Reduction of n values on p PEs:
  - Each PE adds about n/p values and participates in a reduction with log p steps
  - If we double the values of n and p, each PE still adds about n/p values, but the reduction time increases to log 2p; this drops efficiency
  - To maintain the same level of efficiency, n has to more than double when the number of PEs doubles

Example: Floyd's algorithm
- The sequential algorithm has time complexity Θ(n³)
- Each of the p PEs spends Θ(n² log p) time in communications
- Isoefficiency relation:

    n³ ≥ C(pn² log p)  ⟹  n ≥ Cp log p

- Consider the memory requirements along with problem size n
- The amount of storage needed to represent a problem of size n is n², i.e., M(n) = n²
- The scalability function is

    M(Cp log p)/p = C²p² log²p / p = C²p log²p

- Poor scalability compared to parallel reduction
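The two scalability functions can be compared directly; a sketch assuming C = 1 and base-2 logarithms (both choices are mine, for illustration):

```python
import math

def memory_per_pe_reduction(p, C=1.0):
    """Parallel reduction: n >= C p log p, M(n) = n,
    so M(f(p))/p = C log p."""
    n = C * p * math.log2(p)
    return n / p

def memory_per_pe_floyd(p, C=1.0):
    """Floyd's algorithm: n >= C p log p, M(n) = n^2,
    so M(f(p))/p = C^2 p log^2 p."""
    n = C * p * math.log2(p)
    return n * n / p
```

Doubling p from 32 to 64 grows the per-processor memory by only 20% for reduction (log p goes from 5 to 6) but by nearly 3x for Floyd's algorithm, which is why its scalability is considered poor.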