Analysis of execution time for a parallel algorithm to determine if it is worth the effort to code and debug in parallel


Performance Analysis

Introduction
- Analysis of execution time for a parallel algorithm to determine if it is worth the effort to code and debug in parallel
- Understanding barriers to high performance and predicting improvement
- Goal: to figure out whether a program merits parallelization

Speedup and efficiency

Speedup
- Expectation is for the parallel code to run faster than its sequential counterpart
- Ratio of sequential execution time to parallel execution time
- Execution time components:
  - Inherently sequential computations: σ(n)
  - Potentially parallel computations: ϕ(n)
  - Communication operations and other repeated computations: κ(n,p)
- Speedup ψ(n,p): solving a problem of size n on p processors
- On a sequential computer:
  - Only one computation at a time
  - No interprocess communication or other overheads
  - Time required: σ(n) + ϕ(n)
- On a parallel computer:
  - Cannot do anything about the inherently sequential computation
  - In the best case, the parallel computation is equally divided among the p PEs
  - Also need to add time for interprocess communication among PEs
  - Time required: σ(n) + ϕ(n)/p + κ(n,p)
- Parallel execution time will be larger if the computation cannot be perfectly divided among processors, leading to smaller speedup
- Actual speedup is limited by

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))

- Adding processors reduces computation time but increases communication time
- At some point, the increase in communication time is larger than the decrease in computation time
- Elbow-out point: the speedup begins to decline with additional PEs
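As a sanity check, the bound above can be encoded directly. This is only a sketch: the cost functions sigma, phi, and kappa below are hypothetical stand-ins chosen for illustration, not values from these notes.

```python
# Hypothetical cost model (illustrative values only):
# sigma(n): inherently sequential time, phi(n): parallelizable time,
# kappa(n, p): communication and other parallel overhead.
def sigma(n):
    return 100.0

def phi(n):
    return float(n)

def kappa(n, p):
    return 10.0 * (p - 1)

def speedup_bound(n, p):
    """psi(n,p) <= (sigma + phi) / (sigma + phi/p + kappa)."""
    return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p))

# The elbow-out effect: speedup first rises and then falls as p grows,
# because kappa(n, p) eventually outweighs the shrinking phi(n)/p term.
```

With these made-up costs the bound peaks somewhere in the tens of processors and then declines, which is exactly the elbow-out behaviour described above.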

[Figure: computation time decreases and communication time increases as processors are added]

Efficiency
- Measure of processor utilization
- Ratio of speedup to number of processors used
- Efficiency ε(n,p) for a problem of size n on p processors is given by

    ε(n,p) = ψ(n,p)/p
    ε(n,p) ≤ (σ(n) + ϕ(n)) / (p(σ(n) + ϕ(n)/p + κ(n,p)))
           = (σ(n) + ϕ(n)) / (pσ(n) + ϕ(n) + pκ(n,p))

- It is easy to see that 0 ≤ ε(n,p) ≤ 1

Amdahl's law
- Consider the expression for speedup ψ(n,p) above
- Since κ(n,p) > 0,

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p)) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)

- This gives us the maximum possible speedup in the absence of communication overhead
- Let f denote the inherently sequential fraction of the computation:

    f = σ(n) / (σ(n) + ϕ(n))

Then, we have

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
           = (σ(n)/f) / (σ(n) + σ(n)(1/f − 1)/p)
           = (1/f) / (1 + (1/f − 1)/p)
           = 1 / (f + (1 − f)/p)

Definition 1: Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup ψ achievable by a parallel computer with p processors performing the computation is

    ψ ≤ 1 / (f + (1 − f)/p)

- Based on the assumption that we are trying to solve a problem of fixed size as quickly as possible
- Provides an upper bound on the speedup achievable with a given number of processors trying to solve the problem in parallel
- Also useful to determine the asymptotic speedup as the number of processors increases

Example 1
- Determine whether it is worthwhile to develop a parallel version of a program to solve a particular problem
- 95% of the program's execution occurs inside a loop that can be executed in parallel
- Maximum speedup expected from a parallel version of the program executing on 8 CPUs:

    ψ ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

- Expect a speedup of 5.9 or less

Example 2
- 20% of a program's execution time is spent within inherently sequential code
- Limit to the speedup achievable by a parallel version of the program:

    ψ = lim_{p→∞} 1 / (0.2 + (1 − 0.2)/p) = 1/0.2 = 5

Example 3
- Parallel version of a sequential program with time complexity Θ(n²), n being the size of the dataset
- Time needed to input the dataset and output the result (sequential portion of the code): (18000 + n) μs
- The computational portion can be executed in parallel in time (n²/100) μs
- Maximum speedup on a problem of size n = 10,000:

    ψ ≤ (28000 + 1000000) / (28000 + 1000000/p)
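The three examples reduce to a one-line transcription of Amdahl's law; a minimal sketch (the helper name amdahl is mine):

```python
def amdahl(f, p):
    """Amdahl's law: maximum speedup with inherently sequential
    fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# Example 1: f = 0.05 on 8 CPUs -> about 5.9
# Example 2: f = 0.20, p -> infinity -> approaches 1/0.2 = 5 from below
```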

Limitations of Amdahl's law
- Ignores communication overhead κ(n,p)
- Assume that the parallel version of the program in Example 3 has ⌈log n⌉ communication points
- Communication time at each of these points: 10000⌈log p⌉ + (n/10) μs
- For a problem with n = 10000, the total communication time is 14(10000⌈log p⌉ + 1000) μs, since ⌈log 10000⌉ = 14
- Now, the speedup is given by

    ψ ≤ (28000 + 1000000) / (42000 + 1000000/p + 140000⌈log p⌉)

- Ignoring communication time results in overestimation of the speedup

The Amdahl effect
- Typically, κ(n,p) has lower complexity than ϕ(n)
- In the hypothetical case of Example 3: κ(n,p) = Θ(n log n + log n · log p), while ϕ(n) = Θ(n²)
- As n increases, ϕ(n)/p dominates κ(n,p)
- For a fixed p, as n increases, the speedup increases

Review of Amdahl's law
- Treats problem size as a constant
- Shows how execution time decreases as the number of processors increases

Another perspective
- We often use faster computers to solve larger problem instances
- Treat time as a constant and allow the problem size to increase with the number of CPUs

Gustafson-Barsis's law
- Assumption in Amdahl's law: minimizing execution time is the focus of parallel computing
- Assumes constant problem size and demonstrates how increasing PEs can reduce time
- A user may instead want to use parallelism to improve the accuracy of the result
- Treat time as a constant and let the problem size increase with the number of processors
- Solve a problem to higher accuracy within the available time by increasing the number of PEs
- The sequential fraction of a computation may decrease as the problem size increases (the Amdahl effect)
- Increasing the number of PEs allows an increase in problem size, decreasing the inherently sequential fraction of the computation and increasing the quotient between serial and parallel execution times
- Consider the speedup expression again (with κ(n,p) ≈ 0):

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
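The overestimation from ignoring κ(n,p) can be made concrete by evaluating both Example 3 bounds side by side. A sketch using the numbers above; math.ceil(math.log2(p)) stands in for ⌈log p⌉:

```python
import math

SEQ_TIME = 28000 + 1000000  # sequential time for n = 10,000, in microseconds

def amdahl_bound(p):
    """Example 3 bound ignoring communication."""
    return SEQ_TIME / (28000 + 1000000 / p)

def comm_aware_bound(p):
    """Example 3 bound including the 14 communication points."""
    denom = 42000 + 1000000 / p + 140000 * math.ceil(math.log2(p))
    return SEQ_TIME / denom
```

At p = 16, for instance, the Amdahl-only bound is above 11 while the communication-aware bound is below 2, so the overhead term dominates the prediction.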

Let s be the fraction of time spent in the parallel computation performing inherently sequential operations:

    s = σ(n) / (σ(n) + ϕ(n)/p)

The fraction of time spent performing pure parallel computations is then

    1 − s = (ϕ(n)/p) / (σ(n) + ϕ(n)/p)

This leads to

    σ(n) = (σ(n) + ϕ(n)/p) s
    ϕ(n) = (σ(n) + ϕ(n)/p)(1 − s) p

Now, the speedup is given by

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p)
           = ((σ(n) + ϕ(n)/p)s + (σ(n) + ϕ(n)/p)(1 − s)p) / (σ(n) + ϕ(n)/p)
           = s + (1 − s)p
           = p + (1 − p)s

Definition 2: Given a parallel program solving a problem of size n using p processors, let s denote the fraction of total execution time spent in serial code. The maximum speedup ψ achievable by this program is

    ψ ≤ p + (1 − p)s

Predicts scaled speedup
- Amdahl's law looks at the serial computation and predicts how much faster it will be on multiple processors
- It does not scale the availability of computing power as the number of PEs increases
- Gustafson-Barsis's law begins with the parallel computation and estimates the speedup compared to a single processor
- In many cases, assuming a single processor is only p times slower than p processors is overly optimistic:
  - Solve a problem on a parallel computer with 16 PEs, each with 1 GB of local memory
  - Let the dataset occupy 15 GB; the aggregate memory of the parallel computer is barely large enough to hold the dataset and multiple copies of the program
  - On a single-processor machine, the entire dataset may not fit in primary memory
  - If the working set size of the process exceeded 1 GB, it may begin to thrash, taking more than 16 times as long to execute the parallel portion of the code as the group of 16 PEs
- Effectively, in Gustafson-Barsis's law, the speedup is the time required by the parallel program divided into the time that would be required to solve the same problem on a single CPU, if it had sufficient memory
- The speedup predicted by Gustafson-Barsis's law is called scaled speedup
- Using the parallel computation as the starting point, rather than the sequential computation, it allows the problem size to be an increasing function of the number of PEs
- A driving metaphor

Amdahl's law: Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph; no matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city. Since it has already taken you one hour and you have a distance of 60 miles total, going infinitely fast you would only achieve 60 mph.

Gustafson's law: Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long and how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour, and so on.

Example 1
- An application running on 10 processors spends 3% of its time in serial code
- Scaled speedup of this application:

    ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73

Example 2
- Desired scaled speedup of 7 on 8 processors
- Maximum fraction of the program's parallel execution time that can be spent in serial code:

    7 = 8 + (1 − 8)s  ⟹  s = 1/7 ≈ 0.14

Example 3
- Vicki plans to justify her purchase of a $30 million Gadzooks supercomputer by demonstrating that its 16,384 processors can achieve a scaled speedup of 15,000 on a problem of great importance to her employer. What is the maximum fraction of the parallel execution time that can be devoted to inherently sequential operations if her application is to achieve this goal?

    15000 ≤ 16384 + (1 − 16384)s  ⟹  s ≤ 1384/16383 ≈ 0.084

Application in research
- Amdahl's law assumes that the computing requirements will stay the same, given increased processing power
- Analysis of the same data will take less time with more computing power
- Gustafson's law argues that more computing power will cause the data to be more carefully and fully analyzed, at a finer scale (pixel by pixel or unit by unit) rather than at a larger scale
- Example: simulating the impact of a nuclear detonation on every building, car, and their contents

Karp-Flatt metric
- Amdahl's law and Gustafson-Barsis's law ignore the communication overhead κ(n,p)
- This results in overestimation of speedup or scaled speedup
- Use an experimentally determined serial fraction to get performance insight
- Execution time of a parallel program on p processors:

    T(n,p) = σ(n) + ϕ(n)/p + κ(n,p)
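Stepping back for a moment, the three scaled-speedup examples above reduce to one formula and its inversion; a minimal sketch (both function names are mine):

```python
def scaled_speedup(p, s):
    """Gustafson-Barsis: psi <= p + (1 - p) * s."""
    return p + (1 - p) * s

def max_serial_fraction(p, psi):
    """Largest serial fraction s that still allows scaled speedup psi."""
    return (p - psi) / (p - 1)

# Example 1: 3% serial code on 10 processors -> 9.73
# Example 2: scaled speedup 7 on 8 processors -> s <= 1/7, about 0.14
# Example 3: scaled speedup 15,000 on 16,384 processors -> s <= 1384/16383
```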

- No communication overhead for the serial program:

    T(n,1) = σ(n) + ϕ(n)

- The experimentally determined serial fraction e of a parallel computation is the total amount of idle and overhead time, scaled by two factors: the number of PEs (less one) and the sequential execution time:

    e = ((p − 1)σ(n) + pκ(n,p)) / ((p − 1) T(n,1))

- Now, we can rewrite the parallel execution time as

    T(n,p) = T(n,1)e + T(n,1)(1 − e)/p

- Use ψ as a shorthand for ψ(n,p):

    ψ = T(n,1)/T(n,p)  ⟹  T(n,1) = T(n,p)ψ

- Now, compute the time for p processors as

    T(n,p) = T(n,p)ψe + T(n,p)ψ(1 − e)/p
    1 = ψe + ψ(1 − e)/p
    1/ψ = e + (1 − e)/p = e + 1/p − e/p = e(1 − 1/p) + 1/p
    e = (1/ψ − 1/p) / (1 − 1/p)

Definition 3: Given a parallel computation exhibiting speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined to be

    e = (1/ψ − 1/p) / (1 − 1/p)

Useful metric for two reasons:
1. Takes into account parallel overhead (κ(n,p))
2. Detects other sources of overhead or inefficiency ignored in the speedup model:
   - Process startup time
   - Process synchronization
   - Imbalanced workload
   - Architectural overhead

Helps to determine whether the efficiency decrease in a parallel computation is due to limited opportunities for parallelism or to an increase in algorithmic/architectural overhead.
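The metric itself is one line of code, and the execution-time identity above can be spot-checked numerically. A sketch with hypothetical values of σ(n), ϕ(n), and κ(n,p) chosen by me for illustration:

```python
def karp_flatt(psi, p):
    """Experimentally determined serial fraction e = (1/psi - 1/p)/(1 - 1/p)."""
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p)

def check_model(sigma_n, phi_n, kappa_np, p):
    """Verify T(n,p) = T(n,1)*e + T(n,1)*(1 - e)/p for one set of costs."""
    t1 = sigma_n + phi_n                      # T(n,1)
    tp = sigma_n + phi_n / p + kappa_np       # T(n,p)
    e = ((p - 1) * sigma_n + p * kappa_np) / ((p - 1) * t1)
    return tp, t1 * e + t1 * (1 - e) / p      # the two should agree
```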

Example 1
- Consider the following speedup table for p processors:

    p   2     3     4     5     6     7     8
    ψ   1.82  2.50  3.08  3.57  4.00  4.38  4.71

- What is the primary reason for a speedup of only 4.71 on 8 CPUs?
- Compute e:

    e   0.1   0.1   0.1   0.1   0.1   0.1   0.1

- Since e is constant, the primary reason is a large serial fraction in the code: limited opportunity for parallelism

Example 2
- Consider the following speedup table for p processors:

    p   2     3     4     5     6     7     8
    ψ   1.87  2.61  3.23  3.73  4.14  4.46  4.71

- What is the primary reason for a speedup of only 4.71 on 8 CPUs?
- Compute e:

    e   0.070  0.075  0.080  0.085  0.090  0.095  0.100

- Since e is steadily increasing, the primary reason is parallel overhead: time spent in process startup, communication, or synchronization, or an architectural constraint

Isoefficiency metric
- Parallel system: a parallel program executing on a parallel computer
- Scalability: a measure of a parallel system's ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Amdahl effect: speedup and efficiency are typically an increasing function of problem size, because communication complexity is lower than computational complexity
- The problem size can be increased to maintain the same level of efficiency when processors are added
- Isoefficiency: a way to measure scalability

Isoefficiency derivation steps
- Begin with the speedup formula
- Compute the total amount of overhead
- Assume efficiency remains constant
- Determine the relation between sequential execution time and overhead

Deriving the isoefficiency relation

    ψ(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n)/p + κ(n,p))
           = p(σ(n) + ϕ(n)) / (pσ(n) + ϕ(n) + pκ(n,p))
           = p(σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + (p − 1)σ(n) + pκ(n,p))
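Returning to the two speedup tables above, the diagnosis can be automated: compute e for every column and look at its trend. A sketch; the speedup values are copied from the two examples:

```python
def karp_flatt(psi, p):
    return (1 / psi - 1 / p) / (1 - 1 / p)

# Speedups from Example 1 and Example 2, for p = 2..8.
table1 = [1.82, 2.50, 3.08, 3.57, 4.00, 4.38, 4.71]
table2 = [1.87, 2.61, 3.23, 3.73, 4.14, 4.46, 4.71]

e1 = [karp_flatt(psi, p) for p, psi in enumerate(table1, start=2)]
e2 = [karp_flatt(psi, p) for p, psi in enumerate(table2, start=2)]

# e1 is essentially flat near 0.1 (large serial fraction);
# e2 rises steadily (parallel overhead grows with p).
```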

Let T0(n,p) be the total time spent by all processors doing work not done by the sequential algorithm; it includes
- Time spent by processors executing inherently sequential code
- Time spent by all processors performing communication and redundant computations

    T0(n,p) = (p − 1)σ(n) + pκ(n,p)

Substituting T0(n,p) in the speedup equation:

    ψ(n,p) ≤ p(σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + T0(n,p))

Efficiency equals speedup divided by p:

    ε(n,p) ≤ (σ(n) + ϕ(n)) / (σ(n) + ϕ(n) + T0(n,p))
           = 1 / (1 + T0(n,p)/(σ(n) + ϕ(n)))

Substituting T(n,1) = σ(n) + ϕ(n), we have

    ε(n,p) ≤ 1 / (1 + T0(n,p)/T(n,1))
    T0(n,p)/T(n,1) ≤ (1 − ε(n,p)) / ε(n,p)
    T(n,1) ≥ (ε(n,p) / (1 − ε(n,p))) T0(n,p)

For a constant level of efficiency, the fraction ε(n,p)/(1 − ε(n,p)) is a constant C. This yields the isoefficiency relation:

    T(n,1) ≥ C T0(n,p)

Definition 4: Suppose a parallel system exhibits efficiency ε(n,p). Define C = ε(n,p)/(1 − ε(n,p)) and T0(n,p) = (p − 1)σ(n) + pκ(n,p). In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

    T(n,1) ≥ C T0(n,p)

Isoefficiency relation
- Used to determine the range of processors for which a given level of efficiency can be maintained
- Parallel overhead increases with the number of processors
- Efficiency can be maintained by increasing the size of the problem being solved

Scalability function
- All analyses assume that the maximum problem size is limited by the amount of primary memory available
- Treat space as the limiting factor in the analysis
- Assume that the parallel system has isoefficiency relation n ≥ f(p)
- M(n) denotes the amount of memory needed to store a problem of size n
- The relation M(n) ≥ M(f(p)) indicates how the amount of memory used must increase as a function of p to maintain a constant level of efficiency
- The total amount of memory available is a linear function of the number of processors

- The amount of memory per processor must increase as M(f(p))/p to maintain the same level of efficiency
- Scalability function: M(f(p))/p
- The complexity of the scalability function determines the range of processors for which a constant level of efficiency can be maintained
- If M(f(p))/p = Θ(1), the memory requirement per processor is constant and the system is perfectly scalable
- If M(f(p))/p = Θ(p), the memory requirement per processor increases linearly with the number of processors:
  - While memory is available, it is possible to maintain the same level of ε by increasing the problem size
  - But the memory used per processor increases linearly with p; at some point it may reach the memory capacity of the system
  - This limits efficiency once the number of processors increases beyond a certain level
- Similar arguments apply for the cases M(f(p))/p = Θ(log p) and M(f(p))/p = Θ(p log p)
- In general, the lower the complexity of M(f(p))/p, the higher the scalability of the parallel system

Example: Reduction
- Computational complexity of the sequential reduction algorithm: Θ(n)
- Time complexity of the reduction step in the parallel algorithm: Θ(log p)
- Since every PE participates in this step, T0(n,p) = Θ(p log p)
- Isoefficiency relation for the reduction algorithm (with a constant C):

    n ≥ Cp log p

- Since the sequential algorithm reduces n values, M(n) = n, leading to

    M(Cp log p)/p = Cp log p / p = C log p

- Reduction of n values on p PEs:
  - Each PE adds about n/p values and participates in a reduction with log p steps
  - If we double the values of n and p, each PE still adds about n/p values, but the reduction time increases to log 2p; this drops efficiency
  - To maintain the same level of efficiency, n has to more than double when the number of PEs doubles

Example: Floyd's algorithm
- The sequential algorithm has time complexity Θ(n³)
- Each of the p PEs spends Θ(n² log p) time in communications
- Isoefficiency relation:

    n³ ≥ C(pn² log p)  ⟹  n ≥ Cp log p

- Consider the memory requirements along with problem size n
- The amount of storage needed to represent a problem of size n is n², i.e., M(n) = n²
- The scalability function is

    M(Cp log p)/p = C²p² log²p / p = C²p log²p

- Poor scalability compared to parallel reduction
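The two scalability functions can be compared directly; a sketch assuming C = 1 and base-2 logarithms (both choices are mine, for illustration):

```python
import math

def memory_per_pe_reduction(p, C=1.0):
    """Parallel reduction: n >= C p log p, M(n) = n,
    so M(f(p))/p = C log p."""
    n = C * p * math.log2(p)
    return n / p

def memory_per_pe_floyd(p, C=1.0):
    """Floyd's algorithm: n >= C p log p, M(n) = n^2,
    so M(f(p))/p = C^2 p log^2 p."""
    n = C * p * math.log2(p)
    return n * n / p
```

Doubling p from 32 to 64 grows the per-processor memory by only 20% for reduction (log p goes from 5 to 6) but by nearly 3x for Floyd's algorithm, which is why its scalability is considered poor.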