John Weatherwax. Analysis of Parallel Depth First Search Algorithms

Supplementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Computing by Vipin Kumar, Ananth Grama, Anshul Gupta, & George Karypis

John Weatherwax

Chapter 8

Analysis of Parallel Depth First Search Algorithms

Following the discussion in the book on this topic I will clarify some points that I had difficulty following. Hopefully, these comments will be of help to others as they learn this material.

In the presented analysis, there is a total of $W$ work (searching) to be done and we initially have all of it assigned to a single processor. This is in fact the way most parallel programs begin, so our assumptions are in line with reality. Although we have not yet specified the load balancing scheme, we are guaranteed that this $W$ amount of work will be split up and distributed to the remaining processors provided the load balancing scheme is weakly fair. Intuitively, this means that after enough requests for work have been issued in the system (from any processor), every processor will eventually have work requested from it. In other words, no processor is excluded from being asked to provide work.

To quantify how many requests must take place before we can guarantee that every processor has had work requested from it, we define $V(p)$ to be the number of requests that must take place in the system before we can guarantee that every processor has been polled for work. Obviously, $V(p)$ depends on the specific load balancing scheme used, be it global round robin, randomized polling, etc. We see that at least $V(p) \ge p$, since if $V(p) < p$ we have not even generated enough polling to request from every existing processor. The specific form of $V(p)$ will be given when we discuss the explicit load balancing schemes below.

In our initial situation, with all the work $W$ at one processor, we must have $V(p)$ requests in our system before we can be guaranteed to have requested work from that processor. With the first request that is accepted, the maximal work at any processor in the system drops to $(1-\alpha)W$. This is assuming an $\alpha$-splitting as discussed in the book. After $2V(p)$ requests it drops to $(1-\alpha)^2 W$, etc. After $nV(p)$ total requests it drops to $(1-\alpha)^n W$. To determine the number $n$ of blocks of $V(p)$ requests before our tasks are so small they can no longer be split, we enforce the requirement that after these $n$ work divisions the maximal work left in the system be less than $\epsilon$. This gives

$$ (1-\alpha)^n W < \epsilon . $$

Solving for $n$ gives

$$ n > \log_{1/(1-\alpha)}\left(\frac{W}{\epsilon}\right) . $$

Thus after $n$ chunks of $V(p)$ requests we can guarantee that the maximal work at any given node in the system is less than $\epsilon$. The total number of work requests is therefore $O(nV(p))$, and since, for fixed $\alpha$ and $\epsilon$,

$$ O(n) = O\left(\log_{1/(1-\alpha)}\left(\frac{W}{\epsilon}\right)\right) = O(\log W), $$

the total number of work requests simplifies to $O(V(p) \log W)$ as given in the text. If we further assume that each request for work and the corresponding work transfer occupy constant time $t_{comm}$, an upper bound on the communication overhead is given by

$$ T_o = t_{comm} V(p) \log W . $$

For variations in which communication does not take place in constant time, see the problems from this chapter.

A Different Analysis of V(p) for Random Polling

Here is a different method of attack for obtaining the same answer for this question. As stated in the text, assume that we have $p$ total boxes to mark and we continue to make random attempts to mark an unmarked box. Define $f(q, n)$ to be the probability that $q$ boxes are marked after $n$ trials. Since the state with all $p$ boxes marked is never left once it is reached, $f(p,k) - f(p,k-1)$ is the probability that the last box is marked on exactly the $k$-th trial, and the average number of trials needed to mark all of the boxes is given by

$$ N = \sum_{k=1}^{\infty} k \left[ f(p, k) - f(p, k-1) \right] . \tag{1} $$

A simple partial difference equation for $f(q, n)$ is given by the following logic. The probability that $q+1$ boxes are marked after $n+1$ trials is the sum of the probabilities of two mutually exclusive events:

(i) $q+1$ boxes are marked after $n$ trials and attempt $n+1$ to mark a box fails;
(ii) $q$ boxes are marked after $n$ trials and attempt $n+1$ to mark a box succeeds.

Now at any stage, with $q$ (of $p$) boxes marked, the chance we mark another box (a success) on the next trial is $1 - \frac{q}{p}$ and the chance we do not (a failure) is $\frac{q}{p}$. With this background we can now express $f(q+1, n+1)$ as

$$ f(q+1, n+1) = \frac{q+1}{p}\, f(q+1, n) + \left(1 - \frac{q}{p}\right) f(q, n) \tag{2} $$

for $q \ge 0$ and $n \ge 1$. Note that this is a linear partial difference equation, for which many solution methods are available. For initial and boundary conditions on $f(q, n)$ we see from simple arguments that

$$ f(0, n) = 0 \quad \text{for } n \ge 1 \tag{3} $$
$$ f(q, 0) = 0 \quad \text{for } q \ge 1 \tag{4} $$
$$ f(q, n) = 0 \quad \text{for } q > n \tag{5} $$
$$ f(1, 1) = 1 \tag{6} $$
$$ f(q, q) = \frac{p^{(q)}}{p^q} = \frac{p(p-1)(p-2)\cdots(p-q+1)}{p^q} \quad \text{for } 1 \le q \le p \tag{7} $$

where $p^{(q)} = p(p-1)\cdots(p-q+1)$ denotes the falling factorial. We now desire the solution to the partial difference equation, Eq. 2. It will benefit us to write it decremented in $q$ by one, or as

$$ f(q, n+1) = \frac{q}{p}\, f(q, n) + \left(1 - \frac{q-1}{p}\right) f(q-1, n) \quad \text{for } n \ge 1,\ q \ge 1 . \tag{8} $$
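As a numerical sanity check on this recurrence, the following minimal Python sketch tabulates $f(q, n)$ with exact rational arithmetic and estimates the average number of trials $N$. It is only an illustration under stated assumptions: it takes $f(0,0) = 1$ (no boxes are marked before the first trial), uses a failure probability of $q/p$ and a success probability of $1 - (q-1)/p$ when moving from $q-1$ to $q$ marked boxes, and truncates the infinite sum for $N$ at a cutoff `n_max`. The function names are hypothetical and do not come from the book. The result is compared with $p H_p$ (with $H_p$ the $p$-th harmonic number), the standard coupon-collector expectation.

```python
from fractions import Fraction


def marking_probabilities(p, n_max):
    """Return f with f[n][q] = P(exactly q of p boxes are marked after n trials)."""
    f = [[Fraction(0)] * (p + 1) for _ in range(n_max + 1)]
    f[0][0] = Fraction(1)  # assumption: nothing is marked before the first trial
    for n in range(n_max):
        for q in range(p + 1):
            # Trial n+1 hits an already marked box (failure, probability q/p).
            stay = Fraction(q, p) * f[n][q]
            # Trial n+1 marks a new box, starting from q-1 marked boxes.
            grow = Fraction(p - q + 1, p) * f[n][q - 1] if q >= 1 else Fraction(0)
            f[n + 1][q] = stay + grow
    return f


def expected_trials(p, n_max):
    """Truncated average number of trials needed to mark all p boxes."""
    f = marking_probabilities(p, n_max)
    # f[k][p] - f[k-1][p] is the probability of finishing on exactly trial k.
    return sum(k * (f[k][p] - f[k - 1][p]) for k in range(1, n_max + 1))


def harmonic(p):
    """Exact p-th harmonic number H_p."""
    return sum(Fraction(1, i) for i in range(1, p + 1))


if __name__ == "__main__":
    p = 5
    print(float(expected_trials(p, 400)), float(p * harmonic(p)))
```

The truncation error decays geometrically in `n_max`, so for small $p$ a cutoff of a few hundred trials is far more than enough for the two printed values to agree to many digits.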

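Attempts to solve this partial difference equation are reported below using a variety of techniques. Some were more successful than others, which may indicate why the book chose the approach it took. Many reduce to the same calculations. In locations where I couldn't make more progress, I simply stopped, opting to come back to these derivations when my skills had improved.

Operator Method

As a first attempt, let's try to solve Eq. 8 using the Operator Method. As such we define

$$ E_1 f(q, n) = f(q+1, n) \tag{9} $$
$$ E_2 f(q, n) = f(q, n+1) \tag{10} $$

and our original equation becomes

$$ E_2 f = \left( \frac{q}{p} + \left(1 - \frac{q-1}{p}\right) E_1^{-1} \right) f . \tag{11} $$

Recognizing the above as $n$ iterations on $f(q, n)$, we have by induction, and then formally by the binomial theorem (treating the $q$-dependent coefficients as if they commuted with $E_1^{-1}$), that

$$ f(q, n) = \left( \frac{q}{p} + \left(1 - \frac{q-1}{p}\right) E_1^{-1} \right)^n g(q) \tag{12} $$
$$ = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{q}{p}\right)^k \left(1 - \frac{q-1}{p}\right)^{n-k} E_1^{-(n-k)} g(q) \tag{13} $$
$$ = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{q}{p}\right)^k \left(1 - \frac{q-1}{p}\right)^{n-k} g(q - n + k) \tag{14} $$

where $g(q)$ denotes the values at $n = 0$. As an interesting observation that may shed light on how to proceed, notice that these weights closely resemble the binomial distribution $\binom{n}{k} \left(\frac{q}{p}\right)^k \left(1 - \frac{q}{p}\right)^{n-k}$, that is, the probability of $k$ successes in $n$ trials where the probability of a single success is $\frac{q}{p}$.

The Z-transform with respect to n

Taking the Z-transform with respect to $n$ gives

$$ zF(q, z) - z f(q, 0) = \frac{q}{p} F(q, z) + \left(1 - \frac{q-1}{p}\right) F(q-1, z) \quad \text{for } q \ge 1 . \tag{15} $$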

Now $f(q, 0) = 0$ when $q > 0$, so the above can be solved for $F(q, z)$ giving

$$ F(q, z) = \frac{1 - \frac{q-1}{p}}{z - \frac{q}{p}}\, F(q-1, z) \quad \text{for } q \ge 1 . \tag{16} $$

Solving by iteration we obtain

$$ F(q, z) = \frac{\prod_{l=1}^{q} \left(1 - \frac{l-1}{p}\right)}{\prod_{l=1}^{q} \left(z - \frac{l}{p}\right)}\, F(0, z) . \tag{17} $$

Note that

$$ F(0, z) = Z(f(0, n)) = \sum_{n \ge 0} f(0, n) z^{-n} = f(0, 0) z^0 = f(0, 0) . \tag{18} $$

At this point we have not determined $f(0, 0)$. Limiting values of Eqs. 3 and 4 above would indicate that its value should be 0. This cannot be true, or else we end up with the trivial solution for $F(q, z)$. In addition, limiting values of Eq. 7 require $f(0, 0) = 1$, which is what we use in the analysis below.

To compute the inverse Z-transform of Eq. 17 we note that this expression can be separated into a sum of individual terms by partial fractions. To enable such a transformation we are looking for a set of coefficients $A_l$ such that

$$ \frac{1}{\prod_{l=1}^{q} \left(z - \frac{l}{p}\right)} = \sum_{l=1}^{q} \frac{A_l}{z - \frac{l}{p}} . \tag{19} $$

Multiplying both sides by $z - \frac{l}{p}$ and setting $z = \frac{l}{p}$ in the standard way, we obtain the following for the coefficients $A_l$:

$$ A_l = \prod_{m=1,\, m \ne l}^{q} \frac{1}{\frac{l}{p} - \frac{m}{p}} = p^{q-1} \prod_{m=1,\, m \ne l}^{q} \frac{1}{l - m} \quad \text{for } l = 1, 2, \ldots, q \tag{20} $$

so that with this substitution $F(q, z)$ becomes

$$ F(q, z) = \left( \prod_{l=1}^{q} \left(1 - \frac{l-1}{p}\right) \right) \sum_{l=1}^{q} \frac{A_l}{z - \frac{l}{p}} . \tag{21} $$

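Note that the product expression in the above equation can be written in terms of the factorial function as

$$ \prod_{l=1}^{q} \left(1 - \frac{l-1}{p}\right) = \prod_{l=1}^{q} \frac{p - l + 1}{p} \tag{22} $$
$$ = \frac{1}{p^q} \prod_{l=1}^{q} (p - l + 1) \tag{23} $$
$$ = \frac{p(p-1)(p-2)\cdots(p-q+1)}{p^q} \tag{24} $$
$$ = \frac{p^{(q)}}{p^{q}} \tag{25} $$

where $p^{(q)}$ is the falling factorial, so that at this point $F(q, z)$ becomes

$$ F(q, z) = \frac{p^{(q)}}{p^{q}} \sum_{l=1}^{q} \frac{A_l}{z - \frac{l}{p}} . \tag{26} $$

To further compute the inverse Z-transform of this expression, recall that the Z-transform of the Heaviside unit step (with jump occurring at 1) is given by (see Page 2 in the book by Kelley and Peterson [1]: Example 3.37)

$$ Z(u_n(1)) = \sum_{n \ge 0} u_n(1) z^{-n} = \frac{1}{z - 1} \tag{27} $$

and recalling the geometric product formula for the Z-transform

$$ Z(a^n y_n) = Y\left(\frac{z}{a}\right) \tag{28} $$

we can invert the above function, obtaining

$$ f(q, n) = \frac{p^{(q)}}{p^{q}} \sum_{l=1}^{q} A_l \left(\frac{l}{p}\right)^{n-1} u_n(1) \tag{29} $$
$$ = \frac{p^{(q)}}{p^{q+n-1}}\, u_n(1) \sum_{l=1}^{q} A_l\, l^{\,n-1} . \tag{30} $$

The Method of Generating Functions

In this method we multiply both sides of the equation by $x^n$ and sum over $n \ge 0$. This gives

$$ \sum_{n \ge 0} f(q, n+1) x^n = \frac{q}{p} \sum_{n \ge 0} f(q, n) x^n + \left(1 - \frac{q-1}{p}\right) \sum_{n \ge 0} f(q-1, n) x^n . \tag{31} $$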

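Defining $F(q, x) = \sum_{n \ge 0} f(q, n) x^n$, the above becomes

$$ \sum_{n \ge 1} f(q, n) x^{n-1} = \frac{q}{p} F(q, x) + \left(1 - \frac{q-1}{p}\right) F(q-1, x) \tag{32} $$

or, using $f(q, 0) = 0$ for $q \ge 1$,

$$ \frac{1}{x} F(q, x) = \frac{q}{p} F(q, x) + \left(1 - \frac{q-1}{p}\right) F(q-1, x) \tag{33} $$

or

$$ F(q, x) = \frac{1 - \frac{q-1}{p}}{\frac{1}{x} - \frac{q}{p}}\, F(q-1, x) , \tag{34} $$

which is the Z-transform recursion with $z$ replaced by $1/x$.

The Method of Separation of Variables

To use Separation of Variables we define $f(q, n) = F_q G_n$ and insert this into our partial difference equation above to get

$$ F_q G_{n+1} = \frac{q}{p} F_q G_n + \left(1 - \frac{q-1}{p}\right) F_{q-1} G_n . \tag{35} $$

Now dividing by $G_n F_q$ we obtain

$$ \frac{G_{n+1}}{G_n} = \frac{q}{p} + \left(1 - \frac{q-1}{p}\right) \frac{F_{q-1}}{F_q} . \tag{36} $$

Since the left hand side is a function only of $n$ and the right hand side is a function only of $q$, they each must equal a separation constant, which we denote $\alpha$ (not to be confused with the splitting fraction used earlier). With this substitution the equation for $G_n$ becomes

$$ G_{n+1} = \alpha G_n \tag{37} $$

with a solution of

$$ G_n = G_0 \alpha^n . \tag{38} $$

Similarly $F_q$ must satisfy

$$ \frac{q}{p} + \left(1 - \frac{q-1}{p}\right) \frac{F_{q-1}}{F_q} = \alpha \tag{39} $$

or

$$ F_q = \frac{1 - \frac{q-1}{p}}{\alpha - \frac{q}{p}}\, F_{q-1} . \tag{40} $$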
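The closed form produced by the Z-transform method (Eq. 30) can be checked against direct iteration of the recurrence in Eq. 8. The sketch below is a hedged illustration, not part of the original derivation. It assumes $f(0,0) = 1$, the recurrence $f(q, n+1) = \frac{q}{p} f(q,n) + \left(1 - \frac{q-1}{p}\right) f(q-1,n)$, and the closed form $f(q,n) = \frac{p^{(q)}}{p^{q+n-1}} \sum_{l=1}^{q} A_l\, l^{\,n-1}$ for $q \ge 1$, $n \ge 1$, with $A_l = p^{q-1} \prod_{m \ne l} \frac{1}{l-m}$. The function names are hypothetical.

```python
from fractions import Fraction
from math import prod


def f_recurrence(p, q, n):
    """f(q, n) obtained by iterating the recurrence n times from f(0, 0) = 1."""
    row = [Fraction(1)] + [Fraction(0)] * p  # values f(k, 0) for k = 0..p
    for _ in range(n):
        # f(0, n) = 0 for n >= 1; the other entries follow the recurrence.
        row = [Fraction(0)] + [
            Fraction(k, p) * row[k] + Fraction(p - k + 1, p) * row[k - 1]
            for k in range(1, p + 1)
        ]
    return row[q]


def f_closed_form(p, q, n):
    """Partial-fraction closed form, intended for 1 <= q <= p and n >= 1."""
    falling = prod(p - j for j in range(q))  # p (p-1) ... (p-q+1)
    total = Fraction(0)
    for l in range(1, q + 1):
        denom = prod(l - m for m in range(1, q + 1) if m != l)
        a_l = Fraction(p ** (q - 1), denom)  # partial-fraction coefficient A_l
        total += a_l * l ** (n - 1)
    return Fraction(falling, p ** (q + n - 1)) * total


if __name__ == "__main__":
    p = 4
    for q in range(1, p + 1):
        print(q, [str(f_closed_form(p, q, n)) for n in range(1, 6)])
```

Because everything is done in exact rational arithmetic, agreement between the two functions is a genuine equality check rather than a floating-point comparison.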

Derivation of a difference equation for N

This may yet be another method that could yield manageable calculations.

Chapter 8 Solutions

Problem

To evaluate the performance of this scheme we will follow the discussion given in the book and compute the overhead due to communication, i.e. the overhead resulting from work requests and work transfers necessitated by the load balancing algorithm, be it asynchronous round robin, global round robin, or random polling. From the discussions in the book, the communication overhead is given by

$$ T_o = t_{comm} V(p) \log W . \tag{41} $$

This assumes a constant communication time $t_{comm}$, independent of the distance between processors and of the amount of data sent.

Problem 4

Part (a): Assume, as in the discussion in the text, that $W$ amount of work is initially placed at a single processor. In $V(p)$ requests, and assuming an $\alpha$-splitting, we reduce the maximum work at any given node to $(1-\alpha)W$. We also require distributing either $\alpha W$ or $(1-\alpha)W$ to the requesting processor (in a real implementation where communication bandwidth is limited one would want to transmit the smaller of those two amounts). This transmission requires a communication time proportional to either $\sqrt{\alpha W}$ or $\sqrt{(1-\alpha)W}$ and is bounded above by the larger of these two. Since we assume that $0 < \alpha < 0.5$, this is $\sqrt{(1-\alpha)W}$. This total negotiation for work results in at most $V(p)$ requests, each with $t_{comm}$ communication time, and the additional transfer of work, which is bounded above by $t_w \sqrt{(1-\alpha)W}$. This gives a total communication overhead of

$$ T_o^{(1)} = t_{comm} V(p) + t_w \sqrt{(1-\alpha) W} . \tag{42} $$

A second application of the same logic results in

$$ T_o^{(2)} = T_o^{(1)} + t_{comm} V(p) + t_w \sqrt{(1-\alpha)^2 W} \tag{43} $$
$$ = 2 t_{comm} V(p) + t_w \left( \sqrt{(1-\alpha) W} + \sqrt{(1-\alpha)^2 W} \right) . \tag{44} $$

Induction on this pattern over multiple blocks of work requests (until the work at all processors is less than the minimum split amount $\epsilon$) gives for the total overhead due to communication (transfer of data plus requests)

$$ T_o = t_{comm} V(p) \log W + t_w \sqrt{W} \sum_{i=1}^{n} (1-\alpha)^{i/2} \tag{45} $$

where, as in the notes for this section, $n > \log_{1/(1-\alpha)}\left(\frac{W}{\epsilon}\right)$. To determine the total communication overhead we must perform the summation in Eq. 45. The simplest way to evaluate this expression is to consider the limit when $n$ tends to infinity. In that case the summation is a geometric series with ratio $\sqrt{1-\alpha} < 1$ and converges to a constant, so that the total overhead is given by

$$ T_o = t_{comm} V(p) \log(W) + O(t_w \sqrt{W}) . \tag{46} $$

Now Eq. 46 represents the communication required by the load balancing strategy through its dependence on $V(p)$. For load balancing by Global Round Robin (GRR) the function $V(p) = p$, since we must have at least $p$ work requests in the system before we can be guaranteed that every processor has received at least one request, and Eq. 46 becomes

$$ T_o = t_{comm}\, p \log(W) + C\, t_w \sqrt{W} + C_0 \tag{47} $$

with $C$ and $C_0$ constants. For the hypercube architecture the average distance between processors is $\Theta(\log(p))$, so that the above becomes

$$ T_o = O(p \log(p) \log(W)) + O(t_w \sqrt{W}) . \tag{48} $$

To derive the iso-efficiency function we balance the communication overhead $T_o$ with the total work to be done $W$, giving

$$ W = O(p \log(p) \log(W)) + O(\sqrt{W}) \tag{49} $$

and we desire to solve the above asymptotically for $W = W(p)$. Asymptotically $\sqrt{W} \ll W$, and we drop the $O(\sqrt{W})$ term from the above, giving $W \sim p \log(p) \log(W)$ for large $W$. Substituting this expression for $W$ into its own right hand side gives

$$ W = O\big(p \log(p) \log(p \log(p) \log(W))\big) \tag{50} $$
$$ = O\big(p \log(p) \left( \log(p) + \log(\log(p)) + \log(\log(W)) \right)\big) \tag{51} $$
$$ \approx O\big(p \log(p)^2\big) . \tag{52} $$

As discussed in the text for global round robin, the iso-efficiency due to contention for the shared counter is $O(p^2 \log(p))$, and this dominates the iso-efficiency above.

Part (b): In this case, after the system has seen $V(p)$ work requests the maximal work at any one processor is reduced to $W(1-\alpha)$, and after two blocks of $V(p)$ work requests the maximal work at any one processor is $W(1-\alpha)^2$. As before, after $n$ blocks of $V(p)$ work requests the maximal work at any one processor is $W(1-\alpha)^n$. Under the assumptions of this part of the problem, the communication time for the transfer of work is bounded above by

$$ \log(W (1-\alpha)^n) . \tag{53} $$

So the total time required for the data transferred is bounded above by

$$ \sum_{i=1}^{n} t_w \log(W (1-\alpha)^i) = \sum_{i=1}^{n} t_w \left[ \log(W) + i \log(1-\alpha) \right] \tag{54} $$
$$ = t_w\, n \log(W) + t_w \log(1-\alpha) \sum_{i=1}^{n} i \tag{55} $$

with $t_w$ the per word data transfer time (the time required to transfer a word of data). Since $n = O(\log(W))$ as discussed above, we observe that the above is bounded above by

$$ t_w (\log(W))^2 + t_w \log(1-\alpha)\, n^2 = O(\log(W)^2) . \tag{56} $$

Thus the overhead due to communication is

$$ T_o = t_{comm} V(p) \log(W) + (\log(W))^2 \tag{57} $$

and since for Global Round Robin $V(p) = p$ we obtain

$$ T_o = t_{comm}\, p \log(W) + (\log(W))^2 . \tag{58} $$
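The transfer-time sum of Eqs. 54 and 55 can be evaluated numerically to see the $O(\log(W)^2)$ growth. The following is a minimal sketch under stated assumptions: the minimum split size `eps`, the per word time `t_w`, and the choice of $n$ as the smallest integer with $(1-\alpha)^n W < \epsilon$ are illustrative values, natural logarithms are used, and the function name is hypothetical.

```python
import math


def transfer_time(W, alpha, eps=1.0, t_w=1.0):
    """Return (n, direct sum of Eq. 54, closed form of Eq. 55)."""
    # Smallest integer n with (1 - alpha)^n * W < eps (illustrative choice).
    n = math.floor(math.log(W / eps) / math.log(1.0 / (1.0 - alpha))) + 1
    # Term-by-term sum of t_w * log(W (1 - alpha)^i) for i = 1..n.
    direct = sum(t_w * math.log(W * (1.0 - alpha) ** i) for i in range(1, n + 1))
    # Same sum written as t_w n log(W) + t_w log(1 - alpha) n (n + 1) / 2.
    closed = t_w * n * math.log(W) + t_w * math.log(1.0 - alpha) * n * (n + 1) / 2
    return n, direct, closed


if __name__ == "__main__":
    for W in (1e3, 1e6, 1e9, 1e12):
        n, direct, closed = transfer_time(W, alpha=0.25)
        # The last column stays bounded as W grows, consistent with O(log(W)^2).
        print(f"W={W:.0e} n={n} sum={direct:.2f} ratio={direct / math.log(W) ** 2:.3f}")
```

Because $\log(1-\alpha) < 0$, the second term in the closed form only reduces the total, so $t_w\, n \log(W)$ is already an upper bound.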

For the hypercube architecture $t_{comm} = \log(p)$ and we have

$$ T_o = p \log(p) \log(W) + (\log(W))^2 . \tag{59} $$

To derive the iso-efficiency function, set $T_o = W$ and solve for $W$ as a function of $p$. The assignment $T_o = W$ gives

$$ W = p \log(p) \log(W) + (\log(W))^2 . \tag{60} $$

Since $(\log(W))^2$ is subdominant to $W$ (the left hand side), equating $W$ with the $p \log(p)$ term gives the expression

$$ W = p \log(p) \log(W) \tag{61} $$

which is the same result as in Part (a) and gives an iso-efficiency function of $W = O(p (\log(p))^2)$ as before. Since in both of these situations the communication time due to data transfer did not affect the iso-efficiency function, we see a motivation for not including it in the discussion presented in the book.

References

[1] W. G. Kelley and A. C. Peterson. Difference Equations: An Introduction with Applications. Academic Press, New York, 1991.