
Unique Sets Oriented Parallelization of Loops with Non-uniform Dependences

Jialin Ju and Vipin Chaudhary
Parallel and Distributed Computing Laboratory, Wayne State University, Detroit, MI, USA

The Computer Journal, 1997

Although many methods exist for nested loop partitioning, most of them perform poorly when parallelizing loops with non-uniform dependences. This paper addresses the issue of automatic parallelization of loops with non-uniform dependences. Such loops normally are not parallelized by existing parallelizing compilers and transformations. Even when parallelized in rare instances, the performance is very poor. Our approach is based on the Convex Hull theory, which has adequate information to handle non-uniform dependences. We introduce the concepts of the Complete Dependence Convex Hull and of Unique Head and Tail Sets, and abstract the dependence information into these sets. These sets form the basis of the iteration space partitions. The properties of the unique head and tail sets are derived. Depending on the relative placement of these unique sets, partitioning schemes are suggested for implementation of our technique. Implementation results of our scheme on the Cray J916 and comparison with other schemes show the superiority of our technique.

Received November, 99; revised July, 1997

1. INTRODUCTION

Given a sequential program, a challenging problem for parallelizing compilers is to detect maximum parallelism. It is generally agreed upon, and shown in the study by Kuck et al., that most of the computation time is spent in loops. Current parallelizing compilers therefore concentrate on loop parallelization. A loop can be easily parallelized if there are no cross-iteration dependences. However, loops with cross-iteration dependences are very common, and parallelizing them is a major concern facing parallelizing compilers today.

Loops with cross-iteration dependences can be roughly divided into two groups. The first group is loops with static regular dependences, which can be analyzed at compile time. Examples 1 and 2 in Figure 1 belong to this group.
The other group is loops with dynamic irregular dependences, which have indirect access patterns. Example 3 shows a typical irregular loop, which is used for edge-oriented representation of sparse matrices. These kinds of loops cannot be parallelized at compile time, for lack of sufficient information. To execute such loops efficiently in parallel, runtime support must be provided. The major job of parallelizing compilers is to parallelize loops with static regular dependences.

Static regular loops can be further divided into two sub-groups: those with uniform dependences and those with non-uniform dependences. The dependences are uniform only when the patterns of dependence vectors are uniform. In other words, the dependence vectors can be expressed by constants, i.e., distance vectors. Example 1 illustrates a uniform dependence loop: both of its dependence vectors are constant. Figure 2(a) shows the dependence patterns of Example 1 in the iteration space. In the same fashion, we call dependences non-uniform when the dependence vectors are in irregular patterns which cannot be expressed by distance vectors. Figure 2(b) shows the dependence patterns of Example 2 in the iteration space.

A lot of research has been done in parallelizing loops with uniform dependences, from dependence analysis to loop transformation, such as loop interchange, loop permutation, skew, reversal, wavefront, tiling, etc. But little research has been done for loops with non-uniform dependences. The existing commercial and research parallelizing compilers, such as Stanford's SUIF, CSRD's Parafrase-2, and University of Maryland's Omega Project, can parallelize most of the loops with uniform dependences. But they do not satisfactorily handle loops with non-uniform dependences. Most of the time, the compiler treats such loops as unparallelizable and leaves them running sequentially. For instance, neither SUIF nor Parafrase-2 can parallelize the loop in Example 2.

FIGURE 1. Examples of loops with different kinds of dependences (three doubly nested do loops: Example 1, Example 2, Example 3)

FIGURE 2. Iteration spaces with (a) uniform dependences and (b) non-uniform dependences

Unfortunately, loops with non-uniform dependences are not uncommon in the real world. In an empirical study, Shen et al. observed that nearly 45% of two dimensional array references are coupled, which means that the array subscripts are linear combinations of loop indices. These coupled subscripts lead to non-uniform dependence. Hence, it is imperative to give loops with non-uniform dependence serious consideration, even though they are more difficult to parallelize.

This paper focuses on parallelization of perfectly nested loops with non-uniform dependences. The rest of this paper is organized as follows. Section two surveys the research in parallelization of non-uniform dependence loops. Section three reviews the Dependence Convex Hull theory and introduces the Complete Dependence Convex Hull. Section four gives the definition of unique sets and the techniques to find them. Section five presents our unique set oriented partitioning approach. Section six extends our technique to a general program model with multiple nestings. Section seven confirms the superiority of our technique with an implementation on the Cray J916 and comparison with previously proposed techniques. Finally, we conclude in section eight.

2. SURVEY OF RELATED RESEARCH

The convex hull created by solving the linear Diophantine equations is required for detecting parallelism in non-uniform loops, since it is the least abstraction that has adequate information to accomplish the detection of parallelism in such loops. Thus, most of the techniques proposed for parallelizing loops with non-uniform dependences are based on dependence convex hull theory. These can be classified into four categories: uniformization, uniform partitioning, non-uniform partitioning, and integer programming based partitioning.

2.1. Uniformization

Tzen and Ni proposed the dependence uniformization technique. Based on solving a system of Diophantine equations and a system of inequalities, they compute the maximal and minimal dependence slopes of any uniform and non-uniform dependence pattern in a two-dimensional iteration space. Then, by applying the idea of vector decomposition, a set of basic dependences is chosen to replace all original dependence constraints in every iteration so that the dependence pattern becomes uniform. They also proved that any doubly nested loop can always be uniformized to a uniform dependence loop with two dependence vectors. They proposed an index synchronization method to reduce the synchronization, in which synchronization can be systematically inserted. This uniformization helps in applying existing partitioning and scheduling techniques. But it imposes too many dependences on an iteration space which otherwise has only a few of them.

Chen and Yew presented a scheme which computes a Basic Dependence Vector Set and schedules the iterations using Static Strip Scheduling. They extended the dependence uniformization technique of Tzen and Ni and presented algorithms to compute better basic dependence vector sets which extract more parallelism from the nested loops. The program model is more general, including non-perfect nested loops. While this technique is definitely an improvement over Tzen and Ni's work, it also imposes too many dependences on the iteration space, thereby reducing the extractable parallelism. Moreover, this uniformization needs a lot of synchronization.

Chen and Shang proposed another uniformization technique. They form the set of basic dependence vectors and improve this set using certain objective functions. They select those basic dependence vectors which are time-optimal and cone-optimal. After uniformizing the iteration space, they use optimal linear schedules to order the execution of the iterations. This technique, like both of the previous uniformization techniques, imposes too many dependences.

2.2. Uniform Partitioning

Punyamurtula and Chaudhary extended the theory of Convex Hull to the Integer Dependence Convex Hull (IDCH) and proposed a Minimum Dependence Distance Tiling technique. Every integer point in the IDCH corresponds to a dependence vector in the iteration space of the nested loops. They showed that the minimum and maximum values of the dependence distance function occur at the extreme points of the IDCH. Therefore, it is only necessary to calculate the dependence distance at the extreme points and compare all the values of the distance to get the minimum dependence distance. These minimum dependence distances are used to partition the iteration space into tiles of uniform size and shape. The width of the tiles is less than or equal to the minimum dependence distance in at least one direction. This guarantees that for any dependence vector, its head and tail fall into different tiles.
Iterations in a tile would be executed in parallel. Tiles in a group would be executed in sequence, and the dependence slope information of Tzen and Ni can be used to synchronize the execution of inter-group tiles. This technique works very well for cases when the minimum distance in one direction is large. It does not work as well when the dependence distances are small, as it would involve too much synchronization overhead.

2.3. Non-uniform Partitioning

Zaafrani and Ito proposed the three-region technique. This technique divides the iteration space into two parallel regions and one sequential region. The iterations in the parallel regions can be executed fully in parallel, while the iterations in the sequential region can only be executed sequentially. The two parallel regions are called Area1 and Area2, respectively, and the sequential region is called Area3.

Area1 represents the part of the iteration space where the destination iteration comes lexically before the source iteration. The iterations in Area1 can be fully executed in parallel provided that variable renaming is performed. Area1 corresponds to the region where the direction vector is equal to (<, *) or equal to (=, <). Area2 represents the part of the iteration space where the destination iteration comes lexically after the source iteration and the source iteration is in Area1. If Area1 is executed first, then the nodes in Area2 can be executed in parallel. Area3 represents the rest of the iteration space (iteration space - (Area1 ∪ Area2)). Once Area1 and Area2 are executed, the nodes in Area3 should be executed sequentially.

Zaafrani and Ito apply their technique to the entire iteration space, though it would suffice to apply it only to the DCH or IDCH. The nodes that are not in the DCH can be executed in parallel because no dependences exist for these nodes. This is equivalent to dividing the iteration space into four regions (Area1, Area2, Area3, and non-DCH). Again, this technique has its disadvantages. The sequential part of the iteration space is the bottleneck for the performance.
If the sequential part of the iteration space is small, this technique is fine. Otherwise the sequential part can be a serious drawback in performance.

2.4. Integer Programming Based Approach

Tseng et al. proposed a partitioning scheme using Integer Programming techniques. They start with an original dependence vector set and divide it into eight groups. They find the minimum dependence vector set by solving integer programming formulations. Then they use the minimum dependence vector set to represent the dependence vectors of nested loops and partition the iterations of the loops into groups. All iterations in the same group can be executed at the same time. They also proposed a group synchronization method for arranging synchronization. But the method they used to compute the minimum dependence vector set may not always give minimum dependence distances. Besides, the integer programming approach is time-consuming.

Pugh and Wonnacott construct several sets of constraints that describe, for each statement, which iterations of that statement can be executed concurrently. By constructing constraints that correspond to different assumptions about which dependences might be eliminated through additional analysis, transformations, and user assertions, they determine whether they can expose parallelism by eliminating dependences. Then they look for conditional parallelism, and try to identify the kinds of iteration-reordering transformations that could be used to produce parallel loops. However, their method may produce false dependences.

FIGURE 3. Possible dependence directions in lexicographic order: (a) parallel to an axis, (b) 0° < θ < 90°, (c) parallel to the other axis, (d) -90° < θ < 0°

3. DEPENDENCE ANALYSIS

Cross-iteration dependence is the major concern that may keep a program from running in parallel. Of the four types of data dependences (flow, anti, output, and input dependence), input dependence imposes no ordering constraints, so we only look at the other three types. We do not consider output dependences as real dependences either: we can always use the storage replication technique to allow the statements which have output dependences to execute concurrently. This research therefore looks at the cases of flow dependences and anti dependences.

Data dependence defines the execution order among iterations. The execution order can be expressed as lexicographic order. Lexicographic order can be shown as an arrow in the iteration space, which also represents the dependence vector. All the arrows in Figure 3 are in lexicographic order. The iteration corresponding to the arrow head cannot be executed until the iteration corresponding to the tail has been executed. All the dependences discussed in this paper are put into lexicographic order. If there is a dependence from iteration i to iteration j, and i executes before j, we represent it by drawing an arrow i → j. Figure 3 shows all four possible directions when all the dependence vectors are put in lexicographic order for two levels of loops, where i is the index for the outer loop and j is the index for the inner loop. The running order implies that there cannot exist an arrow pointing to the left, or an arrow parallel to the j axis and pointing down. The arrows here are the dependence vectors.

3.1. Dependence and Convex Hull

Studies show that most of the loops with complex array subscripts are two dimensional loops. We start with this typical case.
We simplify our general program model to a normalized, doubly nested loop with coupled subscripts (i.e., with subscripts being linear functions of loop indices) as shown in Figure 4. We wish to discover what cross-iteration dependences exist between the two references to array A in the program model.

    do i = L1, U1
      do j = L2, U2
        A(a11 i + b11 j + c11, a12 i + b12 j + c12) = ...
        ... = A(a21 i + b21 j + c21, a22 i + b22 j + c22)

FIGURE 4. Doubly Nested Loop Model

There is a large variety of tests that can prove independence in some cases. It is infeasible to solve the problem directly, even for linear subscript expressions, because finding dependences is equivalent to the NP-complete problem of finding integer solutions to systems of linear Diophantine equations. Two general and approximate tests are the GCD test and Banerjee's inequalities. Recently, Subhlok and Kennedy proposed a new search procedure that identifies an integer solution in a convex region, or proves that no integer solutions exist.

The most common method to compute data dependence is to solve a set of linear Diophantine equations with a set of constraints which are the iteration boundaries. A dependence exists only if the equations have a solution. We want to find a set of integer solutions $(i_1, j_1, i_2, j_2)$ that satisfy the system of Diophantine equations (1) and the system of linear inequalities (2).

$$
\begin{aligned}
a_{11} i_1 + b_{11} j_1 + c_{11} &= a_{21} i_2 + b_{21} j_2 + c_{21} \\
a_{12} i_1 + b_{12} j_1 + c_{12} &= a_{22} i_2 + b_{22} j_2 + c_{22}
\end{aligned}
\qquad (1)
$$

$$
\begin{cases}
L_1 \le i_1 \le U_1 \\
L_2 \le j_1 \le U_2 \\
L_1 \le i_2 \le U_1 \\
L_2 \le j_2 \le U_2
\end{cases}
\qquad (2)
$$

Once the general solutions are found, dependence information can be represented by dependence vectors. The dependence is uniform when the dependence vectors are constants. Otherwise the dependence is non-uniform.
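As a minimal sketch of what the Diophantine system and the bound constraints express, the following Python snippet enumerates dependences by brute force for small bounds and collects the resulting distance vectors. The two loops used here (`UNIFORM` and `COUPLED`), the bounds 1..12, and all function names are assumptions chosen for illustration; exhaustive enumeration is only a stand-in for the symbolic solution discussed in the text.

```python
from itertools import product

def subscript(ref, i, j):
    """Evaluate a reference ((a1, b1, c1), (a2, b2, c2)) at iteration (i, j)."""
    (a1, b1, c1), (a2, b2, c2) = ref
    return (a1 * i + b1 * j + c1, a2 * i + b2 * j + c2)

def find_dependences(write_ref, read_ref, lo, hi):
    """Return (tail, head) iteration pairs that touch the same array element.

    The pair is ordered lexicographically, so the tail executes first.
    """
    iterations = list(product(range(lo, hi + 1), repeat=2))
    readers = {}
    for it in iterations:
        readers.setdefault(subscript(read_ref, *it), []).append(it)
    pairs = []
    for w in iterations:
        for r in readers.get(subscript(write_ref, *w), []):
            if w != r:  # ignore accesses within the same iteration
                pairs.append((min(w, r), max(w, r)))
    return pairs

def distance_vectors(pairs):
    """Set of distinct dependence vectors (head minus tail)."""
    return {(h[0] - t[0], h[1] - t[1]) for t, h in pairs}

# Hypothetical loops: A(i+1, j) = ... A(i, j)  and
# A(2i+3, j+1) = ... A(i+2j+1, i+j+3), both over 1..12 x 1..12.
UNIFORM = (((1, 0, 1), (0, 1, 0)), ((1, 0, 0), (0, 1, 0)))
COUPLED = (((2, 0, 3), (0, 1, 1)), ((1, 2, 1), (1, 1, 3)))

uniform_vectors = distance_vectors(find_dependences(*UNIFORM, 1, 12))
coupled_pairs = find_dependences(*COUPLED, 1, 12)
coupled_vectors = distance_vectors(coupled_pairs)
```

For the first loop a single constant vector results, while the coupled subscripts of the second loop yield several different vectors, which is what the text calls a non-uniform dependence.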

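The data dependence analysis techniques do well on loops with uniform dependences, since dependence distance vectors can be calculated precisely. A lot of research has been done on uniform dependence analysis and loop transformation techniques. However, for the case of non-uniform dependences, Yang, Ancourt and Irigoin showed that the direction vector alone does not have enough information for transforming non-uniform dependence. The Dependence Convex Hull (DCH) is the least requirement if we want to parallelize loops with non-uniform dependence. DCHs are convex polyhedrons and are subspaces of the solution space.

First of all, we show how to find DCHs. There are two approaches to solving the system of Diophantine equations (1). One way is to set $i_1$ to $x_1$ and $j_1$ to $y_1$ and get the solution for $i_2$ and $j_2$:

$$
\begin{aligned}
a_{21} i_2 + b_{21} j_2 + c_{21} &= a_{11} x_1 + b_{11} y_1 + c_{11} \\
a_{22} i_2 + b_{22} j_2 + c_{22} &= a_{12} x_1 + b_{12} y_1 + c_{12}
\end{aligned}
$$

We have the solution as

$$
\begin{aligned}
i_2 &= \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11} \\
j_2 &= \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12}
\end{aligned}
$$

where

$$
\begin{aligned}
\alpha_{11} &= \frac{a_{11} b_{22} - a_{12} b_{21}}{a_{21} b_{22} - a_{22} b_{21}}, &
\beta_{11} &= \frac{b_{11} b_{22} - b_{12} b_{21}}{a_{21} b_{22} - a_{22} b_{21}}, &
\gamma_{11} &= \frac{b_{21} c_{22} + b_{22} c_{11} - b_{22} c_{21} - b_{21} c_{12}}{a_{21} b_{22} - a_{22} b_{21}}, \\
\alpha_{12} &= \frac{a_{21} a_{12} - a_{11} a_{22}}{a_{21} b_{22} - a_{22} b_{21}}, &
\beta_{12} &= \frac{a_{21} b_{12} - a_{22} b_{11}}{a_{21} b_{22} - a_{22} b_{21}}, &
\gamma_{12} &= \frac{a_{21} c_{12} + a_{22} c_{21} - a_{21} c_{22} - a_{22} c_{11}}{a_{21} b_{22} - a_{22} b_{21}}.
\end{aligned}
$$

The solution space S is the set of points $(x, y)$ satisfying the solution given above. Now the set of inequalities can be written as

$$
\begin{cases}
L_1 \le x_1 \le U_1 \\
L_2 \le y_1 \le U_2 \\
L_1 \le \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11} \le U_1 \\
L_2 \le \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12} \le U_2
\end{cases}
\qquad (3)
$$

where (3) defines a DCH denoted by DCH1.

Another approach is to set $i_2$ to $x_2$ and $j_2$ to $y_2$ and solve for $i_1$ and $j_1$:

$$
\begin{aligned}
a_{11} i_1 + b_{11} j_1 + c_{11} &= a_{21} x_2 + b_{21} y_2 + c_{21} \\
a_{12} i_1 + b_{12} j_1 + c_{12} &= a_{22} x_2 + b_{22} y_2 + c_{22}
\end{aligned}
$$

We have the solution as

$$
\begin{aligned}
i_1 &= \alpha_{21} x_2 + \beta_{21} y_2 + \gamma_{21} \\
j_1 &= \alpha_{22} x_2 + \beta_{22} y_2 + \gamma_{22}
\end{aligned}
$$

where

$$
\begin{aligned}
\alpha_{21} &= \frac{a_{21} b_{12} - a_{22} b_{11}}{a_{11} b_{12} - a_{12} b_{11}}, &
\beta_{21} &= \frac{b_{12} b_{21} - b_{11} b_{22}}{a_{11} b_{12} - a_{12} b_{11}}, &
\gamma_{21} &= \frac{b_{12} c_{21} + b_{11} c_{12} - b_{12} c_{11} - b_{11} c_{22}}{a_{11} b_{12} - a_{12} b_{11}}, \\
\alpha_{22} &= \frac{a_{11} a_{22} - a_{12} a_{21}}{a_{11} b_{12} - a_{12} b_{11}}, &
\beta_{22} &= \frac{a_{11} b_{22} - a_{12} b_{21}}{a_{11} b_{12} - a_{12} b_{11}}, &
\gamma_{22} &= \frac{a_{11} c_{22} + a_{12} c_{11} - a_{11} c_{12} - a_{12} c_{21}}{a_{11} b_{12} - a_{12} b_{11}}.
\end{aligned}
$$

The solution space S is the set of points $(x, y)$ satisfying the solution given above. Now the set of inequalities can be written as

$$
\begin{cases}
L_1 \le \alpha_{21} x_2 + \beta_{21} y_2 + \gamma_{21} \le U_1 \\
L_2 \le \alpha_{22} x_2 + \beta_{22} y_2 + \gamma_{22} \le U_2 \\
L_1 \le x_2 \le U_1 \\
L_2 \le y_2 \le U_2
\end{cases}
\qquad (4)
$$

where (4) defines another DCH, denoted by DCH2.

Both sets of solutions are valid. Each of them has the dependence information on one extreme. For some simple cases, for instance when there is only one kind of dependence, either flow or anti dependence, one set of solutions (i.e., one DCH) should be enough.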
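The first of the two solutions above is an application of Cramer's rule to a 2x2 linear system. The sketch below, offered only as an illustration, computes the six coefficients with exact rational arithmetic; the function name, the tuple layout of the references, and the sample loop A(2i+3, j+1) = ... A(i+2j+1, i+j+3) are assumptions, not part of the original text.

```python
from fractions import Fraction

def read_iteration_coeffs(write_ref, read_ref):
    """Express (i2, j2) as affine functions of (x1, y1) = (i1, j1).

    Each reference is ((a_k1, b_k1, c_k1), (a_k2, b_k2, c_k2)), one triple
    per array dimension. Returns ((alpha11, beta11, gamma11),
    (alpha12, beta12, gamma12)) as exact fractions.
    """
    (a11, b11, c11), (a12, b12, c12) = write_ref
    (a21, b21, c21), (a22, b22, c22) = read_ref
    det = a21 * b22 - a22 * b21
    if det == 0:
        raise ValueError("singular system: the read subscripts are not independent")
    alpha11 = Fraction(a11 * b22 - a12 * b21, det)
    beta11 = Fraction(b11 * b22 - b12 * b21, det)
    gamma11 = Fraction(b21 * c22 + b22 * c11 - b22 * c21 - b21 * c12, det)
    alpha12 = Fraction(a21 * a12 - a11 * a22, det)
    beta12 = Fraction(a21 * b12 - a22 * b11, det)
    gamma12 = Fraction(a21 * c12 + a22 * c21 - a21 * c22 - a22 * c11, det)
    return (alpha11, beta11, gamma11), (alpha12, beta12, gamma12)

# Hypothetical loop: A(2i+3, j+1) = ...   and   ... = A(i+2j+1, i+j+3)
WRITE = ((2, 0, 3), (0, 1, 1))
READ = ((1, 2, 1), (1, 1, 3))
coeffs = read_iteration_coeffs(WRITE, READ)  # i2 = -2x + 2y - 6, j2 = 2x - y + 4
```

Using `Fraction` keeps the coefficients exact, which matters because a candidate point only represents a dependence when the resulting $i_2$ and $j_2$ are integers within the loop bounds.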
Punyamurtula and Chaudhary used one of these two sets of constraints for their technique, while Zaafrani and Ito used the other. For the more complicated cases, where both flow and anti dependences are involved and the dependence patterns are irregular, we need to use both sets of solutions. We introduce a new term, Complete Dependence Convex Hull, to summarize these two DCHs, and we demonstrate that the Complete DCH contains complete information about dependences.

3.2. Complete Dependence Convex Hull (CDCH)

Definition 1. (Complete DCH (CDCH)). The Complete DCH is the union of two closed sets of integer points in the iteration space, which satisfy (3) or (4).

FIGURE 5. CDCH of Example 2

Figure 5 shows the CDCH of Example 2. We use an arrow to represent a dependence in the iteration space. We call the arrow's head the dependence head and the arrow's tail the dependence tail.

Theorem 3.1. All the dependence heads and tails lie within the CDCH. The head and tail of any particular dependence lie in the two DCHs of the CDCH.

Proof. Let us assume that $(i_2, j_2)$ is dependent on $(i_1, j_1)$. In the iteration space graph we can have an arrow from $(i_1, j_1)$ to $(i_2, j_2)$. Here $(i_1, j_1)$ is the arrow tail and $(i_2, j_2)$ is the arrow head. Because of the existing dependence, $(i_1, j_1)$ and $(i_2, j_2)$ must satisfy the system of linear Diophantine equations (1) and the system of linear inequalities (2). There are four unknown variables. We can eliminate two unknown variables by setting $i_1 = x_1$ and $j_1 = y_1$ and solving for $i_2$ and $j_2$. Then $x_1$ and $y_1$ must satisfy (3). Hence $(i_1, j_1)$ lies in the area defined by (3), which is one of the DCHs of the CDCH. In the same way, we set $i_2 = x_2$ and $j_2 = y_2$ and solve for $i_1$ and $j_1$. Here $(i_2, j_2)$ lies in the area defined by (4), which is the other DCH of the CDCH. Therefore, $(i_1, j_1)$ and $(i_2, j_2)$ fall into different DCHs of the CDCH.

If iteration $(i_2, j_2)$ is dependent on $(i_1, j_1)$, then the dependence vector $D(x, y)$ is expressed as

$$
d_i(x, y) = i_2 - i_1, \qquad d_j(x, y) = j_2 - j_1 .
$$

So, for DCH1, we have

$$
\begin{aligned}
d_i(x_1, y_1) &= (\alpha_{11} - 1) x_1 + \beta_{11} y_1 + \gamma_{11} \\
d_j(x_1, y_1) &= \alpha_{12} x_1 + (\beta_{12} - 1) y_1 + \gamma_{12}
\end{aligned}
\qquad (5)
$$

For DCH2, we have

$$
\begin{aligned}
d_i(x_2, y_2) &= (1 - \alpha_{21}) x_2 - \beta_{21} y_2 - \gamma_{21} \\
d_j(x_2, y_2) &= -\alpha_{22} x_2 + (1 - \beta_{22}) y_2 - \gamma_{22}
\end{aligned}
\qquad (6)
$$

Clearly, if there is a solution $(x_1, y_1)$ in DCH1, there must be a solution $(x_2, y_2)$ in DCH2, because they have been solved from the same set of linear Diophantine equations (1).

Given the dependence vectors above, there must exist a minimum and a maximum value of $D(x, y)$. It was shown by Punyamurtula and Chaudhary that the minimum and maximum values of the dependence distance $D(x, y)$ occur at the extreme points of the DCH.

4. UNIQUE SETS IN THE ITERATION SPACE

If a loop has cross-iteration dependences, we can construct its CDCH (comprising DCH1 and DCH2). As we have proved earlier, all dependences lie within the CDCH. In other words, the iterations lying outside the CDCH can be executed in parallel.
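To make the dependence distance over the first DCH concrete, here is a small sketch that enumerates the integer points of that DCH and evaluates the outer-loop distance $d_i = i_2 - i_1$ at each of them. The coefficients are hard-coded for a hypothetical loop A(2i+3, j+1) = ... A(i+2j+1, i+j+3) with bounds 1..12, for which $i_2 = -2x + 2y - 6$ and $j_2 = 2x - y + 4$; these values and all names are assumptions for illustration. The sketch scans every point rather than only the extreme points, which is affordable at this size.

```python
# Assumed coefficients of the hypothetical loop (all integers here, so
# every point that meets the bounds is a genuine integer solution).
ALPHA11, BETA11, GAMMA11 = -2, 2, -6   # i2 = ALPHA11*x + BETA11*y + GAMMA11
ALPHA12, BETA12, GAMMA12 = 2, -1, 4    # j2 = ALPHA12*x + BETA12*y + GAMMA12
LO, HI = 1, 12                         # same bounds for both loop levels

def dch1_points():
    """Integer points (x, y) whose partner iteration (i2, j2) is in bounds."""
    points = []
    for x in range(LO, HI + 1):
        for y in range(LO, HI + 1):
            i2 = ALPHA11 * x + BETA11 * y + GAMMA11
            j2 = ALPHA12 * x + BETA12 * y + GAMMA12
            if LO <= i2 <= HI and LO <= j2 <= HI:
                points.append((x, y))
    return points

def d_i(x, y):
    """Outer-loop component of the dependence distance, i2 - i1."""
    return (ALPHA11 - 1) * x + BETA11 * y + GAMMA11

points = dch1_points()
d_i_values = [d_i(x, y) for x, y in points]
d_i_min, d_i_max = min(d_i_values), max(d_i_values)
# If d_i takes both signs, the line d_i = 0 cuts through this DCH.
crosses_zero = d_i_min < 0 < d_i_max
```

In this assumed example the distance changes sign inside the DCH, so no single minimum dependence distance can serve as a tile width; this is the situation the unique sets are meant to handle.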
Punyamurtula and Chaudhary proposed the concept of minimum dependence distance tiling, which gives an excellent partitioning of the iteration space for the case when $\vec{d}(x, y) = \vec{0}$ does not pass through any DCH. However, the minimum dependence distance cannot be calculated when $\vec{d}(x, y) = \vec{0}$ passes through the DCH. Our technique works well for both cases.

Suppose all dependence tails fall into DCH1 and all dependence heads fall into DCH2 (Figure 6), and the two DCHs do not overlap. The partition can be done by drawing a line between the two DCHs. The area containing the DCH of tails executes first, followed by the area containing the DCH of heads. Figure 6 illustrates this by first executing area 1, followed by area 2. The iterations within each of the two areas are fully parallelizable.

The idea behind the above example is to find separate sets that contain the dependence heads and tails. We want to minimize these sets and then partition the iteration space by drawing lines separating these sets in the iteration space. The execution order is determined by whether a set contains heads or tails. The next problem is how to find the unique sets. The problem is compounded if these sets overlap.

4.1. Unique Head and Unique Tail Sets

There are only two DCHs given the program model in Figure 4. All the dependence heads and tails lie within these two DCHs. These areas are our primitive sets. It is quite possible that one particular set contains both dependence heads and tails. Because of the complexity of the problem, we have to distinguish between flow and anti dependences, and partition the iteration space in a non-uniform way, because the dependence itself is non-uniform.

Let us look at Figure 5, which shows the CDCH of Example 2. We note that DCH1 contains all anti dependence heads and all flow dependence tails. DCH2 contains all the flow dependence heads and anti dependence tails. Figure 7 separates the flow and anti dependences to give a clearer picture. It can be seen that DCH1 is the union of the flow dependence tail set and the anti dependence head set, and DCH2 is the union of the flow dependence head set and the anti dependence tail set.
Hence, the following definition is derived to distinguish the sets.

Definition 2. (Unique Head (Tail) Set). A unique head (tail) set is a set of integer points in the iteration space that satisfies the following conditions:
1. it is a subset of one of the DCHs (or is the DCH itself);
2. it contains all the dependence arrows' heads (tails), but does not contain any other dependence arrows' tails (heads).

Obviously the DCHs in Figure 7 are not the unique sets we are trying to find, because each DCH contains the dependence heads of one kind and the dependence tails of the other kind. Therefore, these DCHs must be further partitioned into smaller unique sets.

FIGURE 6. Partitioning with two non-overlapping DCHs

FIGURE 7. (a) Flow dependence, (b) anti dependence

4.2. Finding Unique Head and Unique Tail Sets

First, the properties of DCH1 and DCH2 must be examined.

Theorem 4.1. DCH1 contains all flow dependence tails and all anti dependence heads (if they exist), and DCH2 contains all anti dependence tails and all flow dependence heads (if they exist).

Proof. The system of inequalities in (3) defines DCH1, and

$$
i_1 = x_1, \quad j_1 = y_1, \quad i_2 = \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}, \quad j_2 = \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12} .
$$

If there exists a flow dependence, we can assume that $(i_1, j_1, i_2, j_2)$ is a solution to the flow dependence. From the definition of flow dependence, $(i_1, j_1)$ should be written somewhere in the iteration space before $(i_2, j_2)$ is referenced. So we can draw an arrow from $(i_1, j_1)$ to $(i_2, j_2)$ in the iteration space to represent the dependence and execution order as $(i_1, j_1) \to (i_2, j_2)$, which is equivalent to $(x_1, y_1) \to (\alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11},\ \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12})$. Here $(x_1, y_1)$ is the arrow tail. Since $(x_1, y_1)$ satisfies (3) and we have assumed that $(i_1, j_1, i_2, j_2)$ is a solution, DCH1 must contain all flow dependence tails.

If there exists an anti dependence, we can again assume that $(i_1, j_1, i_2, j_2)$ is a solution to the anti dependence. From the definition of anti dependence, we have an arrow from $(i_2, j_2)$ to $(i_1, j_1)$, i.e., $(\alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11},\ \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12}) \to (x_1, y_1)$. Since $(x_1, y_1)$ is the arrow's head and $(x_1, y_1)$ satisfies (3), DCH1 contains all anti dependence heads.

The proof that DCH2 contains all anti dependence tails and flow dependence heads (if they exist) is similar to the proof for DCH1.

The above theorem tells us that DCH1 and DCH2 are not unique head or unique tail sets if there are both flow and anti dependences. If there exists only flow or only anti dependence, DCH1 contains either all the flow dependence tails or all the anti dependence heads, and DCH2 contains either all the flow dependence heads or all the anti dependence tails. Under these conditions, both DCH1 and DCH2 are unique sets. The following theorem states the condition for DCH1 and DCH2 to be unique sets.

Theorem 4.2. If $d_i(x, y) = 0$ does not pass through any DCH, then there is only one kind of dependence, either flow or anti dependence, and the DCH itself is the unique head set or the unique tail set.

Proof. [Part 1] $d_i(x_1, y_1)$ corresponds to DCH1 and $d_i(x_2, y_2)$ corresponds to DCH2. Suppose $d_i(x_1, y_1) = 0$ does not pass through DCH1. Since $d_i(x_1, y_1) = i_2 - i_1 = (\alpha_{11} - 1) x_1 + \beta_{11} y_1 + \gamma_{11}$, and an iteration $(x_1, y_1)$ that satisfies (3) must not satisfy $(\alpha_{11} - 1) x_1 + \beta_{11} y_1 + \gamma_{11} = 0$ (note that $d_i(x_1, y_1) = 0$ is a line in the iteration space), DCH1 must be on one side of $d_i(x_1, y_1) = 0$, i.e., either $d_i(x_1, y_1) < 0$ or $d_i(x_1, y_1) > 0$.

First let us look at the case when $d_i(x_1, y_1) < 0$. Then $\alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}$ is always less than $x_1$. Thus, $i_1 > i_2$ is always true. Also, the array element corresponding to index $i_1$ is written and the array element corresponding to index $i_2$ is read. Clearly, only anti dependence can satisfy this condition. Therefore, DCH1 contains only anti dependences.

Next, let us look at the case when $d_i(x_1, y_1) > 0$. Here $i_1 < i_2$. Clearly, only flow dependence can satisfy this condition. Therefore, DCH1 contains only flow dependences.

The proof for DCH2 follows similarly. Thus, if $d_i(x, y) = 0$ does not pass through any DCH, then there is only one kind of dependence.

[Part 2] We have already shown above that if $d_i(x, y) = 0$ does not pass through a DCH, then there is only one kind of dependence. If the dependence is flow dependence, then from the preceding theorem DCH1 contains only the flow dependence tails or anti dependence heads, making DCH1 a unique tail or head set. Similarly, if the dependence is anti dependence, then from the preceding theorem DCH2 contains only the anti dependence tails or flow dependence heads, making DCH2 a unique tail or head set.

DCH1 and DCH2 are constructed from the same system of linear Diophantine equations and system of inequalities. The following two theorems highlight their common attributes.

Theorem 4.3. If $d_i(x_1, y_1) = 0$ does not pass through DCH1, then $d_i(x_2, y_2) = 0$ does not pass through DCH2.

Proof.
If $d_i(x_1, y_1) = 0$ does not pass through DCH1, then DCH1 lies either on the side where $d_i(x_1, y_1) < 0$ or on the side where $d_i(x_1, y_1) > 0$. First let us consider the case when DCH1 is on the side where $d_i(x_1, y_1) < 0$. Since $d_i(x_1, y_1)$ is $i_2 - i_1$, we have $i_2 < i_1$. We can find the same solution $(i_1, j_1, i_2, j_2)$ for DCH2, because the two DCHs are solved from the same set of linear Diophantine equations. $d_i(x_2, y_2)$ is also defined as $i_2 - i_1$. Hence, we get $d_i(x_2, y_2) < 0$, which means that $d_i(x_2, y_2) = 0$ does not pass through DCH2. The second case, when DCH1 is on the side where $d_i(x_1, y_1) > 0$, can be proved similarly.

Corollary 4.4. When $d_i(x_1, y_1) = 0$ does not pass through DCH1,
1. if $d_i(x_1, y_1) > 0$ in DCH1, then DCH1 is the flow dependence unique tail set and DCH2 is the flow dependence unique head set;
2. if $d_i(x_1, y_1) < 0$ in DCH1, then DCH1 is the anti dependence unique head set and DCH2 is the anti dependence unique tail set.

Proof. It follows from the theorems above.

Corollary 4.5. When $d_i(x_1, y_1) = 0$ does not pass through DCH1,
1. if $d_i(x_1, y_1) > 0$ in DCH1, then $d_i(x_2, y_2) > 0$ in DCH2;
2. if $d_i(x_1, y_1) < 0$ in DCH1, then $d_i(x_2, y_2) < 0$ in DCH2.

Proof. It is obvious from the above theorems and the proofs given.

We have now established that if $d_i(x_1, y_1) = 0$ does not pass through DCH1, then both DCH1 and DCH2 are unique sets. When $d_i(x, y) = 0$ passes through the CDCH, a DCH might contain both dependence heads and tails (even if DCH1 and DCH2 do not overlap). This makes it harder to find the unique head and tail sets. The next theorem looks at some common attributes when $d_i(x, y) = 0$ passes through the CDCH.

Theorem 4.6. If $d_i(x_1, y_1) = 0$ passes through DCH1, then $d_i(x_2, y_2) = 0$ must pass through DCH2.

Proof. Suppose $d_i(x_1, y_1) = 0$ passes through DCH1. Then we must be able to find $(x_1', y_1')$ such that $d_i(x_1', y_1') < 0$ and $(x_1'', y_1'')$ such that $d_i(x_1'', y_1'') > 0$ in DCH1. Correspondingly we can find $(x_2', y_2')$ and $(x_2'', y_2'')$ in DCH2 such that $d_i(x_2', y_2') = i_2' - i_1' = d_i(x_1', y_1')$ and $d_i(x_2'', y_2'') = i_2'' - i_1'' = d_i(x_1'', y_1'')$. Therefore, we have $d_i(x_2', y_2') < 0$ and $d_i(x_2'', y_2'') > 0$. Hence, $d_i(x_2, y_2) = 0$ must pass through DCH2.

Using the above theorem we can now deal with the case where a DCH contains all the dependence tails of one kind and all the dependence heads of another kind.

Theorem 4.7.
If $d_i(x, y) = 0$ passes through a DCH, then it will divide that DCH into a unique tail set and a unique head set. Furthermore, $d_j(x, y) = 0$ decides the inclusion of $d_i(x, y) = 0$ in one of the sets.

Proof. The proofs for DCH1 and DCH2 are symmetric. Let us consider the case where $d_i(x_1, y_1) = 0$ passes through DCH1.

First consider flow dependences. Without loss of generality, let $(i_1, j_1)$ and $(i_2, j_2)$ be the iterations which cause any flow dependence. Then $(i_1, j_1)$ and $(i_2, j_2)$ satisfy (1). Thus, from the definition of flow dependence, we have either $i_1 < i_2$, or $i_1 = i_2$ and $j_1 < j_2$. We can now solve (1) with

$$
\begin{aligned}
i_1 &= x_1 \\
j_1 &= y_1 \\
i_2 &= \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11} \\
j_2 &= \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12}
\end{aligned}
\qquad (7)
$$

In the first case, since $x_1 < \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}$, we have $(\alpha_{11} - 1) x_1 + \beta_{11} y_1 + \gamma_{11} = d_i(x_1, y_1) > 0$. In the second case, the above equations give $x_1 = \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}$ and $y_1 < \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12}$, which gives us $d_i(x_1, y_1) = 0$ and $d_j(x_1, y_1) > 0$.

Now let us consider anti dependence. We have either $i_1 > i_2$, or $i_1 = i_2$ and $j_1 > j_2$. In the first case, since $x_1 > \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}$, we have $(\alpha_{11} - 1) x_1 + \beta_{11} y_1 + \gamma_{11} = d_i(x_1, y_1) < 0$. In the second case, the set of equations (7) gives $x_1 = \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11}$ and $y_1 > \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12}$, which gives us $d_i(x_1, y_1) = 0$ and $d_j(x_1, y_1) < 0$.

$d_i(x_1, y_1) = 0$ divides DCH1 into two parts, $d_i(x_1, y_1) > 0$ and $d_i(x_1, y_1) < 0$. Flow dependences satisfy $d_i(x_1, y_1) > 0$. From the earlier theorem on DCH1 and DCH2 we know that these are the flow dependence tails. Whether $d_i(x_1, y_1) = 0$ belongs to this set depends on whether $d_j(x_1, y_1) > 0$ or not. Therefore, $d_j(x_1, y_1)$ decides the flow dependence unique tail set. Similarly, $d_j(x_1, y_1)$ decides the anti dependence unique head set.

Note that if $d_j(x_1, y_1) > 0$, then the line segment corresponding to $d_i(x_1, y_1) = 0$ belongs to the flow dependence unique tail set, and if $d_j(x_1, y_1) < 0$, then the line segment corresponding to $d_i(x_1, y_1) = 0$ belongs to the anti dependence unique head set. The iteration corresponding to the intersection of $d_i(x_1, y_1) = 0$ and $d_j(x_1, y_1) = 0$ has no cross-iteration dependence. If the intersection point of $d_i(x_1, y_1) = 0$ and $d_j(x_1, y_1) = 0$ lies in DCH1, then one segment of the line $d_i(x_1, y_1) = 0$ inside DCH1 is a subset of the flow dependence unique tail set, and the other segment of the line inside DCH1 is a subset of the anti dependence unique head set.

For DCH2, we have similar results. To summarize, the following corollary is derived.

Corollary 4.8.
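The flow dependence unique tail set is expressed by

$$
\begin{cases}
L_1 \le x_1 \le U_1 \\
L_2 \le y_1 \le U_2 \\
L_1 \le \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11} \le U_1 \\
L_2 \le \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12} \le U_2 \\
d_i(x_1, y_1) > 0, \ \text{or} \ d_i(x_1, y_1) = 0 \ \text{and} \ d_j(x_1, y_1) > 0
\end{cases}
$$

The anti dependence unique head set is expressed by

$$
\begin{cases}
L_1 \le x_1 \le U_1 \\
L_2 \le y_1 \le U_2 \\
L_1 \le \alpha_{11} x_1 + \beta_{11} y_1 + \gamma_{11} \le U_1 \\
L_2 \le \alpha_{12} x_1 + \beta_{12} y_1 + \gamma_{12} \le U_2 \\
d_i(x_1, y_1) < 0, \ \text{or} \ d_i(x_1, y_1) = 0 \ \text{and} \ d_j(x_1, y_1) < 0
\end{cases}
$$

The flow dependence unique head set is expressed by

$$
\begin{cases}
L_1 \le \alpha_{21} x_2 + \beta_{21} y_2 + \gamma_{21} \le U_1 \\
L_2 \le \alpha_{22} x_2 + \beta_{22} y_2 + \gamma_{22} \le U_2 \\
L_1 \le x_2 \le U_1 \\
L_2 \le y_2 \le U_2 \\
d_i(x_2, y_2) > 0, \ \text{or} \ d_i(x_2, y_2) = 0 \ \text{and} \ d_j(x_2, y_2) > 0
\end{cases}
$$

The anti dependence unique tail set is expressed by

$$
\begin{cases}
L_1 \le \alpha_{21} x_2 + \beta_{21} y_2 + \gamma_{21} \le U_1 \\
L_2 \le \alpha_{22} x_2 + \beta_{22} y_2 + \gamma_{22} \le U_2 \\
L_1 \le x_2 \le U_1 \\
L_2 \le y_2 \le U_2 \\
d_i(x_2, y_2) < 0, \ \text{or} \ d_i(x_2, y_2) = 0 \ \text{and} \ d_j(x_2, y_2) < 0
\end{cases}
$$

Proof. It follows directly from the preceding theorem.

Corollary 4.9. When $d_i(x_1, y_1) = 0$ passes through DCH1, then
1. DCH1 is the union of the flow dependence unique tail set and the anti dependence unique head set;
2. DCH2 is the union of the flow dependence unique head set and the anti dependence unique tail set.

Proof. It follows from the preceding corollary.

Figure 8 illustrates the application of our results to Example 2. Clearly $d_i(x_1, y_1) = 0$ divides DCH1 into two parts. The area on the left side of $d_i(x_1, y_1) = 0$ is the flow dependence unique tail set and the area on the right side of $d_i(x_1, y_1) = 0$ is the anti dependence unique head set. The line $d_i(x_1, y_1) = 0$ itself belongs to the anti dependence unique head set. $d_i(x_2, y_2) = 0$ divides DCH2 into two parts too. The area below $d_i(x_2, y_2) = 0$ is the flow dependence unique head set and the area above $d_i(x_2, y_2) = 0$ is the anti dependence unique tail set. The line $d_i(x_2, y_2) = 0$ itself belongs to the anti dependence unique tail set.

5. UNIQUE SETS ORIENTED PARTITIONING

In the previous sections we have grouped iterations based on their belonging to unique head or tail sets. Clearly the unique head set will execute after the unique tail set. For our program model, there are at most four sets, i.e., the flow dependence unique tail set, the flow dependence unique head set, the anti dependence unique tail set, and the anti dependence unique head set. The iterations outside these sets can be executed concurrently. Moreover, the iterations within each set can be executed concurrently.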
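The split of the first DCH into a flow dependence unique tail set and an anti dependence unique head set can be sketched as a simple sign test on the two distance components, with the inner-loop distance used as the tie-breaker on the line where the outer-loop distance is zero. The snippet below is an illustrative sketch only: it assumes a hypothetical loop A(2i+3, j+1) = ... A(i+2j+1, i+j+3) with bounds 1..12, for which $i_2 = -2x + 2y - 6$ and $j_2 = 2x - y + 4$, and every name in it is made up for this example.

```python
LO, HI = 1, 12  # assumed bounds for both loop levels

def partner(x, y):
    """Read iteration (i2, j2) paired with write iteration (x, y) (assumed loop)."""
    return (-2 * x + 2 * y - 6, 2 * x - y + 4)

def split_dch1():
    """Partition the first DCH into (flow_tail_set, anti_head_set)."""
    flow_tail, anti_head = [], []
    for x in range(LO, HI + 1):
        for y in range(LO, HI + 1):
            i2, j2 = partner(x, y)
            if not (LO <= i2 <= HI and LO <= j2 <= HI):
                continue  # outside the DCH: no dependence involves (x, y) as a write
            d_i, d_j = i2 - x, j2 - y
            if d_i > 0 or (d_i == 0 and d_j > 0):
                flow_tail.append((x, y))   # write happens before the read
            elif d_i < 0 or (d_i == 0 and d_j < 0):
                anti_head.append((x, y))   # read happens before the write
            # d_i == 0 and d_j == 0: same iteration, no cross-iteration dependence
    return flow_tail, anti_head

flow_tail_set, anti_head_set = split_dch1()
```

For this assumed loop every point on the line where the outer-loop distance vanishes has a negative inner-loop distance, so the line ends up in the anti dependence unique head set, matching the behaviour described for the example in the text.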
In order to maximize the parallelism, we want to partition the iteration space according to unique sets.
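The sign test that places an iteration of DCH1 into one of the unique sets can be sketched as follows. This is only a minimal illustration, assuming the two dependence distance components are affine in (x, y). The names `d_i`, `d_j`, `evaluate` and `classify_dch1_point`, and the coefficient values in the usage note, are hypothetical and are not taken from the paper's example.

```python
def evaluate(affine, x, y):
    """Evaluate a*x + b*y + c for affine = (a, b, c)."""
    a, b, c = affine
    return a * x + b * y + c


def classify_dch1_point(d_i, d_j, x, y):
    """Classify iteration (x, y) of DCH1 by the sign of its distance.

    d_i, d_j: (a, b, c) coefficient triples of the two distance components
              (hypothetical representation, assumed affine).
    A lexicographically positive distance means a flow dependence tail,
    a lexicographically negative one means an anti dependence head.
    """
    di = evaluate(d_i, x, y)
    dj = evaluate(d_j, x, y)
    if di > 0 or (di == 0 and dj > 0):
        return "flow dependence unique tail set"
    if di < 0 or (di == 0 and dj < 0):
        return "anti dependence unique head set"
    # Both components are zero: the iteration depends only on itself.
    return "no cross-iteration dependence"
```

With the made-up distances d_i(x, y) = -x + 2y - 2 and d_j(x, y) = x - 4, the point (1, 3) falls in the flow dependence unique tail set, (6, 2) in the anti dependence unique head set, and (4, 3), where both components vanish, has no cross-iteration dependence.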

FIGURE 8. Unique head sets and unique tail sets of (a) flow dependence, (b) anti dependence.

FIGURE 9. One kind of dependence and DCH1 does not overlap with DCH2.

It is important, however, to note that the effectiveness of a partitioning scheme depends on the architecture of the parallel machine being used. In this paper we do not recommend partitions for particular architectures; rather, we explore the various partitions that can be generated from the available information. The suitability of a particular partition for a specific architecture is not studied. Based on the unique head and tail sets that we can identify, there exist various combinations of overlaps (and/or disjointness) of these unique head and tail sets. We categorize these combinations as various cases, starting from simpler cases and leading up to the more complicated ones.

Case 1: There is only one kind of dependence and DCH1 does not overlap with DCH2.

Figure 9 illustrates this relatively easy case with an example. Any line drawn between DCH1 and DCH2 divides the iteration space into two areas. Inside each area, all iterations are independent. The DCHs in this case are unique head and unique tail sets. The iterations within each DCH can be executed concurrently. However, DCH2 needs to execute before DCH1, as shown by the partitioning in Figure 9. The execution order is given as $1 \rightarrow 2$. From the implementation point of view, it is advisable to partition the iteration space along the i or j axis so that the partitioned areas can be easily represented as a loop. It is also advisable to partition the iteration space as evenly as possible. However, the final decision on partitioning will depend on the underlying architecture.

Case 2: There is only one kind of dependence and DCH1 overlaps with DCH2.
Figure 10 illustrates this case. DCH1 and DCH2 overlap to produce three distinct areas denoted by Area, Area, and Area, respectively. Area and Area are either unique tail or unique head sets and thus iterations within each set can execute concurrently. Area contains both dependence heads and tails. We can apply the Minimum Dependence Distance Tiling technique proposed by Punyamurtula and Chaudhary to Area. Depending on the type of dependence there are two distinct execution orders possible. If DCH2 is a unique tail set, then the execution order is Area $\rightarrow$ Area $\rightarrow$ Area. Otherwise the execution order is Area $\rightarrow$ Area $\rightarrow$ Area. From the implementation point of view, we want to use a straight line to partition the iteration space, so

that the generated code will be much simpler. An example partitioning is shown in Figure 10(b) for the problem in Figure 10(a). The execution order is given as $1 \rightarrow 2 \rightarrow 3 \rightarrow 4$. Another approach to parallelize the iteration space in this case is to apply the Minimum Dependence Distance Tiling technique directly to the entire iteration space.

FIGURE 10. One kind of dependence and DCH1 overlaps with DCH2.

FIGURE 11. Two kinds of dependence and DCH1 does not overlap with DCH2.

Case 3: There are two kinds of dependence and DCH1 does not overlap with DCH2.

Figure 11 illustrates this case. Since DCH1 and DCH2 are disjoint, we can partition the iteration space into two, with DCH1 and DCH2 belonging to distinct partitions. From Theorem we know that $d_i(x, y) = 0$ will divide the DCHs into unique tail and unique head sets. Next, we partition the area within DCH1 by the line $d_i(x_1, y_1) = 0$, and the area within DCH2 by the line $d_i(x_2, y_2) = 0$. So, we have four partitions, each of which is totally parallelizable. Figure 11 gives one possible partition with execution order $1 \rightarrow 2 \rightarrow 3 \rightarrow 4$. Note that the unique head sets must execute after the unique tail sets.

Case 4: There are two kinds of dependence and DCH1 overlaps with DCH2, and there is at least one isolated unique set.

Figure 12(a) and (c) illustrate this case. What we want to do is to separate this isolated unique set from the others. The line $d_i(x, y) = 0$ is the best candidate to do this. If $d_i(x, y) = 0$ does not intersect with any other unique set or another DCH, then it will divide the iteration space into two parts as shown in Figure 12(b). If $d_i(x, y) = 0$ does intersect with other unique sets or another DCH, we can add one edge of the other DCH as the boundary to partition the iteration space into two as shown in Figure 12(d). Let us denote the partition containing the isolated unique set by Area1. The other partition is denoted by Area2.
If Area1 contains a unique tail set, then Area1 must execute before Area2; otherwise Area1 must execute after Area2. The next step is to partition Area2. Since Area2 has only one kind of dependence (as long as we maintain the execution order defined above) and DCH1 overlaps with DCH2, it falls under the category of Case 2 and can be further partitioned.

Case 5: There are two kinds of dependence and all unique sets overlap each other.

Figure 14 illustrates this case. The CDCH can be partitioned into at most eight parts as shown in Figure 14. These partitions are areas that contain

1. only flow dependence tails, and we denote it by Area1.
2. only anti dependence tails, and we denote it by Area2.
3. only anti dependence heads, and we denote it by Area3.
4. only flow dependence heads, and we denote it by Area4.
5. flow dependence tails and anti dependence tails, and we denote it by Area5.
6. flow dependence heads and anti dependence heads, and we denote it by Area6.
7. flow dependence tails and flow dependence heads, and we denote it by Area7.
8. anti dependence tails and anti dependence heads, and we denote it by Area8.

Area1, Area2, and Area5 can be combined together into a larger area, because they contain only the dependence tails. Let us denote this combined area by AreaI. In the same way, Area3, Area4, and Area6 can also be combined together, because they contain only the dependence heads. Let us denote this combined area by AreaII. AreaI and AreaII are fully parallelizable. The execution order becomes AreaI $\rightarrow$ Area7 $\rightarrow$ Area8 $\rightarrow$ AreaII. Since Area7 and Area8 contain both dependence heads and tails, we can apply the Minimum Dependence Distance Tiling technique to parallelize these areas. We may not always have all eight areas in this case. For example, if $d_i(x_1, y_1) = 0$ does not intersect $d_i(x_2, y_2) = 0$ inside the CDCH, then either Area7 or Area8 exists, but not both. However, the proposed partitioning and execution order still hold.

FIGURE 12. Two kinds of dependence and one unique set isolated.

Now let us go back to Example. From Figure, we know that it fits in the category of Case.

FIGURE 13. Partitioning scheme for Example.

The partitioning scheme is shown in Figure 13. There are five areas. All the iterations in each area are fully parallelizable. These areas should be run in the order of $1 \rightarrow 2 \rightarrow 3 \rightarrow 4 \rightarrow 5$. Area is the overlapping area.
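The ordering rule used throughout this section (areas execute one after another, while the iterations inside one area may run concurrently) can be sketched as below. This is only an illustration under stated assumptions: it uses a Python thread pool to stand in for the parallel machine, and the name `run_areas_in_order` and its parameters are hypothetical, not part of the paper or of any Cray tasking interface.

```python
from concurrent.futures import ThreadPoolExecutor


def run_areas_in_order(areas, body, workers=4):
    """Run each area as one parallel phase and return the number of phases.

    areas: list of areas in their required execution order; each area is a
           list of (i, j) iterations assumed to be mutually independent.
    body:  function applied to one (i, j) tuple.
    """
    phases = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for area in areas:
            # Consuming the map iterator blocks until every iteration of
            # this area has finished, which acts as the synchronization
            # point before the next area is started.
            list(pool.map(body, area))
            phases += 1
    return phases
```

The returned count equals the number of synchronization points, which depends only on how many areas the partitioning produces and not on how many iterations each area holds.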
The Minimum Dependence Distance Tiling technique is adopted to partition along the direction with minimum distance of. The parallelized code of Example

is shown below.

    /* area */
    doparallel = ,
      doparallel = ceil( / + ), min(floor( + . ), )
        A( + , + ) =
        = A( + + , + + )
    /* area */
    doparallel = ,
      doparallel = (floor(( )/ + ) + ),
        A( + , + ) =
        = A( + + , + + )
    /* area */
    doparallel = floor(( )/ + ), ceil(x + 7/ )
      doparallel = ,
        A( + , + ) =
        = A( + + , + + )
    /* area */
    doparallel = floor(( )/ + ), ceil(x + 7/ )
      doparallel = 9,
        A( + , + ) =
        = A( + + , + + )
    /* area */
    doparallel = ,
      doparallel = , (ceil( / + ) - )
        A( + , + ) =
        = A( + + , + + )

This partitioning scheme seems to be worse than other techniques at first glance. This is because the loop upper bound is only. As the loop upper bounds increase, this scheme will show its advantage. No matter how large the loop is, it synchronizes only five times. Synchronization overhead is always the major factor that affects the performance.

FIGURE 14. Two kinds of dependence and all unique sets overlapped each other.

6. EXTENSION TO GENERAL NESTED LOOPS

We discussed the parallelization of the two-dimensional program model in the former sections. We now look at loops with n levels of nesting whose indices are $i_1, i_2, \ldots, i_n$. The array subscripts are linear functions of the loop indices, as shown in Figure 15.

    do i_1 = L_1, U_1
      ...
        do i_n = L_n, U_n
          S_1: A[f_1(i_1, ..., i_n), ..., f_m(i_1, ..., i_n)] = ...
          S_2: ... = A[g_1(i_1, ..., i_n), ..., g_m(i_1, ..., i_n)]

FIGURE 15. General Program Model

We want to find a set of integer solutions $(i_1, \ldots, i_n, j_1, \ldots, j_n)$ that satisfy the system of Diophantine equations (8) and the system of linear inequalities (9).

$$\begin{array}{c} f_1(i_1, \ldots, i_n) = g_1(j_1, \ldots, j_n) \\ \vdots \\ f_m(i_1, \ldots, i_n) = g_m(j_1, \ldots, j_n) \end{array} \qquad (8)$$

$$\left\{\begin{array}{l} L_1 \le i_1 \le U_1 \\ L_1 \le j_1 \le U_1 \\ \quad \vdots \\ L_n \le i_n \le U_n \\ L_n \le j_n \le U_n \end{array}\right. \qquad (9)$$

To avoid lengthy repetition, we consider DCH2 as an example to illustrate how to get unique sets. From former sections, we know that DCH2 should contain

the flow dependence unique head set and the anti dependence unique tail set. Using the second approach to solve the set of Diophantine equations, we have integer solutions $(i_1, \ldots, i_n, j_1, \ldots, j_n)$ which are functions of $x_1, \ldots, x_n$. They can be written as:

$$(i_1, \ldots, i_n, j_1, \ldots, j_n) = (s_1(x_1, \ldots, x_n), \ldots, s_n(x_1, \ldots, x_n), s_{n+1}(x_1, \ldots, x_n), \ldots, s_{n+n}(x_1, \ldots, x_n))$$

From the general solution the dependence vector function $D(x_1, \ldots, x_n)$ can be written as

$$D(x_1, \ldots, x_n) = \{ (s_{n+1}(x_1, \ldots, x_n) - s_1(x_1, \ldots, x_n)), \ldots, (s_{n+n}(x_1, \ldots, x_n) - s_n(x_1, \ldots, x_n)) \}$$

Hence the dependence vectors are:

$$\begin{array}{c} d_1(x_1, \ldots, x_n) = s_{n+1}(x_1, \ldots, x_n) - s_1(x_1, \ldots, x_n) \\ \vdots \\ d_n(x_1, \ldots, x_n) = s_{n+n}(x_1, \ldots, x_n) - s_n(x_1, \ldots, x_n) \end{array}$$

The dependence vector $D(x_1, \ldots, x_n)$ divides this DCH into two parts. One is the flow dependence unique head set and the other is the anti dependence unique tail set. The decision on the ownership of $D(x_1, \ldots, x_n) = 0$ comes next. The theorems proposed in Section. are also valid for multi-dimensional loops. $d_1(x_1, \ldots, x_n) > 0$ belongs to the flow dependence unique head set and $d_1(x_1, \ldots, x_n) < 0$ belongs to the anti dependence unique tail set. When $d_1(x_1, \ldots, x_n) = 0$, $d_2(x_1, \ldots, x_n)$ has to be checked. If $d_2(x_1, \ldots, x_n) > 0$, then the flow dependence unique head set contains $d_1(x_1, \ldots, x_n) = 0$ and $d_2(x_1, \ldots, x_n) > 0$. If $d_2(x_1, \ldots, x_n) < 0$, then the anti dependence unique tail set contains $d_1(x_1, \ldots, x_n) = 0$ and $d_2(x_1, \ldots, x_n) < 0$. For $d_2(x_1, \ldots, x_n) = 0$, $d_3(x_1, \ldots, x_n)$ has to be checked. We continue in this fashion until $d_n(x_1, \ldots, x_n)$ is checked.

Using this method, we can get the unique sets for the given general program model. According to the positions of these sets, we can partition the iteration space. During the partitioning, the area containing a unique tail set must be run before the area containing a unique head set. The partitioning process is basically the same as for doubly nested loops, except that we now deal with a multi-dimensional iteration space. The shape of the unique set is also multi-dimensional.
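The component-by-component check described above amounts to testing the sign of the first non-zero entry of the dependence vector. A minimal sketch for one point of DCH2 follows. The function name `classify_dch2_point` is hypothetical, and the treatment of an all-zero vector as having no cross-iteration dependence is an assumption carried over from the two-dimensional case.

```python
def classify_dch2_point(distance):
    """Classify one point of DCH2 from its dependence vector.

    distance: sequence (d_1, ..., d_n) already evaluated at the point.
    The components are inspected in order; the first non-zero one decides.
    """
    for component in distance:
        if component > 0:
            return "flow dependence unique head set"
        if component < 0:
            return "anti dependence unique tail set"
    # Assumption: every component is zero, so the iteration only
    # depends on itself.
    return "no cross-iteration dependence"
```

For instance, the vector (0, 0, 2, -1) is decided by its third component and lands in the flow dependence unique head set, while (0, -3, 5) is decided by its second component and lands in the anti dependence unique tail set.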
An alternative way to parallelize multi-dimensional loops is to parallelize only the two outermost loop nests, leaving inner loops running sequentially. The advantages of one approach over the other are left for future work. However, we feel that multi-dimensional unique set partitioning will give us greater flexibility to transform the loops to adapt to specific architectures.

7. EXPERIMENTAL RESULTS

We present results for two programs. The first program is similar to Example, as shown in Figure 16. We tested the performance for varying loop sizes. The loop sizes (SIZE) used in the experiments are,,, and.

    do = , SIZE
      do = , SIZE
        A( + , + ) =
        = A( + + , + + )

FIGURE 16. Program 1

     1        SUBROUTINE CHOLSKY (IDA, NMAT, M, N, A, NRHS, IDB, B)
     2  C
     3  C     CHOLESKY DECOMPOSITION/SUBSTITUTION SUBROUTINE.
     4  C
     5  C      /  /      D H BAILEY     MODIFIED FOR NAS KERNEL TEST
     6  C
     7        REAL A(0:IDA, -M:0, 0:N), B(0:NRHS, 0:IDB, 0:N), EPSS(0:256)
     8        DATA EPS/1E-13/
     9  C
    10  C     CHOLESKY DECOMPOSITION
    11  C
    12        DO 1 J = 0, N
    13          I0 = MAX ( -M, -J )
    14  C
    15  C     OFF DIAGONAL ELEMENTS
    16  C
    17          DO 2 I = I0, -1
    18            DO 3 JJ = I0 - I, -1
    19              DO 3 L = 0, NMAT
    20      3       A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
    21            DO 2 L = 0, NMAT
    22      2     A(L,I,J) = A(L,I,J) * A(L,0,I+J)
    23  C
    24  C     STORE INVERSE OF DIAGONAL ELEMENTS
    25  C
    26          DO 4 L = 0, NMAT
    27      4   EPSS(L) = EPS * A(L,0,J)
    28          DO 5 JJ = I0, -1
    29            DO 5 L = 0, NMAT
    30      5   A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
    31          DO 1 L = 0, NMAT
    32      1 A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
    33  C
    34  C     SOLUTION
    35  C
    36        DO 6 I = 0, NRHS
    37          DO 7 K = 0, N
    38            DO 8 L = 0, NMAT
    39      8     B(I,L,K) = B(I,L,K) * A(L,0,K)
    40            DO 7 JJ = 1, MIN (M, N-K)
    41              DO 7 L = 0, NMAT
    42      7   B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
    43  C
    44          DO 6 K = N, 0, -1
    45            DO 9 L = 0, NMAT
    46      9     B(I,L,K) = B(I,L,K) * A(L,0,K)
    47            DO 6 JJ = 1, MIN (M, K)
    48              DO 6 L = 0, NMAT
    49      6 B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
    50  C
    51        RETURN
    52        END

FIGURE 17. Program 2

The second program is shown in Figure 17. This is a subroutine taken from a benchmark test program which has been developed for use by the NAS program at NASA Ames Research Center to aid in the evaluation of supercomputers. This subroutine deals with the problem of Cholesky Decomposition and Substitution. We are more interested in the part from line 17 to line.
Non-uniform dependences can be found in this part of the program. To illustrate the impact of non-uniform dependence and to make our experiment more comprehensive, we use the entire subroutine to evaluate the performance of our technique. In fact, the variables N and NMAT decide the program size in this part of the program. When we say the program size is, both N and NMAT are set to. We present results for

program sizes,,, and, respectively.

FIGURE 18. Performance Results for Program 1 on Cray. Panels (a) to (d) plot speedup against the number of CPUs, one panel per value of SIZE, for the Unique Sets method, Chen and Yew's method, Cray's autotasking, Omega project, Zaafrani and Ito's method, and linear speedup.

All the experiments are done on a Cray J9 with processors. The Autotasking Expert System (atexpert) is used to analyze the program. Atexpert is a tool developed by CRI (Cray Research, Inc.) for accurately measuring and graphically displaying tasking performance from a job run on an arbitrarily loaded CRI system. It can predict speedups on a dedicated system from data collected from a single run on a non-dedicated system. It shows where a program is spending most of its time and whether those areas are executed sequentially or in parallel.

User-Directed Tasking directives are used to construct parallelizable areas in the iteration space. Synchronizations are implemented with the help of guarded regions. The format is as below.

    #pragma _CRI parallel defaults
    #pragma _CRI taskloop
        loop
    #pragma _CRI endparallel
    #pragma _CRI guard
        loop or variable
    #pragma _CRI endguard

Our results are compared with those of Chen and Yew's method 9, Cray's native Autotasking, Omega project of University of Maryland, and Zaafrani and Ito's method. Zaafrani and Ito's method is not implemented for Program 2, because it is unable to handle non-perfect nestings of loops.
To implement Chen and Yew's method, guarded regions were used to simulate the function of semaphores. For the method of Omega project, version. of the Omega Project software was used. We ran the source codes through Petit, a research tool developed by University of Maryland. It calls both the Omega library and the Uniform library and generates parallelized C source code. We rewrote the parallelized source codes with Cray's Autotasking directives to do the experiments.

Figure 18 shows the speedup comparison of our technique, Chen and Yew's technique, Cray's autotasking, Omega project, and Zaafrani and Ito's three-region technique. Cray's autotasking did not give any speedup at all, running the loops sequentially. Omega project did not parallelize this program either. This is not so clear in Figure 18, because the speedups of Omega project and those of Cray's autotasking overlap. Both are 1. Our method shows near linear speedup with the loop sizes of and, which are the models closer to real world programs. Our technique consistently and considerably outperforms the other techniques for all sizes. Chen and Yew's gave some speedup, but not too much,

because of the synchronization overhead. Zaafrani and Ito's method showed very little speedup. The sequential region of their method is the bottleneck for good performance. The figure shows that the loop sizes have a tremendous impact on the performance even for the same loop using the same parallelization technique. In practice, we always want to parallelize the loops where programs spend most of their time.

FIGURE 19. Performance Results for Program 2 on Cray. Panels (a) to (d) plot speedup against the number of CPUs, one panel per program size, for the Unique Sets method, Chen and Yew's method, Cray's autotasking, Omega project, and linear speedup.

Figure 19 shows the performance for the Cholesky Decomposition subroutine. From the plots, it is clear that our technique outperforms all the other techniques. As program size increases, our technique shows better results. Cray's Autotasking got some speedup for this routine. It parallelized the innermost loop. This is more like vectorizing than parallelizing. The result of Omega project is worse than that of Cray's autotasking when the program size is, as shown in Figure 19. As the program size increases, it outperforms Cray's autotasking. When the program size is, the performance of Omega project is nearly twice that of Cray's autotasking. The reason is that Cray's autotasking only parallelizes the innermost loops, while Omega project does not. Overall, Chen and Yew's technique performed worst. Again, increased synchronization is responsible for this.

8. CONCLUSION

In this paper, we systematically analyzed the characteristics of the dependences in the iteration space. We proposed the concept of Complete Dependence Convex Hull, which contains the entire dependence information of the program. We also proposed the concepts of Unique head sets and Unique tail sets, which isolated the dependence information and showed the relationship among the dependences. The relationship of the unique head and tail sets forms the foundation for partitioning the iteration space. Depending on the relative placement of these unique sets, various cases were considered. Several partitioning schemes were also suggested for implementing our technique. The suggested scheme was implemented on a Cray J9 and compared with Chen and Yew's method 9, Cray's native Autotasking, Omega project of University of Maryland, and Zaafrani and Ito's method. The implementation results on real benchmark code show that our technique consistently and considerably outperformed all the other techniques.

ACKNOWLEDGMENTS

We would like to thank Sumit Roy for his help in the implementation of the techniques on the Cray J9 and his comments on a preliminary draft of the paper. We
