A Consistent Generation of Pipeline Parallelism and Distribution of Operations and Data among Processors


Programming and Computer Software, 2006, Vol. 32, No. 3, pp. 166–176. Pleiades Publishing, Inc., 2006. Original Russian Text: E.V. Adutskevich, N.A. Likhoded, 2006, published in Programmirovanie, 2006, Vol. 32, No. 3.

E. V. Adutskevich and N. A. Likhoded

Institute of Mathematics, National Academy of Sciences of Belarus, ul. Surganova 11, Minsk, 220072 Belarus
e-mail: zhenya@im.bas-net.by, likhoded@im.bas-net.by

Received April

Abstract. The problem of mapping affine loop nests onto parallel computers with distributed memory is considered. A technique for scheduling an algorithm and distributing its operations and data over processors is proposed. This technique makes it possible to generate pipeline parallelism and to minimize the amount of data exchanged between the processors. The method is suited to automation and explicitly allows for dependence on the outer variables of loops.

DOI: /S

1. INTRODUCTION

The mapping of algorithms given by sequential programs onto parallel computers with distributed memory requires distributing the algorithmic operations and data over processors, as well as setting the order in which operations are executed and data are exchanged. One can mention the following key problems: scheduling [1–3], alignment [3–7], space–time mapping [8–11], and blocking (tiling) [ ]. Scheduling of loop nests is usually meant to involve parallelization, i.e., an equivalent transformation of loop nests such that some loops are executed in parallel. Alignment aims at minimizing the number and amount of data exchanges. The aim of space–time mapping is to distribute operations between the processors and to specify an order of operations that is optimal in terms of some criterion. Blocking is designed to increase the computation granularity and the ratio of the amount of computation to the number of exchanges. Note that an important stage in solving the problems listed above is the search for functions (scheduling functions, allocation functions, graph unfoldings) satisfying certain constraints.
One of the parallelization schemes in common use is based on several scheduling functions that generate pipeline parallelism [ ]. This scheme has the following advantages: high-level parallelism can be generated without explicitly indicating parallel loops, the resulting code is regular, and synchronization is simplified. If some of the scheduling functions are used as allocation functions, the scheme also solves the problems of space–time mapping and blocking. At the same time, this approach requires that the problem of consistent distribution of operations and data among processors be considered separately.

This paper proposes a unified technique for generating pipeline parallelism and solving all the problems listed above. The coordinated solution of these problems makes it possible to obtain scheduling and allocation functions that match each other.

2. BASIC DEFINITIONS

Let the algorithm be given by an affine loop nest of arbitrary nesting structure. For such algorithms, the index expressions of variables and the bounds of loop parameters are affine functions of the loop parameters and outer variables. Let a loop nest contain $K$ operators $S_\beta$ and $L$ arrays $a_l$. Simple variables are treated as zero-dimensional arrays. The range of variation of the loop-nest parameters for an operator $S_\beta$ is called the index domain and is denoted by $V_\beta$. The range of index variation of the $l$th array is denoted by $W_l$. By $n_\beta$ we denote the number of loops enclosing the operator $S_\beta$, and by $\nu_l$ the dimension of the $l$th array; then $V_\beta \subset \mathbb{Z}^{n_\beta}$ and $W_l \subset \mathbb{Z}^{\nu_l}$.

The indices of the elements of the $l$th array used in the operator $S_\beta$ at the $q$th occurrence of this array in the operator are expressed by the affine function

$\bar F_{l,\beta,q}(J) = F_{l,\beta,q} J + G_{l,\beta,q} N + f^{(l,\beta,q)}$,

where $J \in V_\beta$, $F_{l,\beta,q} \in \mathbb{Z}^{\nu_l \times n_\beta}$, $N = (N_1, \ldots, N_e)$ is the vector of outer variables, $e$ is the number of these variables, $G_{l,\beta,q} \in \mathbb{Z}^{\nu_l \times e}$, and $f^{(l,\beta,q)} \in \mathbb{Z}^{\nu_l}$; the matrices $F_{l,\beta,q}$ and $G_{l,\beta,q}$ and the vectors $f^{(l,\beta,q)}$ do not depend on the outer variables.
We call the execution of the operator $S_\beta$ at particular values of $\beta$ and of the vector of loop parameters $J$ an operation and denote it by $S_\beta(J)$. The execution of all operations depending on $J$ is called the $J$th iteration.
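As an illustration of the affine index functions defined above, the following Python sketch evaluates $F J + G N + f$ for one array access. The helper name `affine_index`, the second access `v[i + N, k - 1]`, and the numeric values are hypothetical and serve only as an example; the first access, `a[i, k]` inside a nest with loop parameters $(i, j, k)$, has the same shape as the accesses in the matrix example later in the paper.

```python
def affine_index(F, G, f, J, N):
    """Return F*J + G*N + f as a tuple; F and G are lists of integer rows."""
    return tuple(
        sum(fc * jc for fc, jc in zip(F_row, J))
        + sum(gc * nc for gc, nc in zip(G_row, N))
        + f_const
        for F_row, G_row, f_const in zip(F, G, f)
    )


# Access a[i, k] in a nest with loop parameters J = (i, j, k) and one outer variable N.
F_A, G_A, CONST_A = [[1, 0, 0], [0, 0, 1]], [[0], [0]], [0, 0]

# Hypothetical access v[i + N, k - 1]: the index also depends on the outer variable.
F_V, G_V, CONST_V = [[1, 0, 0], [0, 0, 1]], [[1], [0]], [0, -1]

print(affine_index(F_A, G_A, CONST_A, (2, 5, 7), (10,)))  # (2, 7)
print(affine_index(F_V, G_V, CONST_V, (2, 5, 7), (10,)))  # (12, 6)
```

The matrices stay fixed while $J$ and $N$ vary, which is what lets the later conditions be stated once for all values of the outer variables.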

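The operation $S_\beta(J)$, $J \in V_\beta$, depends on the operation $S_\alpha(I)$, $I \in V_\alpha$, if (1) $S_\alpha(I)$ is executed earlier than $S_\beta(J)$; (2) $S_\alpha(I)$ and $S_\beta(J)$ use one and the same entry of an array, and at least one of these uses is a redefinition (change) of the entry; (3) between $S_\alpha(I)$ and $S_\beta(J)$ this entry is not redefined. The dependence of the operation $S_\beta(J)$ on $S_\alpha(I)$ is denoted by $S_\alpha(I) \to S_\beta(J)$.

Let us denote $P = \{(\alpha, \beta) \mid \exists\, I \in V_\alpha,\ J \in V_\beta:\ S_\alpha(I) \to S_\beta(J)\}$. The set $P$ determines the pairs of dependent operators. For each pair $(\alpha, \beta) \in P$, let us denote $V_{\alpha,\beta} = \{J \in V_\beta \mid \exists\, I \in V_\alpha:\ S_\alpha(I) \to S_\beta(J)\}$. We assume that $V_{\alpha,\beta}$ is a convex polyhedron in the space $\mathbb{Z}^{n_\beta}$. The functions $\bar\Phi_{\alpha,\beta}: V_{\alpha,\beta} \to V_\alpha$ such that, if $S_\alpha(I) \to S_\beta(J)$, $I \in V_\alpha$, $J \in V_{\alpha,\beta} \subset V_\beta$, then $I = \bar\Phi_{\alpha,\beta}(J)$, are called dependence functions. We assume that these are affine functions:

$\bar\Phi_{\alpha,\beta}(J) = \Phi_{\alpha,\beta} J + \Psi_{\alpha,\beta} N - \varphi^{(\alpha,\beta)}$,

where $J \in V_{\alpha,\beta}$, $(\alpha, \beta) \in P$, $N = (N_1, \ldots, N_e)$, $\Phi_{\alpha,\beta} \in \mathbb{Z}^{n_\alpha \times n_\beta}$, $\Psi_{\alpha,\beta} \in \mathbb{Z}^{n_\alpha \times e}$, $\varphi^{(\alpha,\beta)} \in \mathbb{Z}^{n_\alpha}$; the matrices $\Phi_{\alpha,\beta}$ and $\Psi_{\alpha,\beta}$ and the vectors $\varphi^{(\alpha,\beta)}$ do not depend on the outer variables.

Let the functions $t^{(\beta)}: V_\beta \to \mathbb{Z}$, $1 \le \beta \le K$, assign an integer $t^{(\beta)}(J)$ to each operation $S_\beta(J)$ of the algorithm, and let $t^{(\beta)}$ be scheduling functions (t-functions), i.e.,

$t^{(\beta)}(J) - t^{(\alpha)}(\Phi_{\alpha,\beta} J + \Psi_{\alpha,\beta} N - \varphi^{(\alpha,\beta)}) \ge 0$, $J \in V_{\alpha,\beta}$, $(\alpha, \beta) \in P$. (1)

This is called the condition for preserving dependences. It means that if $S_\alpha(I) \to S_\beta(J)$, $I = \Phi_{\alpha,\beta} J + \Psi_{\alpha,\beta} N - \varphi^{(\alpha,\beta)}$, then the operation $S_\alpha(I)$ must be executed either earlier than $S_\beta(J)$ or at the same iteration as $S_\beta(J)$. We assume that $t^{(\beta)}(J)$ are affine functions,

$t^{(\beta)}(J) = \tau^{(\beta)} J + b^{(\beta)} N + a_\beta$,

where $1 \le \beta \le K$, $J \in V_\beta$, $\tau^{(\beta)} \in \mathbb{Z}^{n_\beta}$, $b^{(\beta)} \in \mathbb{Z}^e$, $a_\beta \in \mathbb{Z}$. It is assumed that $\tau^{(\beta)}$, $b^{(\beta)}$, and $a_\beta$ are independent of $N$.

Let the functions $d^{(l)}: W_l \to \mathbb{Z}$, $1 \le l \le L$, associate each entry $a_l(F)$ of the array $a_l$ with an integer $d^{(l)}(F)$. The one-dimensional space into which the functions $d^{(l)}$ map is interpreted as a one-dimensional space of virtual processors. If there are $r$ sets of functions $d^{(l)}_\xi$, $1 \le \xi \le r$, one can consider an $r$-dimensional space of virtual processors. We consider the affine functions

$d^{(l)}_\xi(F) = \eta^{(l,\xi)} F + z^{(l,\xi)} N + y_{l,\xi}$,

where $1 \le l \le L$, $F \in W_l$, $\eta^{(l,\xi)} \in \mathbb{Z}^{\nu_l}$, $z^{(l,\xi)} \in \mathbb{Z}^e$, $y_{l,\xi} \in \mathbb{Z}$. It is assumed that $\eta^{(l,\xi)}$, $z^{(l,\xi)}$, and $y_{l,\xi}$ are independent of $N$.

3. GENERATION OF PIPELINE PARALLELISM BY MEANS OF SCHEDULING FUNCTIONS

Let us denote $n = \max_{1 \le \beta \le K} n_\beta$. Suppose that we have $n$ independent sets of t-functions $t^{(1)}_\xi, \ldots, t^{(K)}_\xi$, $1 \le \xi \le n$. The requirement of independence of the sets is formalized by the condition

$\operatorname{rank} T^{(\beta)} = n_\beta$, $1 \le \beta \le K$, (2)

where $T^{(\beta)}$ is an $n \times n_\beta$ matrix whose rows are the vectors $\tau^{(\beta,\xi)}$ defined in the t-functions $t^{(\beta)}_\xi(J) = \tau^{(\beta,\xi)} J + b^{(\beta,\xi)} N + a_{\beta,\xi}$, $1 \le \xi \le n$. We will use $r$ sets of functions $t^{(1)}_\xi, \ldots, t^{(K)}_\xi$, $1 \le \xi \le r < n$, for the spatial mapping of the algorithm operations into the $r$-dimensional space of virtual processors, and the remaining $n - r$ sets for ordering (in lexicographical order) the computations performed by the processors. For the time unit we take the time needed for the longest iteration of the algorithm and data exchange.

Proposition 1. Let $M$ be the smallest integer, parametrically depending on the outer variables and the variation ranges of the loop-nest parameters, that is greater than any outer variable and any variation range of the loop-nest parameters. Within the proposed parallelization scheme, $O(M^r)$ virtual processors can realize the algorithm in time $O(M^{n-r})$.

Proof. It follows from the definition of the scheduling function that an operation associated with a smaller value of the t-function cannot depend on an operation associated with a larger value. Therefore, if the scheduling functions $t^{(1)}, \ldots, t^{(K)}$ are given, the algorithm can be divided into segments, each segment including the operations that have the same value of the function. These algorithmic segments can be performed sequentially, one after another, in ascending order of the function values.

Let us denote

$m_\xi = \min_{1 \le \beta \le K,\ J \in V_\beta} t^{(\beta)}_\xi(J)$, $M_\xi = \max_{1 \le \beta \le K,\ J \in V_\beta} t^{(\beta)}_\xi(J)$.

Since every $t^{(\beta)}_\xi$ is affine in $J$ and $N$ with coefficients independent of $N$, and every coordinate of $J$ and $N$ is bounded by $M$, we have $M_\xi - m_\xi = O(M)$. To the iterations of the algorithm we can assign points (not necessarily all) of the $n$-dimensional parallelepiped

$C = \{(t_1, \ldots, t_n) \in \mathbb{Z}^n \mid m_\xi \le t_\xi \le M_\xi,\ 1 \le \xi \le n\}$.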
One can specify the following mode of operation for the algorithm: the processor with coordinates $(t_1, \ldots, t_r)$ starts executing operations (possibly dummy ones) at the step $(t_1 - m_1) + \ldots + (t_r - m_r) + 1$; each processor executes its assigned iterations $(t_1, \ldots, t_r, t_{r+1}, \ldots, t_n)$, $m_{r+1} \le t_{r+1} \le M_{r+1}$, ..., $m_n \le t_n \le M_n$, in lexicographical order. This order of execution is consistent: all operations on which a given operation depends have been executed at earlier iterations. The last processor to start, at the step $(M_1 - m_1) + \ldots + (M_r - m_r) + 1$, is the processor $(M_1, \ldots, M_r)$. The time needed to perform the assigned iterations is not greater than $(M_{r+1} - m_{r+1}) \cdots (M_n - m_n)$. The above reasoning yields the following estimate for the computation time:

$T \le \sum_{\xi=1}^{r} (M_\xi - m_\xi) + \prod_{i=r+1}^{n} (M_i - m_i) = O(M^{n-r})$.

This completes the proof of the statement.

Thus, the parallelization scheme described yields the best (by order of magnitude) time for the algorithm realization in the $r$-dimensional space of virtual processors. It follows from the proof that the data exchange between any two processors is reduced to data transfer in one direction, always from one and the same processor to the other. The resulting method of data processing can be regarded as a generalization of classical pipeline processing.

4. STATEMENT OF THE PROBLEM

Let $t^{(1)}_\xi, \ldots, t^{(K)}_\xi$, $1 \le \xi \le n$, be $n$ independent sets of scheduling functions, and let $d^{(1)}_\xi, \ldots, d^{(L)}_\xi$, $1 \le \xi \le r$, be $r$ sets of functions of data allocation between processors. Let us determine the constraints that should be imposed on the functions $t^{(\beta)}_\xi$ and $d^{(l)}_\xi$.

It follows from (1) that, for all $\xi$, $1 \le \xi \le n$, the following inequality should be satisfied:

$\tau^{(\beta,\xi)} J + b^{(\beta,\xi)} N + a_{\beta,\xi} \ge \tau^{(\alpha,\xi)} (\Phi_{\alpha,\beta} J + \Psi_{\alpha,\beta} N - \varphi^{(\alpha,\beta)}) + b^{(\alpha,\xi)} N + a_{\alpha,\xi}$,

and the conditions for preserving dependences can be written as

$(\tau^{(\beta,\xi)} - \tau^{(\alpha,\xi)} \Phi_{\alpha,\beta}) J + (b^{(\beta,\xi)} - \tau^{(\alpha,\xi)} \Psi_{\alpha,\beta} - b^{(\alpha,\xi)}) N + \tau^{(\alpha,\xi)} \varphi^{(\alpha,\beta)} + a_{\beta,\xi} - a_{\alpha,\xi} \ge 0$, $J \in V_{\alpha,\beta}$, $(\alpha, \beta) \in P$, $1 \le \xi \le n$. (3)

The operation $S_\beta(J)$ is assigned for execution to the processor $(t^{(\beta)}_1(J), \ldots, t^{(\beta)}_r(J))$. The entry $a_l(\bar F_{l,\beta,q}(J))$ of the array $a_l$ used for implementing the operation $S_\beta(J)$ is allocated for storage to the processor $(d^{(l)}_1(\bar F_{l,\beta,q}(J)), \ldots, d^{(l)}_r(\bar F_{l,\beta,q}(J)))$.
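The mode of operation used in the proof of Proposition 1 can be illustrated by a small timing sketch. The Python code below is a minimal illustration under stated assumptions, not part of the original method: the function name `pipeline_schedule` and the bounds in the usage lines are hypothetical, and the work per processor is counted as the number of integer points $M_i - m_i + 1$ along each temporal coordinate (the text bounds it by $M_i - m_i$, which has the same order of magnitude).

```python
from itertools import product


def pipeline_schedule(m, M, r):
    """Map each virtual processor (t_1..t_r) to its (first_step, last_step).

    m, M -- lower and upper bounds of the n t-function values;
    r    -- number of leading coordinates used as processor coordinates.
    """
    n = len(m)
    work = 1
    for i in range(r, n):
        work *= M[i] - m[i] + 1  # iterations assigned to every processor
    schedule = {}
    for proc in product(*(range(m[x], M[x] + 1) for x in range(r))):
        # start rule from the proof: (t_1 - m_1) + ... + (t_r - m_r) + 1
        start = sum(proc[x] - m[x] for x in range(r)) + 1
        schedule[proc] = (start, start + work - 1)
    return schedule


sched = pipeline_schedule(m=(1, 1, 1), M=(4, 4, 4), r=2)
total_time = max(last for _, last in sched.values())
print(len(sched), sched[(1, 1)], sched[(4, 4)], total_time)
```

With $n = 3$ and $r = 2$, the number of processors grows as $M^2$ while the finishing step grows linearly in $M$, in line with the $O(M^r)$ and $O(M^{n-r})$ estimates.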
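Let us consider the values

$\delta^{(l,\beta,q)}_\xi(J) = t^{(\beta)}_\xi(J) - d^{(l)}_\xi(\bar F_{l,\beta,q}(J))$, $1 \le \xi \le r$,

which characterize (for fixed $l$, $\beta$, $q$, and $J$) the distance between the processor that executes the operation and the processor that stores the array entry required for the execution. We have

$\delta^{(l,\beta,q)}_\xi(J) = \tau^{(\beta,\xi)} J + b^{(\beta,\xi)} N + a_{\beta,\xi} - (\eta^{(l,\xi)} \bar F_{l,\beta,q}(J) + z^{(l,\xi)} N + y_{l,\xi})$
$= \tau^{(\beta,\xi)} J + b^{(\beta,\xi)} N + a_{\beta,\xi} - \eta^{(l,\xi)} (F_{l,\beta,q} J + G_{l,\beta,q} N + f^{(l,\beta,q)}) - z^{(l,\xi)} N - y_{l,\xi}$
$= (\tau^{(\beta,\xi)} - \eta^{(l,\xi)} F_{l,\beta,q}) J + (b^{(\beta,\xi)} - \eta^{(l,\xi)} G_{l,\beta,q} - z^{(l,\xi)}) N + a_{\beta,\xi} - \eta^{(l,\xi)} f^{(l,\beta,q)} - y_{l,\xi}$.

This yields the conditions for a distribution of operations and data over processors that requires no data exchange for any $J$ and $N$:

$\tau^{(\beta,\xi)} - \eta^{(l,\xi)} F_{l,\beta,q} = 0$, (4)
$b^{(\beta,\xi)} - \eta^{(l,\xi)} G_{l,\beta,q} - z^{(l,\xi)} = 0$, (5)
$a_{\beta,\xi} - \eta^{(l,\xi)} f^{(l,\beta,q)} - y_{l,\xi} = 0$. (6)

If conditions (4) and (5) are satisfied and (6) does not hold, the corresponding distribution of operations and data over processors requires only local (i.e., not depending on $J$ and $N$) communications.

Thus, it is necessary to obtain $n$ independent sets of functions $t^{(\beta)}_\xi$ and $r$ sets of functions $d^{(l)}_\xi$ such that (1) for all $n$ sets of $t^{(\beta)}_\xi$, conditions (3) are satisfied; (2) for the $r$ sets of $t^{(\beta)}_\xi$ and $d^{(l)}_\xi$, conditions (4)–(6) are satisfied for as many numbers $l$, $\beta$, and $q$ as possible.

Let us introduce the following notation:

$x^{(\xi)} = (\tau^{(1,\xi)}, \ldots, \tau^{(K,\xi)}, \eta^{(1,\xi)}, \ldots, \eta^{(L,\xi)}, b^{(1,\xi)}, \ldots, b^{(K,\xi)}, z^{(1,\xi)}, \ldots, z^{(L,\xi)}, a_{1,\xi}, \ldots, a_{K,\xi}, y_{1,\xi}, \ldots, y_{L,\xi})$ is a $\sigma$-dimensional vector composed of the parameters of the functions $t^{(\beta)}_\xi$ and $d^{(l)}_\xi$;

$0^{i \times j}$ is the null matrix of size $i \times j$; $E^{(i)}$ is the identity matrix of order $i$; $0^{(i)}$ is the null column vector of dimension $i$; $e^{(i)}_j$ is the vector of dimension $i$ with its $j$th coordinate equal to 1 and the remaining coordinates equal to zero;

$\sigma_0 = 0$; $\sigma_j = \sum_{i=1}^{j} n_i$, $1 \le j \le K$; $\sigma_{K+j} = \sigma_K + \sum_{i=1}^{j} \nu_i$, $1 \le j \le L$; $\sigma = \sigma_{K+L} + (K + L)e + K + L$;

$\tilde\Phi_{\alpha,\beta}$ is the matrix of size $\sigma \times n_\beta$ whose only nonzero blocks are $E^{(n_\beta)}$ in the rows of $\tau^{(\beta,\xi)}$ (rows $\sigma_{\beta-1} + 1, \ldots, \sigma_\beta$) and $-\Phi_{\alpha,\beta}$ in the rows of $\tau^{(\alpha,\xi)}$ (rows $\sigma_{\alpha-1} + 1, \ldots, \sigma_\alpha$), so that $x^{(\xi)} \tilde\Phi_{\alpha,\beta} = \tau^{(\beta,\xi)} - \tau^{(\alpha,\xi)} \Phi_{\alpha,\beta}$;

$\tilde\Psi_{\alpha,\beta}$ is the matrix of size $\sigma \times e$ whose only nonzero blocks are $-\Psi_{\alpha,\beta}$ in the rows of $\tau^{(\alpha,\xi)}$, $-E^{(e)}$ in the rows of $b^{(\alpha,\xi)}$ (rows $\sigma_{K+L} + (\alpha - 1)e + 1, \ldots, \sigma_{K+L} + \alpha e$), and $E^{(e)}$ in the rows of $b^{(\beta,\xi)}$, so that $x^{(\xi)} \tilde\Psi_{\alpha,\beta} = b^{(\beta,\xi)} - \tau^{(\alpha,\xi)} \Psi_{\alpha,\beta} - b^{(\alpha,\xi)}$;

$\tilde\varphi^{(\alpha,\beta)}$ is the vector of dimension $\sigma$ that contains $\varphi^{(\alpha,\beta)}$ in the rows of $\tau^{(\alpha,\xi)}$ and zeros elsewhere, plus $e^{(\sigma)}_{\sigma-K-L+\beta} - e^{(\sigma)}_{\sigma-K-L+\alpha}$, so that $x^{(\xi)} \tilde\varphi^{(\alpha,\beta)} = \tau^{(\alpha,\xi)} \varphi^{(\alpha,\beta)} + a_{\beta,\xi} - a_{\alpha,\xi}$;

$\tilde F_{l,\beta,q}$ is the matrix of size $\sigma \times n_\beta$ whose only nonzero blocks are $E^{(n_\beta)}$ in the rows of $\tau^{(\beta,\xi)}$ and $-F_{l,\beta,q}$ in the rows of $\eta^{(l,\xi)}$ (rows $\sigma_{K+l-1} + 1, \ldots, \sigma_{K+l}$);

$\tilde G_{l,\beta,q}$ is the matrix of size $\sigma \times e$ whose only nonzero blocks are $E^{(e)}$ in the rows of $b^{(\beta,\xi)}$, $-G_{l,\beta,q}$ in the rows of $\eta^{(l,\xi)}$, and $-E^{(e)}$ in the rows of $z^{(l,\xi)}$ (rows $\sigma_{K+L} + (K + l - 1)e + 1, \ldots, \sigma_{K+L} + (K + l)e$);

$\tilde f^{(l,\beta,q)}$ is the vector of dimension $\sigma$ that contains $-f^{(l,\beta,q)}$ in the rows of $\eta^{(l,\xi)}$ and zeros elsewhere, plus $e^{(\sigma)}_{\sigma-K-L+\beta} - e^{(\sigma)}_{\sigma-L+l}$.

Where two blocks fall into the same rows (for example, when $\alpha = \beta$), they are added. Using the notation introduced, conditions (3)–(6) can be written in the form

$x^{(\xi)} (\tilde\Phi_{\alpha,\beta} J + \tilde\Psi_{\alpha,\beta} N + \tilde\varphi^{(\alpha,\beta)}) \ge 0$, $J \in V_{\alpha,\beta}$, $(\alpha, \beta) \in P$, (7)
$x^{(\xi)} \tilde F_{l,\beta,q} = 0$, (8)
$x^{(\xi)} \tilde G_{l,\beta,q} = 0$, (9)
$x^{(\xi)} \tilde f^{(l,\beta,q)} = 0$. (10)

The aim of this paper is to develop a method for obtaining vectors $x^{(\xi)}$, $1 \le \xi \le n$, such that conditions (2) hold, the vectors $x^{(1)}, \ldots, x^{(n)}$ satisfy conditions (7), and the vectors $x^{(1)}, \ldots, x^{(r)}$ satisfy conditions (8)–(10) for as many numbers $l$, $\beta$, and $q$ as possible.

5. NECESSARY AND SUFFICIENT CONDITIONS FOR PRESERVING THE DEPENDENCES

The number of inequalities in the conditions for preserving the dependences (7) is large and depends on the number of points $J$ in the sets $V_{\alpha,\beta}$, as well as on the number of possible values of the vector of outer variables $N$. The cases where $x^{(\xi)} \tilde\Phi_{\alpha,\beta} = 0$ and $x^{(\xi)} \tilde\Psi_{\alpha,\beta} = 0$ are exceptions; in particular, this is the case when $\tau^{(\alpha,\xi)} = \tau^{(\beta,\xi)}$, $\Phi_{\alpha,\beta}$ is the identity matrix, $b^{(\alpha,\xi)} = b^{(\beta,\xi)}$, and $\Psi_{\alpha,\beta}$ is the null matrix (the dependence function does not depend on the outer variables). For conditions (7) to be applicable in practice, the number of inequalities should be reduced. Let us obtain necessary and sufficient conditions for constraints (7) to hold.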
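Conditions (4)–(6) of Section 4 can be checked mechanically for one processor coordinate and one array access. The Python sketch below is a hedged illustration: the function name `communication_kind` and the coefficient values are chosen for illustration only, and the access `c[i - 1, j]` is hypothetical. It computes the three coefficient groups of the distance $\delta$ and reports whether the access needs no exchange, a local exchange, or a nonlocal one.

```python
def communication_kind(tau, b, a, eta, z, y, F, G, f):
    """Classify one access along one processor coordinate.

    Returns "none" if (4)-(6) hold, "local" if only (4) and (5) hold,
    and "nonlocal" otherwise.
    """
    n_loops, n_outer, dims = len(tau), len(b), range(len(eta))
    # coefficient of J: tau - eta * F, condition (4)
    coef_J = [tau[c] - sum(eta[k] * F[k][c] for k in dims) for c in range(n_loops)]
    # coefficient of N: b - eta * G - z, condition (5)
    coef_N = [b[c] - sum(eta[k] * G[k][c] for k in dims) - z[c] for c in range(n_outer)]
    # constant term: a - eta * f - y, condition (6)
    const = a - sum(eta[k] * f[k] for k in dims) - y
    if any(coef_J) or any(coef_N):
        return "nonlocal"
    return "none" if const == 0 else "local"


# Statement c[i,j] = c[i,j] + a[i,k]*b[k,j], J = (i, j, k), scheduled by t(J) = i.
tau, b, a = (1, 0, 0), (0,), 0
no_G = [[0], [0]]

# c[i, j] stored on processor i: no exchange along this coordinate.
print(communication_kind(tau, b, a, (1, 0), (0,), 0, [[1, 0, 0], [0, 1, 0]], no_G, [0, 0]))
# b[k, j] stored on processor 0: the distance depends on i, hence nonlocal.
print(communication_kind(tau, b, a, (0, 0), (0,), 0, [[0, 0, 1], [0, 1, 0]], no_G, [0, 0]))
# Hypothetical access c[i - 1, j]: constant distance 1, hence local.
print(communication_kind(tau, b, a, (1, 0), (0,), 0, [[1, 0, 0], [0, 1, 0]], no_G, [-1, 0]))
```

The three calls print `none`, `nonlocal`, and `local`, matching the three cases distinguished after condition (6).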
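First, let us consider necessary and sufficient conditions for fulfilling the auxiliary inequalities

$x w + u \ge 0$, $w \in V \subset \mathbb{Z}^h$, (11)

where $h$ is a positive integer, $x \in \mathbb{Z}^h$ and $u \in \mathbb{Z}$ are fixed,

$V = \{w \in \mathbb{Z}^h \mid w_i \le w_j + q_{i,j},\ (i, j) \in K_E;\ w_i \ge g_i,\ i \in \bar K_E\}$, (12)

$K_E$ is a set of index pairs, $\bar K_E$ is the set of indices not occurring in the pairs of the set $K_E$ as the second element, and $g = (g_1, \ldots, g_h)$ is a vector from the domain $V$ satisfying the condition $g_j = g_i - q_{i,j}$, $(i, j) \in K_E$.

Let $m$ be the number of different indices entering into pairs of the set $K_E$. Let us consider a directed graph $G_m$ containing $m$ nodes $v_i$ such that each inequality $w_i \le w_j + q_{i,j}$, $(i, j) \in K_E$, is matched by an arc $(v_j, v_i)$. We assume that the graph $G_m$ has the following properties. There is no more than one path from one node to another. Each connected component of the graph $G_m$ contains not more than one node that has several outgoing arcs. If a node $v_{i_0}$ has more than one outgoing arc $(v_{i_0}, v_{i_j})$, $1 \le j \le l$, then all nodes $v_{i_j}$ are pendant (have no outgoing arcs) and have a single incoming arc each.

Let us consider the system of inequalities $x_k \ge 0$, $1 \le k \le h$. We call this system basic. Let us enumerate all nodes of the graph $G_m$ and transform the basic system of inequalities using the following rules.

(1) If the node $v_{i_0}$ has paths from the nodes $v_{i_1}, \ldots, v_{i_p}$, then replace the inequality $x_{i_0} \ge 0$ by the inequality $x_{i_0} + \sum_{j=1}^{p} x_{i_j} \ge 0$.

(2) If the node $v_{i_0}$ has arcs connecting it with the nodes $v_{i_{p+1}}, \ldots, v_{i_{p+l}}$, $l \ge 2$, then add the set of inequalities $x_{i_0} + \sum_{j=1}^{p} x_{i_j} + \sum_{s} x_{i_s} \ge 0$, in which the last sum is taken over one of the subsets of the set $\{i_{p+1}, i_{p+2}, \ldots, i_{p+l}\}$ containing more than one element (the number of inequalities is equal to the number of such subsets).

Rule (1) should be applied first, and then rule (2). The system of inequalities obtained from the basic system by rules (1) and (2) is called the transformed basic system of inequalities. Let us present without proof (which is rather cumbersome) the following lemma.

Lemma. For conditions (11) to hold for all $w \in V$, it is necessary and sufficient that the condition $x g + u \ge 0$ and the conditions given by the transformed basic system of inequalities be satisfied.

Let us write condition (7) in the form

$x^{(\xi)} (\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta}) (J, N)^T + x^{(\xi)} \tilde\varphi^{(\alpha,\beta)} \ge 0$, $J \in V_{\alpha,\beta}$, $(\alpha, \beta) \in P$, (13)

where $(\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})$ is a matrix of size $\sigma \times (n_\beta + e)$ composed of the matrices $\tilde\Phi_{\alpha,\beta}$ and $\tilde\Psi_{\alpha,\beta}$, and $(J, N)$ is a vector of dimension $n_\beta + e$ composed of the vector $J$ of loop parameters and the vector $N$ of outer variables. For each pair $(\alpha, \beta) \in P$, we take $x = x^{(\xi)} (\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})$, $w = (J, N)$, $J \in V_{\alpha,\beta}$, $u = x^{(\xi)} \tilde\varphi^{(\alpha,\beta)}$, $h = n_\beta + e$, and $V = \{(J, N) \mid J \in V_{\alpha,\beta}\}$. Let the domain $V$ have the form given by (12). We denote the corresponding vector $g$ by $g^{(\alpha,\beta)}$ and the graph $G_m$ by $G_m^{(\alpha,\beta)}$. The basic system of inequalities is composed of the inequalities $x^{(\xi)} (\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})_k \ge 0$, $1 \le k \le n_\beta + e$, where $(\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})_k$ denotes the $k$th column of the matrix $(\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})$. Let us consider the matrix $D_{\alpha,\beta}$; its columns are the nonzero vectors $(\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta}) g^{(\alpha,\beta)} + \tilde\varphi^{(\alpha,\beta)}$ and the combinations of columns $(\tilde\Phi_{\alpha,\beta}\ \tilde\Psi_{\alpha,\beta})_k$ that enter into the transformed basic system of inequalities. Let us take advantage of the lemma to formulate the following statement.

Proposition 2. For the conditions for preserving dependences to be satisfied for a given pair $(\alpha, \beta) \in P$ for all admissible values of the outer variables, it is necessary and sufficient that the conditions

$x^{(\xi)} D_{\alpha,\beta} \ge 0$ (14)

hold.

Remark. The assumptions on the form of the domains $V_{\alpha,\beta}$ in Proposition 2 are normally fulfilled in practice. Otherwise, one should use more general conditions for preserving dependences (see [3, Section 7.1]).

6. THE PROCEDURE OF GENERATING SCHEDULING FUNCTIONS AND FUNCTIONS OF ALLOCATION OF OPERATIONS AMONG PROCESSORS

Let us introduce the following notation: $T^{(\beta)}_{1:\xi} \in \mathbb{Z}^{\xi \times n_\beta}$ is the matrix whose rows are the vectors $\tau^{(\beta,i)}$, $1 \le i \le \xi$, $\xi \le n$ (the matrix $T^{(\beta)}_{1:0}$ has no rows, and its rank is taken to be 0); $S^{(\beta)} = \{s \in \mathbb{Z}^{n_\beta} \mid \tau^{(\beta,i)} s = 0,\ 1 \le i \le \xi - 1,\ s \ne 0^{(n_\beta)}\}$.

Proposition 3. Let $\operatorname{rank} T^{(\beta)}_{1:\xi-1} = r_\beta$, $r_\beta < n_\beta$, and $s^{(\beta)} \in S^{(\beta)}$. If

$\tau^{(\beta,\xi)} s^{(\beta)} \ne 0$, (15)

then $\operatorname{rank} T^{(\beta)}_{1:\xi} = r_\beta + 1$.

Proof. Let the assumptions of the proposition hold and suppose that $\operatorname{rank} T^{(\beta)}_{1:\xi} = r_\beta$. Then the vector $\tau^{(\beta,\xi)}$ can be represented as a linear combination of the vectors $\tau^{(\beta,i)}$, $1 \le i \le \xi - 1$; i.e., $\tau^{(\beta,\xi)} = \sum_{i=1}^{\xi-1} \lambda_i \tau^{(\beta,i)}$. It follows that $\tau^{(\beta,\xi)} s^{(\beta)} = \sum_{i=1}^{\xi-1} \lambda_i \tau^{(\beta,i)} s^{(\beta)} = 0$, which contradicts (15). The contradiction obtained implies that $\operatorname{rank} T^{(\beta)}_{1:\xi} = r_\beta + 1$. This completes the proof of the statement.

Condition (15) can be used for finding $n$ vectors $x^{(\xi)}$ determining $n$ independent (i.e., satisfying condition (2)) sets of t-functions. To do this, in the consecutive search for the vectors $x^{(\xi)}$, $\xi = 1, \ldots, n$, care must be taken to make sure that condition (15) holds for those $\beta$ for which $n - \xi + 1 = n_\beta - \operatorname{rank} T^{(\beta)}_{1:\xi-1}$.

If we introduce additional vector variables $z^{(\alpha,\beta)}$, inequalities (14) can be reduced to the equations

$x^{(\xi)} D_{\alpha,\beta} - z^{(\alpha,\beta)} = 0$, $z^{(\alpha,\beta)} \ge 0$. (16)

According to the problem statement, conditions (16) must be satisfied for all $n$ vectors $x^{(\xi)}$. In addition, the smaller the coordinates of $z^{(\alpha,\beta)}$, the smaller the differences $t^{(\beta)}_\xi(J) - t^{(\alpha)}_\xi(I)$, $J \in V_{\alpha,\beta}$, $I = \bar\Phi_{\alpha,\beta}(J)$, and, potentially, the smaller the time of the algorithm realization. Thus, the conditions for preserving the dependences and the problem of reducing the time of algorithm realization can be reduced to the minimization (nullification, if possible) of the vectors $z^{(\alpha,\beta)}$ satisfying conditions (16).

The problem of allocating operations and data among the processors with only local communications can be reduced (see (8)–(10)) to the problem of nullification of the vectors $z^F_{l,\beta,q}$ and $z^G_{l,\beta,q}$ satisfying the conditions

$|x^{(\xi)} \tilde F_{l,\beta,q}| - z^F_{l,\beta,q} = 0$, (17)
$|x^{(\xi)} \tilde G_{l,\beta,q}| - z^G_{l,\beta,q} = 0$, (18)

where the notation $|v|$ means the vector composed of the absolute values of the entries of the vector $v$. The problem of allocating operations and data among the processors without data exchanges can be reduced to the nullification of the values $z^f_{l,\beta,q}$ satisfying the conditions

$|x^{(\xi)} \tilde f^{(l,\beta,q)}| - z^f_{l,\beta,q} = 0$, (19)

if we manage to fulfill the equalities $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$.

It follows that the problem of consistent generation of pipeline parallelism and distribution of operations and data among the processors can be reduced to the problem of searching for $n$ vectors $x^{(\xi)}$ based on the minimization of the vectors $z^{(\alpha,\beta)}$; the nullification of the vectors $z^F_{l,\beta,q}$ and $z^G_{l,\beta,q}$ for as many numbers $l$, $\beta$, and $q$ as possible; and the nullification of the values $z^f_{l,\beta,q}$ (if the vectors $z^F_{l,\beta,q}$ and $z^G_{l,\beta,q}$ have been nullified), subject to constraints (16)–(19) and to constraint (15) for $\beta$ satisfying the condition $n - \xi + 1 = n_\beta - \operatorname{rank} T^{(\beta)}_{1:\xi-1}$.

Let us introduce the following notation: $\mathcal{D}$ is the set of matrices $D_{\alpha,\beta}$; $\mathcal{D}^F$ and $\mathcal{D}^G$ are the sets of matrices $\tilde F_{l,\beta,q}$ and $\tilde G_{l,\beta,q}$, respectively, entering into the conditions of data distribution among the processors requiring only local communications; $\mathcal{D}^f$ is the set of vectors $\tilde f^{(l,\beta,q)}$ entering into the conditions of data distribution between the processors without data exchange;

$L^{(\xi)} = \{\beta \mid n - \xi + 1 = n_\beta - \operatorname{rank} T^{(\beta)}_{1:\xi-1}\}$;

$\rho = \sum_{\alpha,\beta} \lambda^{(\alpha,\beta)} z^{(\alpha,\beta)} + \sum_{l,\beta,q} (\lambda^F_{l,\beta,q} z^F_{l,\beta,q} + \lambda^G_{l,\beta,q} z^G_{l,\beta,q} + \lambda^f_{l,\beta,q} z^f_{l,\beta,q})$,

where the first sum is taken over all $\alpha$ and $\beta$ such that $D_{\alpha,\beta} \in \mathcal{D}$, and the second sum is taken over all $l$, $\beta$, and $q$ such that $\tilde F_{l,\beta,q} \in \mathcal{D}^F$, $\tilde G_{l,\beta,q} \in \mathcal{D}^G$, $\tilde f^{(l,\beta,q)} \in \mathcal{D}^f$; $\lambda^{(\alpha,\beta)}$, $\lambda^F_{l,\beta,q}$, $\lambda^G_{l,\beta,q}$, and $\lambda^f_{l,\beta,q}$ are the vectors of weighting coefficients.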
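The matrices $D_{\alpha,\beta}$ that enter the objective rest on the lemma of Section 5, which replaces infinitely many inequalities by a finite set. The Python sketch below checks this numerically for one particular domain, $V = \{(i, j, k, N) \mid 1 \le i, j, k \le N\}$, the domain that arises for the pair of loop nests in the matrix example. It is an illustration only: the function names are hypothetical, the finite set of directions is written out by hand for this domain (the unit vector of $N$ plus every subset of $i$, $j$, $k$), and the brute-force check can only examine values of $N$ up to a chosen bound.

```python
from itertools import combinations, product


def dot(x, w):
    return sum(xc * wc for xc, wc in zip(x, w))


def finite_check(x, u):
    """Finite test for x*w + u >= 0 on V = {(i,j,k,N): 1 <= i,j,k <= N}."""
    if dot(x, (1, 1, 1, 1)) + u < 0:  # the vertex g = (1, 1, 1, 1)
        return False
    for size in range(4):  # subsets of {i, j, k} of size 0..3
        for subset in combinations(range(3), size):
            direction = [1 if c in subset else 0 for c in range(3)] + [1]
            if dot(x, direction) < 0:
                return False
    return True


def brute_force(x, u, n_max):
    """Check x*w + u >= 0 directly for every point of V with N <= n_max."""
    return all(
        dot(x, (i, j, k, n)) + u >= 0
        for n in range(1, n_max + 1)
        for i, j, k in product(range(1, n + 1), repeat=3)
    )


for x, u in [((-1, 0, 0, 1), 0), ((0, 0, 0, -1), 5), ((1, 1, 0, -1), 0)]:
    print(x, u, finite_check(x, u), brute_force(x, u, 8))
```

For the three sample pairs the two checks agree: the first inequality ($N - i \ge 0$) holds everywhere, while the other two fail once $N$ is large enough, which the finite test detects without enumerating the domain.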
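Each column of the matrices $D_{\alpha,\beta}$, $\tilde F_{l,\beta,q}$, and $\tilde G_{l,\beta,q}$ and each vector $\tilde f^{(l,\beta,q)}$ is matched by some coordinate of the weighting coefficients. If a column of the matrices or a vector is repeated $\gamma$ times, then the greater $\gamma$, the larger the role of this column or vector in minimizing the time of algorithm realization and in generating a distribution of operations and data that either does not require data exchanges between the processors or requires only local communications, and the larger the corresponding coordinates of the weighting coefficients should be. The weighting coefficients can also express preferences in choosing an allocation of operations and data that ensures that there are no data exchanges between the processors or that only local communications are needed. If, for some reason, one needs an allocation without exchanges between elements of some array (or one in which the exchange requires only local communications), the columns of the matrices $\tilde F_{l,\beta,q}$ and $\tilde G_{l,\beta,q}$ and the vectors $\tilde f^{(l,\beta,q)}$ corresponding to this array should be associated with greater weighting coefficients than the columns of the matrices and the vectors corresponding to other arrays.

The smaller the values of the variables $z^{(\alpha,\beta)}$, $z^F_{l,\beta,q}$, $z^G_{l,\beta,q}$, and $z^f_{l,\beta,q}$, the smaller the value of the function $\rho$. Thus, the vector $x^{(\xi)}$ satisfying the posed conditions can be found as a solution of the following optimization problem. If $\xi \le r$, then set $\lambda^f_{l,\beta,q} = 0$ for all $l$, $\beta$, and $q$; choose a vector $s^{(\beta)} \in S^{(\beta)}$ for each $\beta \in L^{(\xi)}$; and minimize the value of the function $\rho$ under conditions (16)–(19) and condition (15) for $\beta \in L^{(\xi)}$. If the solution of the optimization problem satisfies the conditions $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$, but $z^f_{l,\beta,q} \ne 0$ for some $l$, $\beta$, and $q$, then, for these values of $l$, $\beta$, and $q$, set the weighting coefficients $\lambda^f_{l,\beta,q}$ equal to nonzero values and minimize the value of $\rho$ again under the same conditions. If $\xi > r$, then choose a vector $s^{(\beta)} \in S^{(\beta)}$ for each $\beta \in L^{(\xi)}$ and minimize the value of the function $\rho$ under conditions (16) and condition (15) for $\beta \in L^{(\xi)}$.

The greater the weighting coefficients corresponding to the coordinates of the vectors $z^{(\alpha,\beta)}$, $z^F_{l,\beta,q}$, $z^G_{l,\beta,q}$ and to the values $z^f_{l,\beta,q}$, the greater the role of the minimization of these values in the minimization of $\rho$. Therefore, when choosing the vectors $s^{(\beta)}$ in condition (15), one should seek to ensure that they are not collinear to the vectors determined by the columns of the matrices $D_{\alpha,\beta}$ and $\tilde F_{l,\beta,q}$ that have high priorities. In searching for the vector $\tau^{(\beta,\xi)}$ with $\beta$ outside $L^{(\xi)}$, condition (15) is not obligatory; however, it allows one to avoid a trivial solution.

If the solution of the posed optimization problem satisfies the conditions $z^G_{l,\beta,q} = 0^{(e)}$, $z^F_{l,\beta,q} = 0^{(n_\beta)}$, and $z^f_{l,\beta,q} = 0$ for some $l$, $\beta$, and $q$, then the resulting allocation of operations and data among the processors does not require data exchanges between processors for these $l$, $\beta$, and $q$. If the equality $z^f_{l,\beta,q} = 0$ is not satisfied for some $l$, $\beta$, and $q$, the resulting allocation provides only local communications for those $l$, $\beta$, and $q$ for which $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$. If, for some $l$, $\beta$, and $q$, the inequalities $z^F_{l,\beta,q} \ne 0^{(n_\beta)}$ or $z^G_{l,\beta,q} \ne 0^{(e)}$ are satisfied, the resulting allocation requires nonlocal (i.e., depending on $J$ or $N$) communications for these $l$, $\beta$, and $q$.

The findings reported in this paper are summed up by the following procedure. The aim of this procedure is to obtain $n$ sets of functions $t^{(1)}_\xi, \ldots, t^{(K)}_\xi$ and $r$ sets of functions $d^{(1)}_\xi, \ldots, d^{(L)}_\xi$ satisfying the conditions for preserving dependences and, if possible, either the conditions that there are no data exchanges between processors or the conditions that only local communications are needed. This is a recursive procedure containing $n$ recursions. The result of the $\xi$th recursion is the vector $x^{(\xi)}$.

Procedure (obtaining scheduling functions and functions of allocation of operations and data over processors).

First, set $\xi = 1$.

Step 1. Choose a vector $s^{(\beta)} \in S^{(\beta)}$ for each $\beta \in L^{(\xi)}$. Find a vector $x^{(\xi)}$ as a solution of the optimization problem

$\min \{\rho \mid x^{(\xi)} D_{\alpha,\beta} - z^{(\alpha,\beta)} = 0,\ z^{(\alpha,\beta)} \ge 0,\ D_{\alpha,\beta} \in \mathcal{D};\ |x^{(\xi)} \tilde F_{l,\beta,q}| - z^F_{l,\beta,q} = 0,\ \tilde F_{l,\beta,q} \in \mathcal{D}^F,\ \xi \le r;\ |x^{(\xi)} \tilde G_{l,\beta,q}| - z^G_{l,\beta,q} = 0,\ \tilde G_{l,\beta,q} \in \mathcal{D}^G,\ \xi \le r;\ |x^{(\xi)} \tilde f^{(l,\beta,q)}| - z^f_{l,\beta,q} = 0,\ \tilde f^{(l,\beta,q)} \in \mathcal{D}^f,\ \xi \le r;\ \tau^{(\beta,\xi)} s^{(\beta)} \ne 0,\ \beta \in L^{(\xi)}\}$,

setting $\lambda^f_{l,\beta,q} = 0$ for all $l$, $\beta$, and $q$ if $\xi \le r$. If $\xi \le r$ and there are $l$, $\beta$, and $q$ such that $z^G_{l,\beta,q} = 0^{(e)}$ and $z^F_{l,\beta,q} = 0^{(n_\beta)}$ but $z^f_{l,\beta,q} \ne 0$, then go to Step 2; otherwise, go to Step 3.

Step 2. For $l$, $\beta$, and $q$ such that $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$ but $z^f_{l,\beta,q} \ne 0$, set the weighting coefficients $\lambda^f_{l,\beta,q}$ equal to nonzero values and find the vector $x^{(\xi)}$ again as a solution of the optimization problem.

Step 3. Determine the set $L^{(\xi+1)} = \{\beta \mid n - \xi = n_\beta - \operatorname{rank} T^{(\beta)}_{1:\xi}\}$.

Step 4. If $\xi = n$, then quit the procedure; otherwise, increment $\xi$ by 1 and go to Step 1.

7. EXAMPLE

Let us consider the algorithm of multiplication of three matrices of order $N$: $X = ABD$, $A, B, D \in \mathbb{Z}^{N \times N}$.

for (i = 1 to N) do
  for (j = 1 to N) do
    for (k = 1 to N) do
      S1: c[i,j] = c[i,j] + a[i,k]*b[k,j];
for (i = 1 to N) do
  for (j = 1 to N) do
    for (k = 1 to N) do
      S2: x[i,j] = x[i,j] + c[i,k]*d[k,j].

The algorithm dependences are determined by the functions

$\bar\Phi_{1,1}(i, j, k) = \bar\Phi_{2,2}(i, j, k) = E^{(3)} (i, j, k)^T - (0, 0, 1)^T$, $(i, j, k) \in V_{1,1} = V_{2,2} = \{(i, j, k) \in \mathbb{Z}^3 \mid 1 \le i, j \le N,\ 2 \le k \le N\}$,

$\bar\Phi_{1,2}(i, j, k) = (i, k, N)^T$, $(i, j, k) \in V_{1,2} = \{(i, j, k) \in \mathbb{Z}^3 \mid 1 \le i, j, k \le N\}$,

and are generated by the arrays $c$, $x$, and $c$, respectively. To each array we assign a value of the index $l$: $l = 1$ to the array $c$, $l = 2$ to the array $a$, $l = 3$ to the array $b$, $l = 4$ to the array $x$, and $l = 5$ to the array $d$.

We have $\sigma = 30$,

$x^{(\xi)} = (\tau^{(1,\xi)}_1, \tau^{(1,\xi)}_2, \tau^{(1,\xi)}_3, \tau^{(2,\xi)}_1, \tau^{(2,\xi)}_2, \tau^{(2,\xi)}_3, \eta^{(1,\xi)}_1, \eta^{(1,\xi)}_2, \ldots, \eta^{(5,\xi)}_1, \eta^{(5,\xi)}_2, b^{(1,\xi)}, b^{(2,\xi)}, z^{(1,\xi)}, \ldots, z^{(5,\xi)}, a_{1,\xi}, a_{2,\xi}, y_{1,\xi}, \ldots, y_{5,\xi})$,

$\tilde\Phi_{1,1} = \tilde\Phi_{2,2} = 0^{30 \times 3}$, $\tilde\Psi_{1,1} = \tilde\Psi_{2,2} = 0^{(30)}$, $\tilde\varphi^{(1,1)} = e^{(30)}_3$, $\tilde\varphi^{(2,2)} = e^{(30)}_6$,

$\tilde\Phi_{1,2} = (e^{(30)}_4 - e^{(30)}_1,\ e^{(30)}_5,\ e^{(30)}_6 - e^{(30)}_2)$, $\tilde\Psi_{1,2} = -e^{(30)}_3 - e^{(30)}_{17} + e^{(30)}_{18}$, $\tilde\varphi^{(1,2)} = -e^{(30)}_{24} + e^{(30)}_{25}$.

For the six array occurrences $(l, \beta, q) = (1,1,1), (2,1,1), (3,1,1), (4,2,1), (1,2,1), (5,2,1)$ (the arrays $c$, $a$, $b$ in $S_1$ and $x$, $c$, $d$ in $S_2$), omitting the superscript $(30)$ of the unit vectors:

$\tilde F_{1,1,1} = (e_1 - e_7,\ e_2 - e_8,\ e_3)$, $\tilde G_{1,1,1} = e_{17} - e_{19}$, $\tilde f^{(1,1,1)} = e_{24} - e_{26}$;
$\tilde F_{2,1,1} = (e_1 - e_9,\ e_2,\ e_3 - e_{10})$, $\tilde G_{2,1,1} = e_{17} - e_{20}$, $\tilde f^{(2,1,1)} = e_{24} - e_{27}$;
$\tilde F_{3,1,1} = (e_1,\ e_2 - e_{12},\ e_3 - e_{11})$, $\tilde G_{3,1,1} = e_{17} - e_{21}$, $\tilde f^{(3,1,1)} = e_{24} - e_{28}$;
$\tilde F_{4,2,1} = (e_4 - e_{13},\ e_5 - e_{14},\ e_6)$, $\tilde G_{4,2,1} = e_{18} - e_{22}$, $\tilde f^{(4,2,1)} = e_{25} - e_{29}$;
$\tilde F_{1,2,1} = (e_4 - e_7,\ e_5,\ e_6 - e_8)$, $\tilde G_{1,2,1} = e_{18} - e_{19}$, $\tilde f^{(1,2,1)} = e_{25} - e_{26}$;
$\tilde F_{5,2,1} = (e_4,\ e_5 - e_{16},\ e_6 - e_{15})$, $\tilde G_{5,2,1} = e_{18} - e_{23}$, $\tilde f^{(5,2,1)} = e_{25} - e_{30}$.

Let us consider the domains $V_{1,1}$ and $V_{2,2}$. Since $\tilde\Phi_{i,i} = 0$ and $\tilde\Psi_{i,i} = 0$, $i = 1, 2$, we have $D_{1,1} = (\tilde\varphi^{(1,1)}) = (e^{(30)}_3)$ and $D_{2,2} = (\tilde\varphi^{(2,2)}) = (e^{(30)}_6)$.

Let us consider the domain $V_{1,2}$. For all $J = (i, j, k) \in V_{1,2}$, the inequalities $i \le N$, $j \le N$, and $k \le N$ hold. We have $(J, N) = (i, j, k, N)$ and $g^{(1,2)} = (1, 1, 1, 1)$. The directed graph $G^{(1,2)}_4$ has the form presented in the figure.

Figure. The directed graph $G^{(1,2)}_4$: arcs lead from the node $N$ to the nodes $i$, $j$, and $k$.

The basic system of inequalities has the form $x^{(\xi)} (\tilde\Phi_{1,2}\ \tilde\Psi_{1,2})_k \ge 0$, $1 \le k \le 4$. Let us take advantage of rules (1) and (2) and transform the basic system of inequalities. Writing $c_k$ for the column $(\tilde\Phi_{1,2}\ \tilde\Psi_{1,2})_k$, we obtain

$x^{(\xi)} c_4 \ge 0$, $x^{(\xi)} (c_1 + c_4) \ge 0$, $x^{(\xi)} (c_2 + c_4) \ge 0$, $x^{(\xi)} (c_3 + c_4) \ge 0$,
$x^{(\xi)} (c_1 + c_2 + c_4) \ge 0$, $x^{(\xi)} (c_1 + c_3 + c_4) \ge 0$, $x^{(\xi)} (c_2 + c_3 + c_4) \ge 0$, $x^{(\xi)} (c_1 + c_2 + c_3 + c_4) \ge 0$.

We obtain the matrix $D_{1,2}$, whose first column is the vector $(\tilde\Phi_{1,2}\ \tilde\Psi_{1,2}) g^{(1,2)} + \tilde\varphi^{(1,2)}$ and whose remaining columns are the column combinations entering into the transformed system.

Let us choose the weighting coefficients $\lambda$. The vectors $e^{(30)}_3$ and $e^{(30)}_6$ are encountered as columns (of the matrices $D_{1,1}$, $\tilde F_{1,1,1}$ and $D_{2,2}$, $\tilde F_{4,2,1}$, respectively) twice each, and the remaining vectors are encountered once. As the algorithm dependences are generated by the arrays $c$ and $x$, it is desirable to eliminate exchanges between entries of these arrays; therefore, to the columns of the matrices $\tilde F_{1,1,1}$, $\tilde F_{1,2,1}$, and $\tilde F_{4,2,1}$ we assign larger weighting coefficients. Thus, the weighting coefficients with indices $(1,1,1)$, $(4,2,1)$, and $(1,2,1)$ are chosen larger than those with indices $(2,1,1)$, $(3,1,1)$, and $(5,2,1)$.

Let us apply the proposed technique for $r = 2$. We have $\mathcal{D} = \{D_{1,1}, D_{2,2}, D_{1,2}\}$, and the sets $\mathcal{D}^F$, $\mathcal{D}^G$, and $\mathcal{D}^f$ consist of the matrices $\tilde F_{l,\beta,q}$, the matrices $\tilde G_{l,\beta,q}$, and the vectors $\tilde f^{(l,\beta,q)}$ listed above; $L^{(1)} = \{1, 2\}$.

Recursion 1 ($\xi = 1$):
Step 1. Choose $s^{(1)} = s^{(2)} = e^{(3)}_1$. Set $\lambda^f_{1,1,1} = \lambda^f_{2,1,1} = \lambda^f_{3,1,1} = \lambda^f_{4,2,1} = \lambda^f_{1,2,1} = \lambda^f_{5,2,1} = 0$. Solving the optimization problem, we obtain the vector $x^{(1)}$. Since there are no $l$, $\beta$, and $q$ such that $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$ but $z^f_{l,\beta,q} \ne 0$, go to Step 3.
Step 3. $L^{(2)} = L^{(1)}$.
Step 4. $\xi = 2$.

Recursion 2 ($\xi = 2$):
Step 1. Choose $s^{(1)} = s^{(2)} = e^{(3)}_2$. Set $\lambda^f_{1,1,1} = \lambda^f_{2,1,1} = \lambda^f_{3,1,1} = \lambda^f_{4,2,1} = \lambda^f_{1,2,1} = \lambda^f_{5,2,1} = 0$. Solving the optimization problem, we obtain the vector $x^{(2)}$. Since there are no $l$, $\beta$, and $q$ such that $z^F_{l,\beta,q} = 0^{(n_\beta)}$ and $z^G_{l,\beta,q} = 0^{(e)}$ but $z^f_{l,\beta,q} \ne 0$, go to Step 3.
Step 3. $L^{(3)} = L^{(2)}$.
Step 4. $\xi = 3$.

Recursion 3 ($\xi = 3$):
Step 1. Choose $s^{(1)} = s^{(2)} = e^{(3)}_3$. Solving the optimization problem, we obtain the vector $x^{(3)}$. Since $\xi > r$, go to Step 3.
Step 3. $L^{(4)} = L^{(3)}$.
Step 4. The procedure is completed.

Thus, we obtained the following functions giving the allocation of operations and data between the processors and defining the order of operation execution by the processors:

$t^{(1)}_1(i, j, k) = i$, $t^{(2)}_1(i, j, k) = i$, $t^{(1)}_2(i, j, k) = j$, $t^{(2)}_2(i, j, k) = j + N$, $t^{(1)}_3(i, j, k) = k$, $t^{(2)}_3(i, j, k) = k + N$;
$d^{(1)}_1(i, j) = i$, $d^{(2)}_1(i, j) = i$, $d^{(3)}_1(i, j) = 0$, $d^{(4)}_1(i, j) = i$, $d^{(5)}_1(i, j) = 0$;
$d^{(1)}_2(i, j) = j$, $d^{(2)}_2(i, j) = 0$, $d^{(3)}_2(i, j) = j$, $d^{(4)}_2(i, j) = j + N$, $d^{(5)}_2(i, j) = j + N$.

In accordance with the obtained functions $t^{(\beta)}_\xi$, $1 \le \beta \le 2$, $1 \le \xi \le 2$, and $d^{(l)}_\xi$, $1 \le l \le 5$, $1 \le \xi \le 2$, we obtain a mapping of operations and data onto the two-dimensional space of virtual processors. The operations $S_1(i, j, k)$ and $S_2(i, j, k)$ are mapped to the processors $(i, j)$ and $(i, j + N)$, respectively; $c(i, j)$ to $(i, j)$; $a(i, j)$ to $(i, 0)$; $b(i, j)$ to $(0, j)$; $x(i, j)$ to $(i, j + N)$; $d(i, j)$ to $(0, j + N)$.

In the course of the first recursion, the equalities $z^F_{1,1,1} = z^F_{2,1,1} = z^F_{4,2,1} = z^F_{1,2,1} = 0^{(3)}$, $z^G_{1,1,1} = z^G_{2,1,1} = z^G_{4,2,1} = z^G_{1,2,1} = 0$, and $z^f_{1,1,1} = z^f_{2,1,1} = z^f_{4,2,1} = z^f_{1,2,1} = 0$ mean that there is no need for data exchanges between elements of the arrays $c$, $a$, and $x$ along the first coordinate. The inequalities $z^F_{3,1,1} \ne 0^{(3)}$ and $z^F_{5,2,1} \ne 0^{(3)}$ mean that there are nonlocal (depending on $J$) communications along the first coordinate for the data exchange between elements of the arrays $b$ and $d$. In the course of the second recursion, the equalities $z^F_{3,1,1} = z^F_{4,2,1} = z^F_{5,2,1} = 0^{(3)}$, $z^G_{3,1,1} = z^G_{4,2,1} = z^G_{5,2,1} = 0$, and $z^f_{3,1,1} = z^f_{4,2,1} = z^f_{5,2,1} = 0$ mean that data exchanges between elements of the arrays $b$, $x$, and $d$ along the second coordinate are not required. The inequalities $z^F_{2,1,1} \ne 0^{(3)}$ and $z^F_{1,2,1} \ne 0^{(3)}$ mean that there are nonlocal (depending on $J$) communications along the second coordinate for the data exchange between elements of the arrays $c$ and $a$.

In accordance with the allocation functions for the operations, $t^{(1)}_\xi$ and $t^{(2)}_\xi$, $1 \le \xi \le 2$, and the scheduling functions $t^{(1)}_3$ and $t^{(2)}_3$, one can write a code (the same for each processor) designed for executing the algorithm on $N \times 2N$ processors. The symbols p1 and p2 denote the spatial coordinates of a processor. The loop variable t corresponds to the algorithm scheduling.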
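As a sanity check on the functions obtained in this example, the Python sketch below verifies them by brute force for a small $N$. It is an illustration under stated assumptions rather than part of the procedure: the helper names are hypothetical, the dependences are enumerated directly from the two loop nests, and the placement table repeats the mapping stated above ($c(i,j)$ to $(i,j)$, $a(i,j)$ to $(i,0)$, $b(i,j)$ to $(0,j)$, $x(i,j)$ to $(i,j+N)$, $d(i,j)$ to $(0,j+N)$).

```python
from itertools import product


def t(stmt, J, N):
    """Values (t_1, t_2, t_3) of the three t-functions for operation S_stmt(J)."""
    i, j, k = J
    return (i, j, k) if stmt == 1 else (i, j + N, k + N)


def dependences(N):
    """Yield (source, sink) operation pairs of the two loop nests."""
    for i, j, k in product(range(1, N + 1), repeat=3):
        if k >= 2:
            yield (1, (i, j, k - 1)), (1, (i, j, k))  # c[i,j] accumulates over k
            yield (2, (i, j, k - 1)), (2, (i, j, k))  # x[i,j] accumulates over k
        yield (1, (i, k, N)), (2, (i, j, k))          # S2 reads the final c[i,k]


def preserves_dependences(N):
    """Condition (1) for each of the three sets of t-functions."""
    return all(
        src_t <= dst_t
        for (sa, I), (sb, J) in dependences(N)
        for src_t, dst_t in zip(t(sa, I, N), t(sb, J, N))
    )


PLACE = {  # processor that stores entry (p, q) of each array
    "c": lambda p, q, N: (p, q),
    "a": lambda p, q, N: (p, 0),
    "b": lambda p, q, N: (0, q),
    "x": lambda p, q, N: (p, q + N),
    "d": lambda p, q, N: (0, q + N),
}


def accesses(stmt, J):
    i, j, k = J
    if stmt == 1:
        return [("c", (i, j)), ("a", (i, k)), ("b", (k, j))]
    return [("x", (i, j)), ("c", (i, k)), ("d", (k, j))]


def exchange_free(N, coord):
    """Arrays whose every use is stored where it is executed along `coord`."""
    ok = set(PLACE)
    for stmt in (1, 2):
        for J in product(range(1, N + 1), repeat=3):
            proc = t(stmt, J, N)[:2]
            for name, (p, q) in accesses(stmt, J):
                if PLACE[name](p, q, N)[coord] != proc[coord]:
                    ok.discard(name)
    return ok


print(preserves_dependences(3))
print(sorted(exchange_free(3, 0)), sorted(exchange_free(3, 1)))
```

For $N = 3$ the check confirms that all three sets of t-functions preserve the dependences, that the arrays $c$, $a$, and $x$ need no exchange along the first coordinate, and that $b$, $x$, and $d$ need none along the second, in agreement with the discussion of the two recursions.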
if (1 ≤ p1 ≤ N) then
  if (1 ≤ p2 ≤ N) then
    for t = 1 to N do
      if (1 ≤ p2 ≤ N) then
        S1: c[p1,p2] = c[p1,p2] + a[p1,t]*b[t,p2];
  if (N + 1 ≤ p2 ≤ 2N) then
    for t = N + 1 to 2N do
      if (N + 1 ≤ p2 ≤ 2N) then
        S2: x[p1,p2−N] = x[p1,p2−N] + c[p1,t−N]*d[t−N,p2−N];

To write the SPMD code of the algorithm, it is necessary to set a transfer mode for the data that require nonlocal communications (the arrays c, a, b, and d) and to determine the relevant synchronization.

Since three independent sets of t-functions were obtained, the blocking technique can be applied to this algorithm. Let us denote the block size by B. The symbols ib and jb denote the spatial coordinates of a processor, and the symbols p1 and p2, which earlier denoted the spatial coordinates of a processor, are now loop variables and control the order of operation execution on each processor.

if (1 ≤ ib ≤ ⌈N/B⌉) then
  for p1 = (ib−1)B + 1 to min(ib*B, N) do
    if (1 ≤ jb ≤ ⌈N/B⌉) then
      for p2 = (jb−1)B + 1 to min(jb*B, N) do
        for tb = 1 to ⌈N/B⌉ do
          for t = (tb−1)B + 1 to min(tb*B, N) do
            S1: c[p1,p2] = c[p1,p2] + a[p1,t]*b[t,p2];
    if (⌊N/B⌋ + 1 ≤ jb ≤ ⌈2N/B⌉) then
      for p2 = max((jb−1)B + 1, N + 1) to min(jb*B, 2N) do
        for tb = ⌊N/B⌋ + 1 to ⌈2N/B⌉ do
          for t = max((tb−1)B + 1, N + 1) to min(tb*B, 2N) do
            S2: x[p1,p2−N] = x[p1,p2−N] + c[p1,t−N]*d[t−N,p2−N];

8. CONCLUSIONS

A method for generating pipeline parallelism and obtaining a consistent solution of the problem of spatial and temporal mapping of operations and data onto virtual processors has been proposed. In addition, practically useful necessary and sufficient conditions for preserving algorithm dependences have been obtained. The method has the following advantages:

the original algorithm is represented by affine loop nests of arbitrary nesting structure;

the technique of consistent distribution of operations and data among the processors is independent of particular values of the outer variables; the functions giving the pipeline parallelism and the data allocation among the processors depend on the outer variables parametrically;

suitability for automation: the method is formalized and allows for software implementation;

the possibility of obtaining alternative solutions by setting priorities for the conditions for preserving dependences and for the conditions for generating allocations of operations and data that either do not require data exchanges between the processors or require only local communications;

the possibility of obtaining information about the presence or absence of exchanges between entries of each array and on the character of the communication (local, depending on loop parameters, depending on outer variables) caused by each array;

the possibility of adding new conditions (for example, improving data locality);

the possibility of obtaining block versions of the algorithm.

REFERENCES

1. Darte, A. and Robert, Y., Affine-by-Statement Scheduling of Uniform and Affine Loop Nests over Parametric Domains, J. Parallel Distrib. Computing, 1995, vol. 29, no. 1.
2. Feautrier, P., Some Efficient Solutions to the Affine Scheduling Problem, Int. J. Parallel Programming, 1992, vol. 21, nos. 5–6.
3. Voevodin, V.V. and Voevodin, Vl.V., Parallel'nye vychisleniya (Parallel Computing), St. Petersburg: BHV.
4. Dion, M. and Robert, Y., Mapping Affine Loop Nests, Parallel Computing, 1996, vol. 22.
5. Frolov, A.V., Optimization of Arrays Allocation in FORTRAN Programs for Multiprocessor Computing Systems, Programmirovanie, 1998, no. 3.
6. Lee, H.J. and Fortes, J.A.B., Automatic Generation of Modular Time-Space Mappings and Data Alignments, J. VLSI Signal Processing, 1998, vol. 19.
7. Likhoded, N.A., Distribution of Operations and Data Arrays over Processors, Programmirovanie, 2003, no. 3.
8. Darte, A. and Robert, Y., Mapping Uniform Loop Nests onto Distributed Memory Architectures, Parallel Computing, 1994, vol. 20.
9. Lim, A.W. and Lam, M.S., Maximizing Parallelism and Minimizing Synchronization with Affine Partitions, Parallel Computing, 1998, vol. 24, nos. 3–4.
10. Lim, A.W. and Lam, M.S., An Affine Partitioning Algorithm to Maximize Parallelism and Minimize Communication, Proc. of the 1st ACM SIGARCH Int. Conf. on Supercomputing.
11. Bakhanovich, S.V. and Likhoded, N.A., A Method for Parallelizing Algorithms by Vector Scheduling Functions, Programmirovanie, 2001, vol. 27, no. 4.
12. Frolov, A.V., Finding and Use of Oriented Cuts of Real Algorithms Graphs, Programmirovanie, 1997, no. 4.
13. Lim, A.W., Liao, S.-W., and Lam, M.S., Blocking and Array Contraction across Arbitrary Nested Loops Using Affine Partitioning, Proc. ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 2001.
