2 AN OVERVIEW OF THE TENSOR PRODUCT


298 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 10, NO. 3, MARCH 1999

The choice of data distribution has a large influence on the performance of the synthesized programs, 2) our simple algorithm for selecting the appropriate data distribution size is very effective, and 3) the dynamic programming approach can always reduce the number of passes needed to access out-of-core data.

The paper is organized as follows: Section 2 introduces tensor products and discusses the formulation of block recursive algorithms using tensor products and other matrix operations. In Section 3, we introduce a two-level computation model and present the semantics of data distributions and data access patterns. Section 4 presents an overview of our out-of-core program synthesis framework. In Section 5 and Section 6, we summarize the performance results and show the effectiveness of using various block-cyclic data distributions. Performance results are presented in Section 7. Finally, conclusions are provided in Section 8.

2 AN OVERVIEW OF THE TENSOR PRODUCT

In this section, we illustrate the formulation of block recursive algorithms using tensor products. We begin with some preliminary definitions which are essential for understanding the rest of the paper.

2.1 Preliminaries

The tensor product is useful in expressing the block structure in a matrix. Let $A$ be an $m \times n$ matrix and $B$ be a $p \times q$ matrix. The tensor product $A \otimes B$ is a block matrix obtained by replacing each element $a_{i,j}$ by the matrix $a_{i,j}B$, i.e.,

$$A_{m,n} \otimes B_{p,q} = \begin{bmatrix} a_{0,0}B_{p,q} & \cdots & a_{0,n-1}B_{p,q} \\ \vdots & \ddots & \vdots \\ a_{m-1,0}B_{p,q} & \cdots & a_{m-1,n-1}B_{p,q} \end{bmatrix}.$$

The above tensor product can be factorized as follows:

$$A_{m,n} \otimes B_{p,q} = (A_{m,n} \otimes I_p)(I_n \otimes B_{p,q}) = (I_m \otimes B_{p,q})(A_{m,n} \otimes I_q),$$

where $I_n$ represents the $n \times n$ identity matrix. A tensor factorization can be used to efficiently compute $Y^{mp}$ obtained by applying $C_{mp,nq} = A_{m,n} \otimes B_{p,q}$ to the vector $X^{nq}$, i.e., $Y^{mp} = C_{mp,nq} X^{nq}$. For example, direct application of $C_{mp,nq}$ to $X^{nq}$ requires $O(mpnq)$ scalar operations. However, the following algorithm based on the tensor factorization of $C_{mp,nq}$:

$$Z^{mq} = (A_{m,n} \otimes I_q) X^{nq}, \qquad Y^{mp} = (I_m \otimes B_{p,q}) Z^{mq},$$

requires only $O(qmn + mpq)$ scalar operations.

A tensor product involving an identity matrix can be implemented as parallel operations. For example, consider the application of $(I_m \otimes A_{p,n})$ to $X^{mn}$, i.e.,

$$(I_m \otimes A_{p,n}) X^{mn} = \begin{bmatrix} A_{p,n} & & \\ & \ddots & \\ & & A_{p,n} \end{bmatrix} \begin{bmatrix} X^{mn}(0 : n-1) \\ X^{mn}(n : 2n-1) \\ \vdots \\ X^{mn}((m-1)n : mn-1) \end{bmatrix} = \begin{bmatrix} A_{p,n} X^{mn}(0 : n-1) \\ A_{p,n} X^{mn}(n : 2n-1) \\ \vdots \\ A_{p,n} X^{mn}((m-1)n : mn-1) \end{bmatrix}.$$

This can be interpreted as $m$ copies of $A_{p,n}$ acting in parallel on $m$ disjoint segments of $X^{mn}$. However, to interpret the application of $(A_{p,n} \otimes I_m)$ to $X^{mn}$ as parallel operations, we need to understand stride permutations (also known as shuffle permutations). The stride permutation $L^{mn}_n$ of a vector $X^{mn}$ is a vector $Y^{mn}$, where

$$Y^{mn} = [X^{mn}(0 : mn-1 : n),\ X^{mn}(1 : mn-1 : n),\ \ldots,\ X^{mn}(n-1 : mn-1 : n)],$$

i.e., the first $m$ elements of $Y^{mn}$ are $X^{mn}(0 : mn-1 : n)$, which represents the elements of $X^{mn}$ at stride $n$ starting with element 0. The next $m$ elements are the elements of $X^{mn}$ at stride $n$, starting with element 1. The stride permutation $L^{mn}_n$ can be represented as an $mn \times mn$ transformation. For example, the effect of applying $L^6_2$ to $X^6$ can be expressed in matrix form as follows:

$$L^6_2 X^6 = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&0&1&0 \\ 0&1&0&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&0&1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} x_0 \\ x_2 \\ x_4 \\ x_1 \\ x_3 \\ x_5 \end{bmatrix}.$$

Stride permutations can also be defined in terms of a permutation of the tensor product of vector bases. A vector basis $e^m_i$, $0 \le i < m$, is a column vector of length $m$ with a one at position $i$ and zeros elsewhere. The tensor product of vector bases is called a tensor basis. A tensor basis $e^{m_1}_{i_1} \otimes \cdots \otimes e^{m_t}_{i_t}$ can be linearized into a vector basis $e^{m_1 \cdots m_t}_{i_1 m_2 \cdots m_t + \cdots + i_{t-1} m_t + i_t}$. Equivalently, a vector basis $e^M_i$ can be factorized into a tensor product of vector bases $e^{m_1}_{i_1} \otimes \cdots \otimes e^{m_t}_{i_t}$, where $M = m_1 \cdots m_t$ and

$$i_k = (i \ \mathrm{div}\ M_{k+1}) \bmod m_k, \qquad M_k = \prod_{i=k}^{t} m_i, \qquad M_{t+1} = 1.$$

The stride permutation can now be defined as:

$$L^{mn}_n (e^m_i \otimes e^n_j) = e^n_j \otimes e^m_i.$$

This gives the relationship between the indexing of the input and the output vectors. By linearizing the input tensor basis $e^m_i \otimes e^n_j$ to $e^{mn}_{in+j}$, we get the indexing function of the input vector to be $in + j$. Similarly, the indexing function of the output vector is obtained by linearizing the output tensor basis to be $jm + i$. Therefore, the effect of applying the stride permutation $L^{mn}_n$ to an input vector is that the element at index $in + j$ of the input vector is stored at index $jm + i$ of the output vector. Using stride permutations, an application of $(A_{p,n} \otimes I_m)$ to $X^{mn}$ can also be interpreted as $m$ parallel applications of

LI ET AL.: SYNTHESIZING EFFICIENT OUT-OF-CORE PROGRAMS FOR BLOCK RECURSIVE ALGORITHMS USING BLOCK-CYCLIC DATA 299

$A_{p,n}$ to disjoint segments of $X^{mn}$ by using the identity $L^{pm}_m (A_{p,n} \otimes I_m) = (I_m \otimes A_{p,n}) L^{mn}_m$ as follows: $L^{pm}_m Y^{pm} = (I_m \otimes A_{p,n}) L^{mn}_m X^{mn}$, i.e.,

$$\begin{bmatrix} Y^{pm}(0 : pm-1 : m) \\ Y^{pm}(1 : pm-1 : m) \\ \vdots \\ Y^{pm}(m-1 : pm-1 : m) \end{bmatrix} = \begin{bmatrix} A_{p,n} X^{mn}(0 : mn-1 : m) \\ A_{p,n} X^{mn}(1 : mn-1 : m) \\ \vdots \\ A_{p,n} X^{mn}(m-1 : mn-1 : m) \end{bmatrix}.$$

However, the inputs for each application of $A_{p,n}$ are accessed at a stride of $m$ and the outputs are also stored at a stride of $m$.

The properties of tensor products can be used to transform the tensor product representation of an algorithm into another equivalent form, which can take advantage of the parallel operations discussed above. For example, by using the following tensor product factorizations,

$$A_{m,n} \otimes B_{p,q} = (A_{m,n} \otimes I_p)(I_n \otimes B_{p,q}) = (I_m \otimes B_{p,q})(A_{m,n} \otimes I_q),$$

$A \otimes B$ can be implemented by first applying $q$ parallel applications of $A$ and then $m$ parallel applications of $B$.¹ Several other key properties of tensor products are listed below [1]:

1. $A \otimes B \otimes C = A \otimes (B \otimes C) = (A \otimes B) \otimes C$;
2. $(A \otimes B)(C \otimes D) = AC \otimes BD$, assuming that the ordinary multiplications $AC$ and $BD$ are defined;
3. $\prod_{i=0}^{n-1} (I_m \otimes A_i) = I_m \otimes \left(\prod_{i=0}^{n-1} A_i\right)$;
4. $\prod_{i=0}^{n-1} (A_i \otimes I_m) = \left(\prod_{i=0}^{n-1} A_i\right) \otimes I_m$.

Property 2 is also called factor grouping. Properties 3 and 4 follow from Property 2.

¹ We ignore the dimensions of matrices whenever they are clear from the context.

2.2 Tensor Product Formulation of Block Recursive Algorithms

A block recursive algorithm is obtained from a recursive tensor factorization of a computation matrix. For example, FFT algorithms are derived by tensor factorization of the discrete Fourier transform (DFT) matrix. The algorithms obtained from tensor factorization are computationally more efficient than those that directly use the unfactorized matrix. For example, computing the DFT of a vector of size $N$ by directly multiplying it by an $N \times N$ DFT matrix requires $O(N^2)$ operations compared to only $O(N \log N)$ operations using an FFT algorithm. Some other examples of block recursive algorithms are Strassen's matrix multiplication [11], [13], convolution [8], and fast sine/cosine transforms [18]. A tensor product formulation of a block recursive algorithm has the following generic form:

$$\prod_{j=1}^{k} (I_{r_j} \otimes A_{v_j} \otimes I_{c_j}),$$

where $A_{v_j}$ is a $v_j \times v_j$ square linear transformation, $\prod_{i=1}^{k} F_i$ denotes $F_k \cdots F_1$, and $r_j v_j c_j = r_i v_i c_i$ for $1 \le i, j \le k$. The computation performed at each step $j$ is $U_j = (I_{r_j} \otimes A_{v_j} \otimes I_{c_j}) V_j$. Due to the presence of identity terms, it is easy to express each computation step using parallel operations. However, the task of harnessing this inherent parallelism in each computation step with the goal of minimizing the parallel I/O operations is nontrivial. We next present tensor product formulations of two FFT algorithms which are used as examples in this paper.

2.2.1 Fast Fourier Transform

The tensor product formulations of various FFT algorithms are presented in [1], [18]. These formulations are obtained by different tensor factorizations of the discrete Fourier transform matrix. Although all of these algorithms are computationally equivalent, they have different computational structures and different data access patterns. For example, consider the following tensor product formulation of the radix-2 decimation-in-time Cooley-Tukey FFT:

$$F_{2^n} = \left[ \prod_{i=1}^{n} \left(I_{2^{n-i}} \otimes F_2 \otimes I_{2^{i-1}}\right) \left(I_{2^{n-i}} \otimes T^{2^i}_{2^{i-1}}\right) \right] R_{2^n}, \qquad (3)$$

and

$$R_{2^n} = \prod_{i=1}^{n} \left(I_{2^{i-1}} \otimes L^{2^{n-i+1}}_{2}\right), \qquad F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.$$

$T^{2^i}_{2^{i-1}}$ represents a diagonal matrix of constants and $R_{2^n}$ permutes the input sequence to a bit-reversed order. As can be seen from (3), for an FFT on $2^n$ points, there are $n$ steps in the computation after performing the initial bit-reversal permutation. At each step, the data array from the previous step is scaled by multiplying by twiddle factors, $Y = (I_{2^{n-i}} \otimes T^{2^i}_{2^{i-1}}) X_{i-1}$, followed by the butterfly computation $X_i = (I_{2^{n-i}} \otimes F_2 \otimes I_{2^{i-1}}) Y$.

2.2.2 Matrix Transposition

The transposition of a $p \times q$ matrix $M_{p,q}$ can be expressed using a stride permutation $L^{pq}_q$ as $(M^{pq})^T = L^{pq}_q M^{pq}$, where $M^{pq}$ is the row-major linear representation of $M_{p,q}$. Various matrix transposition algorithms can be expressed using tensor product formulas involving stride permutations [10]. For example, the block matrix transposition algorithm for transposing a $p \times q$ matrix can be described by the following formula:

$$L^{pq}_q = \left(I_{q_2} \otimes L^{p_2 q_1}_{q_1} \otimes I_{p_1}\right) \left(L^{p_2 q_2}_{q_2} \otimes I_{p_1 q_1}\right) \left(I_{p_2 q_2} \otimes L^{p_1 q_1}_{q_1}\right) \left(I_{p_2} \otimes L^{p_1 q_2}_{q_2} \otimes I_{q_1}\right),$$

where $p = p_2 p_1$ and $q = q_2 q_1$. The first (rightmost) factor converts the row-major representation of the input matrix to a row-major representation of the input matrix viewed as a $p_2 \times q_2$ block matrix consisting of blocks of size $p_1 \times q_1$. The

second and third factors express transposition of each block and transposition of the block matrix, respectively. The fourth factor is the inverse of the first and it reverts the block row-major representation to the row-major representation of the output. The correctness of this representation can be seen by applying the factors to the input basis $s = e^{p_2}_{i_2} \otimes e^{p_1}_{i_1} \otimes e^{q_2}_{j_2} \otimes e^{q_1}_{j_1}$ to get the following sequence of bases,

$$s \to e^{p_2}_{i_2} \otimes e^{q_2}_{j_2} \otimes e^{p_1}_{i_1} \otimes e^{q_1}_{j_1} \to e^{p_2}_{i_2} \otimes e^{q_2}_{j_2} \otimes e^{q_1}_{j_1} \otimes e^{p_1}_{i_1} \to e^{q_2}_{j_2} \otimes e^{p_2}_{i_2} \otimes e^{q_1}_{j_1} \otimes e^{p_1}_{i_1} \to e^{q_2}_{j_2} \otimes e^{q_1}_{j_1} \otimes e^{p_2}_{i_2} \otimes e^{p_1}_{i_1} = t,$$

and noting that $t = L^{pq}_q s$. Note that we have used the identity

$$(A_{m,n} \otimes B_{p,q})(e^n_i \otimes e^q_j) = A_{m,n} e^n_i \otimes B_{p,q} e^q_j.$$

The basis $t$ is called the output basis.

3 PARALLEL I/O MODEL WITH BLOCK-CYCLIC DATA DISTRIBUTIONS

We use a two-level model which is similar to Vitter and Shriver's two-level memory model []. However, in our model the data on disks (called out-of-core data) can be distributed in different (logical) block sizes. The model consists of a processor with an internal random access memory and a set of disks. The storage capacity of each disk is assumed to be infinite. On each disk, the data is organized as physical blocks of fixed size. Four parameters are used in this model: $N$ (the size of the input), $M$ (the size of the internal memory), $B_d$ (the size of each physical block), and $D$ (the number of disks). We assume that $M < N$, $1 \le B_d \le M/2$, and $1 \le D \le M/B_d$.

In this model, disk I/O occurs in physical tracks (defined below) of size $B_d D$. The physical blocks which have the same relative positions on each disk constitute a physical track. The physical tracks are numbered contiguously with the outermost track having the lowest address and the innermost track having the highest address. The $i$th physical track is denoted by $T_i$. Fig. 1 shows an example data layout with $B_d = 4$, $D = 4$, and $N = 64$. Each parallel I/O operation can simultaneously access $D$ physical blocks, one block from each disk. Therefore, parallelism in data access is at two levels: elements in one physical block are transferred concurrently, and $D$ physical blocks can be transferred in one I/O operation. In this paper, we use the striped disk access model in which the physical blocks in one I/O operation come from the same track, as opposed to the independent I/O model in which blocks can come from different tracks. We use the parallel primitives parallel_read(i) and parallel_write(i) to denote the read and write to the physical track $T_i$, respectively. We define the measure of I/O performance as the number of parallel I/Os required.

3.1 Block-Cyclic Data Distributions

Block-cyclic distributions have been used for distributing arrays among processors on a multiprocessor system. A block-cyclic distribution partitions an array into equal-sized blocks of consecutive elements and then maps them onto the processors in a cyclic manner. If we regard the disks in the above model as processors, then the data organization described above (e.g., in Fig. 1) is exactly a block-cyclic distribution (denoted as cyclic($B_d$)) with the block size $B_d$. Moreover, we can assume that data can be distributed with an arbitrary block size. Fig. 2 shows the data organization for the same parameters as in Fig. 1, but with a cyclic(8) distribution. Notice that the size of the physical track and the size of the physical block are not changed. However, they contain different records. We will call the $B$ records in a block formed by a cyclic($B$) distribution a logical block. Similarly, the logical blocks which have the same relative positions on each disk constitute a logical track. The $i$th logical track is denoted as $LT_i$. Note that each parallel I/O operation still accesses a physical track, not a logical track. Hence, several parallel I/O operations are needed to access a logical track. For example, to load the logical track $LT_1$ in Fig. 2, two parallel_read operations, parallel_read(2) and parallel_read(3), which, respectively, load the physical tracks $T_2$ and $T_3$, are needed. We next use a simple example to show the advantages of using logical distributions in developing I/O-efficient programs for block recursive algorithms.

² Cormen has called this data organization on disks a banded data layout [3] and studied the performance for a class of permutations and several other basic primitives of the NESL language [1].

Why Logical Data Distributions?

Assume that we want to implement $F_8 \otimes I_8$ on our target model under the parameters given in Fig. 1. Further, we assume that the size of the main memory is half of the size of the inputs. Because we are mainly interested in data access patterns, we ignore the real computations conducted by $F_8$. The only thing we need to remember is that $F_8$ needs eight elements with a stride of eight because of the existence of the identity matrix $I_8$.

We first consider implementing $F_8 \otimes I_8$ on the physical block distribution. From the above discussion, we know that the first $F_8$ needs to be applied to eight elements: 0, 8, 16, 24, 32, 40, 48, and 56. From Fig. 1, we can see that these elements required by the $F_8$ computation are stored on four physical tracks. However, our main memory can hold only two physical tracks, so we cannot simply load all four physical tracks into the main memory and accomplish the computation in one pass of I/O. To get around this memory limitation, we can use two different approaches. First, we load the first physical track, keep the first half of the records in each physical block of that loaded track, and throw away the other half of the records. We do this for every other physical track. Then we do the computation for half of the records in the main memory. After finishing the computation for half of the records, we write the results out. Then we repeat the above procedure, but now we keep the other half of the records in the main memory for each loaded track. By doing the computation in this way, it is obvious that we need two passes to load the out-of-core data.

Another method is to use a logical block distribution. Let the size of a logical block be eight, as shown in Fig. 2. In this case, the eight records required by one $F_8$ are stored on two


physical tracks, either $T_1$ and $T_3$ or $T_2$ and $T_4$. Therefore, if we can first load and perform computation on $T_1$ and $T_3$, followed by loading and performing computation on $T_2$ and $T_4$, then the entire computation can be performed in a single pass. Hence, a logical distribution can be used to reduce the number of passes needed to perform the entire computation. However, there are several issues which need to be addressed, such as how to determine the block size of the logical distribution and how to determine the data access patterns. We will discuss these issues in the following sections.

Fig. 1. The data organization for $N = 64$ inputs with $B_d = 4$ and $D = 4$. Each column is a disk. Each box is a physical block. Each row constitutes a physical track. The numbers in each box denote the record indices.

For simplicity, we make the following assumptions: 1) The input and the output data are stored in separate sets of disks. 2) All parameters are powers of two.³ 3) The block size $B$ of the distribution is a multiple of $B_d$.

³ The results can be easily generalized to the case where all parameters are powers of any positive integer.

3.2 Semantics of Data Distributions and Access Patterns

A block-cyclic distribution can be algebraically represented by a tensor basis by identifying the bases which correspond to the processor index [9]. This approach can be adapted to represent data distributions onto disks in our parallel I/O model by substituting disks for processors. However, due to the existence of physical blocks and physical tracks, the method of using tensor bases to define a block-cyclic distribution for multiprocessors needs to be generalized. We achieve this by further factoring the tensor basis to get bases for the physical block index and the physical track index. We call this factored tensor basis an (out-of-core) data distribution basis, which is defined as follows:

Definition 3.1. Let $B = B_b B_d$. If a vector of length $N$, where $N = GBD$ and $G$ is an integer, is distributed according to the cyclic($B$) distribution on $D$ disks, then its data distribution basis is defined as:

$$\mathcal{D} = e^{G}_{g} \otimes e^{D}_{d} \otimes e^{B_b}_{b_b} \otimes e^{B_d}_{b_d}. \qquad (5)$$

We use $\mathcal{D}_s$ to refer to the $s$th factor (from the left), e.g., $\mathcal{D}_2 = e^{D}_{d}$.

For example, the data distribution basis for Fig. 2 is $e^{2}_{g} \otimes e^{4}_{d} \otimes e^{2}_{b_b} \otimes e^{4}_{b_d}$, where the size of each physical block is four, each logical block contains two physical blocks, there are four disks, and the inputs are stored on two logical tracks. The data distribution basis for Fig. 1 can be written as $e^{4}_{g} \otimes e^{4}_{d} \otimes e^{4}_{b_d}$, where $B_b = 1$. A selected portion of the distribution basis in (5) can be used to obtain the indexing function needed to denote a particular data unit such as a logical track or a physical track. Let

$$\text{logical-track}(\mathcal{D}) = e^{G}_{g}, \qquad (6)$$

$$\text{physical-track}(\mathcal{D}) = e^{G}_{g} \otimes e^{B_b}_{b_b}. \qquad (7)$$

Then the indexing function for accessing the physical tracks can be obtained by linearizing physical-track($\mathcal{D}$). Similarly, we can have tensor bases which denote the records inside a logical track and a physical track, respectively. These tensor bases are called the logical track-element basis ($e^{D}_{d} \otimes e^{B_b}_{b_b} \otimes e^{B_d}_{b_d}$) and the physical track-element basis ($e^{D}_{d} \otimes e^{B_d}_{b_d}$), respectively. An interesting point to note is that the logical track-element basis can be obtained by deleting the bases corresponding to the logical track index from the data distribution basis $\mathcal{D}$. Similarly, the physical track-element basis can be obtained by deleting the bases corresponding to the physical track index from the data distribution basis $\mathcal{D}$. Formally,

$$\text{logical-track-element}(\mathcal{D}) = \mathcal{D} - \text{logical-track}(\mathcal{D}),$$

and

$$\text{physical-track-element}(\mathcal{D}) = \mathcal{D} - \text{physical-track}(\mathcal{D}),$$

where the basis difference operator, denoted as $-$, is defined as:


Definition 3.2. Let $S$ and $G$ be two tensor bases. Their difference is denoted as $S - G$ and is a tensor basis which is constructed by deleting all of the vector bases in $G$ from $S$.

Fig. 2. The data organization for $N = 64$ inputs with $B_d = 4$, $D = 4$, and $B = 8$. Each column is a disk. The first left shadowed box denotes an example logical block. There are two logical tracks, $LT_0$ and $LT_1$, each of which consists of two physical tracks.

3.3 Tensor Bases for Data Access

For fixed input and output data distribution bases, different orders of instantiating the indices in the indexing function of the data distribution bases (as defined in (5)) correspond to different access patterns for out-of-core data. For example, if we instantiate the indices in the order from right to left (which is what we have used to interpret a tensor basis so far), i.e., $g$ is the slowest and $b_d$ is the fastest changing index, then we actually access data first in the first logical block of the first disk and then access the first logical block in the second disk. After finishing the access to the first logical track sequentially, the second logical track is accessed, and so on. This data access pattern can be better understood by examining the following code, which uses the indices in each vector basis as loop index variables.

    DO g = 0, G-1
      DO d = 0, D-1
        DO b_b = 0, B_b-1
          DO b_d = 0, B_d-1
            read(g*B_b*D*B_d + d*B_b*B_d + b_b*B_d + b_d)
          ENDDO
        ENDDO
      ENDDO
    ENDDO

If we instantiate the index $b_b$ in $e^{B_b}_{b_b}$ after the index $d$ in $e^{D}_{d}$ in (5), then it results in an access pattern where first the data along a physical track is accessed and then the successive physical tracks are accessed. This change in the instantiation order of the indices can be regarded as a permutation of the data distribution basis. We call a permutation of a data distribution basis a loop basis.⁴ For the above example, the loop basis can be denoted as:

$$\mathcal{L} = e^{G}_{g} \otimes e^{B_b}_{b_b} \otimes e^{D}_{d} \otimes e^{B_d}_{b_d}. \qquad (8)$$

Together, a data distribution basis and a loop basis specify a data access pattern. To synthesize a program with this data access pattern, every index in a loop basis may correspond to a loop in the generated loop nest. Moreover, the order of the loops in the loop nest is determined by the order of the vector bases in the loop basis. A program which accesses out-of-core data as specified by the loop basis denoted by (8) is shown below.

    DO g = 0, G-1
      DO b_b = 0, B_b-1
        DO d = 0, D-1
          DO b_d = 0, B_d-1
            read(g*B_b*D*B_d + d*B_b*B_d + b_b*B_d + b_d)
          ENDDO
        ENDDO
      ENDDO
    ENDDO

Note that in the above program, the indexing function for accessing each record is obtained by linearizing the data distribution basis. The order of the loops is specified by the loop basis. In terms of programs, a loop basis can be understood as a notation specifying how to reorder the loop nests and, further, how to split a loop nest [3].

⁴ Let $S$ be a tensor basis and $S = \bigotimes_{s=1}^{q} e^{x_s}_{i_s}$. Let $\pi$ be a permutation on $[1 \ldots q]$; then a permutation of $S$ is a tensor basis defined as follows: $\pi(S) = \bigotimes_{s=1}^{q} e^{x_{\pi(s)}}_{i_{\pi(s)}}$.

4 SYNTHESIZING I/O-EFFICIENT PROGRAMS

In this section, we first give an overview of our program synthesis framework. We then describe the structure of the generated program and how the program can be obtained from an augmented tensor basis. In the following section, we describe how to compute the augmented tensor basis to obtain the desired program structure.

4.1 Overview of Program Synthesis

The three major steps in synthesizing efficient parallel I/O programs for a block recursive algorithm are shown in Fig. 3. The first step transforms the input tensor product formula into an efficient form. It uses the target machine parameters and the properties of tensor products to obtain the efficient form using either a greedy approach or an approach based on dynamic programming. It also determines the appropriate input and output data distributions for implementing the transformed formula. The second and the third steps are applied to each computational step, which is represented by a tensor product. In the final program, an outermost loop structure is used to construct the program for the overall tensor product formula. More

specifically, the second step decomposes the computation of each tensor product into subcomputations by analyzing data access patterns and exploiting locality and concurrency. The results of these analyses are represented as an augmented tensor basis. The augmented tensor basis consists of the following four components: data distribution bases, loop bases, subcomputations, and memory-loads. These four components are then used by the third step, the code generation algorithm, to generate parallel I/O programs.

Fig. 3. Synthesizing parallel out-of-core programs.

Our presentation of the derivation of efficient implementations for the block recursive algorithms is in the reverse order of Fig. 3. We first present a procedure for code generation by using the information contained in the augmented tensor basis. Then we determine efficient implementations for a stride permutation and a simple tensor product with a given data distribution on a given model by determining the corresponding augmented tensor bases. Further, we develop a simple algorithm to determine the data distribution which can result in an efficient implementation. Furthermore, we use a dynamic (or a multistep dynamic) programming algorithm to determine an efficient implementation for the block recursive algorithms. The dynamic programming algorithm will use the properties of tensor products and the performance of each tensor product. The method of estimating the performance for each tensor product will be presented in Section 5 and Section 5.3 with the analysis of the second step (determining augmented tensor bases).

4.2 Structure of the Generated Parallel I/O Code

To minimize the number of I/O operations for a synthesized program for a tensor product, we need to exploit locality by reusing the loaded data. This requires decomposing the computation and reorganizing data and data access patterns to maximize data reuse. In the synthesized program, the same subcomputation is performed several times over different data sets. Hence, the loop structure of the synthesized program is constructed as an outer loop nest enclosing three inner loop nests: a read loop nest, a computation loop nest, and a write loop nest. The read loop nest loads out-of-core data without overflowing main memory. The computation loop nest performs the subcomputation on a memory-load. The write loop nest writes the output to the disk. The data sets are accessed one track at a time using the parallel primitives parallel_read and parallel_write.

To reflect the structure of the outer and inner loops described above, we need to separate the input loop basis into three parts: 1) the part specifying memory-loads ($\mathcal{L}^{\mathrm{in}}_n$), 2) the part specifying the physical tracks in a memory-load ($\mathcal{L}^{\mathrm{in}}_m$), and 3) the part specifying the records within a track ($\mathcal{L}^{\mathrm{in}}_e$). Under our striped I/O model, each I/O operation reads or writes one physical track at a time. Hence, in the synthesized program, the loops which correspond to $\mathcal{L}^{\mathrm{in}}_e$ may not appear explicitly. Formally, we can write the input loop basis as follows:

$$\mathcal{L}^{\mathrm{in}} = \mathcal{L}^{\mathrm{in}}_n \otimes \mathcal{L}^{\mathrm{in}}_m \otimes \mathcal{L}^{\mathrm{in}}_e, \qquad (9)$$

where we call $\mathcal{L}^{\mathrm{in}}_n$ a memory basis, since each instantiation of the indices in $\mathcal{L}^{\mathrm{in}}_n$ corresponds to a memory-load. Similarly, we can separate the output loop basis as follows:

$$\mathcal{L}^{\mathrm{out}} = \mathcal{L}^{\mathrm{out}}_n \otimes \mathcal{L}^{\mathrm{out}}_m \otimes \mathcal{L}^{\mathrm{out}}_e. \qquad (10)$$

Moreover, our method of determining loop bases will guarantee that $\mathcal{L}^{\mathrm{out}}_n$ is a permutation of $\mathcal{L}^{\mathrm{in}}_n$. Furthermore, in order to have a common outer loop nest, we set $\mathcal{L}^{\mathrm{in}}_n = \mathcal{L}^{\mathrm{out}}_n$.

To minimize the parallel I/O operations, it is desirable that the synthesized program makes a single pass over the input data. That is to say, each memory-load should have the following perfect memory-load property: The input data elements of the memory-load can be organized to form a set of tracks consistent with the input data distribution, and the output data elements of the memory-load can be organized to form a set of tracks consistent with the output data distribution. If we can construct perfect memory-loads, then we can synthesize a program which accesses out-of-core data only once (called a one-pass program). However, for some computations, it may not be possible to construct perfect memory-loads. For these computations, the synthesized program keeps only part of the records from a loaded physical track in the main memory and discards the other records. Therefore, in a multiple-pass program the same physical track is loaded several times.

In terms of input and output loop bases, perfect memory-loads can be constructed if $\mathcal{L}^{\mathrm{in}}_e$ and $\mathcal{L}^{\mathrm{out}}_e$ consist of the physical-track-element bases from the input and output data distribution bases, respectively. Hence, initially, we assume that the initial loop bases $\mathcal{L}^{\mathrm{in}}$ and $\mathcal{L}^{\mathrm{out}}$ have the properties that $\mathcal{L}^{\mathrm{in}}_e$ and $\mathcal{L}^{\mathrm{out}}_e$ consist of the physical-track-element bases from the input and the output data distribution bases, respectively. If it turns out that a single-pass program cannot be synthesized for the computation, then $\mathcal{L}^{\mathrm{in}}_e$ (or $\mathcal{L}^{\mathrm{out}}_e$) is further factorized into two parts, $\mathcal{L}_{e1}$ and $\mathcal{L}_{e2}$. Further, $\mathcal{L}_{e2}$ is moved out of $\mathcal{L}_e$ and put into $\mathcal{L}_n$. This moved tensor basis is used to determine which portions of a physical block should be kept for the current memory-load. The size of this moved vector basis is equal to the number of times the same physical tracks are loaded.

4.3 Parallel I/O Code Generation

In this section, we first define the augmented tensor basis and then describe the generic code generation routine which uses the augmented tensor basis to generate parallel I/O code.
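The loop structure described in Section 4.2 (an outer loop over memory-loads enclosing a read nest, a computation nest, and a write nest) can be sketched in Python for the $F_8 \otimes I_8$ example of Section 3. This is a minimal simulation under stated assumptions, not the code produced by the synthesis framework: `Disks`, `locate`, and `one_pass` are hypothetical names, physical tracks are numbered from 0, the parameters $N = 64$, $D = 4$, $B_d = 4$, $B = 8$, and $M = 32$ are taken to be those of the running example, and the arithmetic of $F_8$ is replaced by an arbitrary `kernel` on eight records, since only the access pattern matters here.

```python
# Assumed parameters of the running example: N records, D disks, physical
# blocks of B_D records, logical blocks of B records, memory of M records.
N, D, B_D, B, M = 64, 4, 4, 8, 32
TRACK = B_D * D          # records per physical track
B_B = B // B_D           # physical blocks per logical block
F_SIZE = 8               # size of the (stand-in) transform F_8
STRIDE = N // F_SIZE     # stride introduced by the identity factor I_8


def locate(r):
    """Map record index r to (physical track, offset in track) under cyclic(B)."""
    g, rest = divmod(r, B * D)      # logical track index
    d, rest = divmod(rest, B)       # disk index
    b_b, b_d = divmod(rest, B_D)    # physical block in logical block, offset
    return g * B_B + b_b, d * B_D + b_d


class Disks:
    """Striped disk array: one parallel I/O moves one whole physical track."""

    def __init__(self, records):
        self.tracks = [[None] * TRACK for _ in range(len(records) // TRACK)]
        for r, value in enumerate(records):
            t, off = locate(r)
            self.tracks[t][off] = value
        self.ios = 0                # number of parallel I/O operations issued

    def parallel_read(self, t):
        self.ios += 1
        return list(self.tracks[t])

    def parallel_write(self, t, data):
        self.ios += 1
        self.tracks[t] = list(data)


def one_pass(src, dst, kernel):
    """Apply kernel to every group of F_SIZE records at stride STRIDE."""
    num_logical_tracks = N // (B * D)
    for b_b in range(B_B):                       # outer loop: one memory-load
        tracks = [g * B_B + b_b for g in range(num_logical_tracks)]
        assert len(tracks) * TRACK <= M          # memory-load fits in memory
        memory = {t: src.parallel_read(t) for t in tracks}   # read loop nest
        for j in range(b_b * B_D, (b_b + 1) * B_D):          # computation nest
            idx = [i * STRIDE + j for i in range(F_SIZE)]
            col = [memory[locate(r)[0]][locate(r)[1]] for r in idx]
            for r, value in zip(idx, kernel(col)):
                t, off = locate(r)
                memory[t][off] = value
        for t in tracks:                                      # write loop nest
            dst.parallel_write(t, memory[t])


x = list(range(N))
src, dst = Disks(x), Disks([None] * N)
one_pass(src, dst, kernel=lambda col: col[::-1])  # reversal stands in for F_8
print(src.ios, dst.ios)
```

With these assumed parameters each memory-load holds two physical tracks, every track is read once and written once, and the total of eight parallel I/Os corresponds to a one-pass program.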

Fig. 4. Procedure of code generation for a tensor product.

Fig. 5. Code for $I \otimes F \otimes I$, where $X$ is an array of size $M$ and $A = I \otimes F \otimes I$.

An augmented tensor basis for a single-processor multidisk system includes data distribution bases, loop bases, memory-loads, and operations on each memory-load. Moreover, for a tensor product computation, the input and output data may be organized and accessed differently. We therefore need to use an input data distribution basis, an output data distribution basis, an input loop basis, and an output loop basis to denote them, respectively.

Definition 4.3. An augmented tensor basis consists of the following four components:

1. Data Distribution Basis. Let the data be distributed by cyclic($B$) on $D$ disks. Let $B = B_b B_d$ and let the number of data elements be $N$, where $N = GBD$. Then the (input or output) data distribution basis has the form:

$$\mathcal{D} = e^{G}_{g} \otimes e^{D}_{d} \otimes e^{B_b}_{b_b} \otimes e^{B_d}_{b_d}. \qquad (11)$$

2. Loop Basis. An (input or output) loop basis has the following generic form,

$$\mathcal{L} = \mathcal{L}_n \otimes \mathcal{L}_m \otimes \mathcal{L}_{e1}, \qquad (12)$$

where $\mathcal{L}_{e1}$ is a subset of $\mathcal{L}_e$, with $\mathcal{L}_e = \mathcal{D}_2 \otimes \mathcal{D}_4$ and $\mathcal{L}_{e1} = \mathcal{L}_e - \mathcal{L}_{e2}$. As described in Section 4.2, the size of $\mathcal{L}_{e2}$ (i.e., $|\mathcal{L}_{e2}|$) depends upon the tensor product being implemented. The procedure for determining $\mathcal{L}_{e2}$ is described in the following sections. $\mathcal{L}_m$ consists of the last portions of $\mathcal{D} - \mathcal{L}_{e1}$ such that $|\mathcal{L}_m| = M / |\mathcal{L}_{e1}|$, and $\mathcal{L}_n = \mathcal{D} - \mathcal{L}_m - \mathcal{L}_{e1}$.

3. Memory-Load. The records in each memory-load are denoted by $\mathcal{L}_m \otimes \mathcal{L}_{e1}$. More specifically, each memory-load is obtained by an instantiation of the indices in $\mathcal{L}_n$, looping over the indices in $\mathcal{L}_m$ and using $\mathcal{L}_{e2}$ to identify which portions of each loaded physical track should be kept for the current memory-load.

4. Subcomputation. The decomposed computation which will be applied to each memory-load.

Note that the input and the output data distribution bases can be different. Moreover, the input data distribution basis can be obtained by factoring the input basis. The output data distribution basis can be obtained by applying the corresponding tensor product or stride permutation to the input data distribution basis. Using this augmented tensor basis and assuming that the input and output memory bases coincide ($\mathcal{L}^{\mathrm{in}}_n = \mathcal{L}^{\mathrm{out}}_n$), a generic program can then be obtained as described in Fig. 4.

Further, Fig. 5 shows an example synthesized program for $I \otimes F \otimes I$. We assume that $M = 16$, $D = [\ldots]$, $B_d = [\ldots]$, $B = [\ldots]$, $F$ is a $[\ldots]$ matrix, and data are distributed in a cyclic manner. It uses $e^{8}_{g} \otimes e^{D}_{d} \otimes e^{B_d}_{b_d}$ as both the input and the output distribution bases. The input and the output loop bases are also the same, namely $e_{g_2} \otimes e_{g_1} \otimes e^{D}_{d} \otimes e^{B_d}_{b_d}$, where $e_{g_2} \otimes e_{g_1}$ is a factorization of $e^{8}_{g}$. The subcomputation is denoted by $I \otimes F \otimes I$. The memory basis is $e_{g_2}$. The details of how to determine this information are discussed in Section 5.3.
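The in-memory subcomputation $I \otimes F \otimes I$ mentioned above can be illustrated with a short Python sketch that applies $(I_r \otimes F \otimes I_c)$ to a vector without forming the large matrix. This is only an illustration under stated assumptions and not the program of Fig. 5: the helper names `kron`, `identity`, `matvec`, and `apply_ifi` are hypothetical, matrices are plain lists of rows, and the $2 \times 2$ butterfly $F_2$ is used as a sample $F$.

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def identity(n):
    """The n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matvec(a, x):
    """Ordinary matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in a]


def apply_ifi(f, r, c, x):
    """Compute (I_r (x) F (x) I_c) x, where F is v x v and len(x) == r*v*c."""
    v = len(f)
    y = list(x)
    for i in range(r):            # I_r: r independent contiguous segments
        for k in range(c):        # I_c: c interleaved sub-vectors at stride c
            idx = [i * v * c + j * c + k for j in range(v)]
            for pos, value in zip(idx, matvec(f, [x[p] for p in idx])):
                y[pos] = value
    return y


f2 = [[1, 1], [1, -1]]
x = list(range(8))
explicit = matvec(kron(kron(identity(2), f2), identity(2)), x)
print(apply_ifi(f2, 2, 2, x) == explicit)
```

The left identity factor splits the vector into independent contiguous segments and the right identity factor makes $F$ act at stride $c$, which is the source of the strided access patterns discussed in Section 3.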

9 LI ET AL: SYTHESIZIG EFFICIET UT-F-CRE PRGRAMS FR BLCK RECURSIVE ALGRITHMS USIG BLCK-CYCLIC DATA SYTHESIZIG PRGRAMS FR STRIDE PERMUTATIS In this sction, w discuss how to dtrmin an fficint augmntd tnsor basis for strid prmutations using a cyclic B distribution ur goal is to dcompos computations into a squnc of subcomputations prformd on prfct mmory-loads In th cas that prfct mmoryloads cannot b constructd, w minimiz th numbr of tims th data is loadd for ach mmory-load In doing so, w nsur that ach physical track of th output is writtn out only onc W first dvlop an approach to dtrmining th input and output loop bass for th givn distribution cyclic B Basd on ths loop bass and data distribution bass, w dtrmin mmory-loads and oprations on th mmory-loads Following this, a program can b synthsizd by using th procdur prsntd in Sction 3 Th cost of th program can also b dtrmind from th loop bass W summariz our rsults in th following thorm and thn prsnt a constructiv proof which constructs th augmntd tnsor basis Dfinition 51 Lt Y ˆ L PQ Q X, whr PQ ˆ and X and Y ar input and output vctors with lngth, rspctivly Lt X and Y b distributd according to cyclic B and th data distribution bass b dnotd as and, rspctivly Furthr, lt ˆ and ˆ Thn, a program can b synthsizd with B 1 maxf1; dd dj jb d D M g 6 paralll I/ oprations for th strid prmutation Y ˆ L PQ Q X Proof W prsnt an algorithm, as shown in Fig 6, for dtrmining th input and th output loop bass Th algorithm is furthr xplaind in Stp 1 as shown blow In Stp and Stp 3, w show how to construct mmory-loads and oprations for a mmory-load In Stp, w show that I/ costs can b obtaind from this information 1 Dtrmin input and output loop bass W bgin with th following construction for th input and th output loop bass, ˆ ; ˆ ; 13 1 whrwusthconvntionthatapparingonth right hand sid rfrs to th original rprsntation, which is qual to 1 3, and apparing on th lft hand sid rfrs to an updat So dos Furthr, w assum that ˆ, ˆ It is asy to vrify that is 
a permutation of . Therefore, they denote the same records. Thus, if the number of records denoted by | | is less than the size of the main memory, then we can simply take m = and m = . However, the number of the records denoted by | | may exceed the size of the main memory. In that case, we want to construct memory-loads which can be obtained by reading the input data several times while writing the output data only once. In terms of tensor bases, as we discussed in Section 3, this reloading can be achieved by looping over part of the indices in . In other words, we need to factor as and 1 such that the instantiation of the indices in selects which subblocks should be kept for a loaded physical track and the instantiation of the indices in 1 denotes records inside each subblock. Further, | | is equal to the number of times we will reload each physical track. This reloading is achieved by taking m = and moving before m.

6. The notation |S| denotes the size of the tensor basis S, which is equal to the product of the dimensions of each vector basis in S.

Fig. 6. Algorithm for determining input and output loop bases for stride permutations.

In summary, the input and output loop bases in (13) and (14) are modified as follows: Factor such that m consists of the last factors of the factored tensor basis and the size of m is equal to M/(B_d D). For the input loop basis, let = m, 1 = . Thus, the input and output loop bases can be written as

 = n m 1 ;    (15)
 = n m ;    (16)

where n = m 1 and n = m. We further verify the following facts: First, m 1 and m contain the same vector bases, although in a different order [17]. Second, from the previous results, we have that | m 1 | = | m | = M. Therefore, the records denoted by them can fit into a memory-load. Third, since | m | > | m | = M/(D B_d), loading | m | physical tracks will overflow the main memory unless some records are discarded from the loaded tracks. The details for determining which records are to be discarded will be discussed in the next step. Fourth, n and n contain the same vector bases. We therefore can set n = n, which will only change the order of writing results onto physical tracks.

Fig. 7. Parallel I/O program for L 36.

2. Determine memory-load. When | | ≤ M, m = and m = . Therefore, the records denoted by m or m can be used to form a perfect memory-load. However, when this condition is not satisfied, we need to use (15) and (16) as the input and output loop bases, respectively. Because | m 1 | = | m | = M, the size of each memory-load can be set to be equal to the size of the main memory. However, as we mentioned before, we need to discard some records from each loaded track to form the memory-load. This can be done by linearizing . Each instantiation of the indices in will give a set of subblocks in a physical track which should be kept.

3. Determine operations for a memory-load. As we mentioned above, for each memory-load, the tensor vectors in the input and output loop bases which denote the records inside a memory-load are the same, but in a different order. In other words, one is a permutation of the other. Because the input and output loop bases are permutations of the input and output data distribution bases, we actually permute a memory-load of data each time. Therefore, each in-memory operation is nothing more than a permutation for a subset of data distribution bases denoted by m 1 and m. Note that when = , 1 = .

4. I/O cost of synthesized programs. It is easy to see that if | | ≤ M, a one-pass program can be synthesized, i.e., the number of parallel I/Os is 2N/(B_d D). When the above condition does not hold, we keep | 1 | records for each loaded physical track and load the same physical track | | times. Moreover, since | m | = M/(D B_d), it can be easily determined that | | = | | B_d D/M. Because we write out each record only once, the number of parallel I/O operations is (1 + | | B_d D/M) N/(B_d D). Combining these two cases yields the performance
results presented in the theorem. Further, a program with this performance can be synthesized by using the procedure listed in Fig . □

We now use an example to illustrate the methods of determining augmented tensor bases and synthesizing parallel I/O programs for stride permutations. Assume that we have a stride permutation L 36, which can be interpreted as an 8 matrix transposition. The parameters of the model are defined as follows: D = , B_d = , B_b = , and M = 8. Then the input and output data distribution bases can be written as follows:

 = g d b_b b_d ;    (17)
 = g' d' b'_b b'_d :    (18)

Moreover, the output data distribution basis can also be obtained by applying the stride permutation L 36 to the input data distribution basis. In other words, it can be written as:

 = b_b b_d g d :    (19)

Then, following the procedure of the proof of Theorem 51, we can first determine the input and output loop bases as follows. We first factor g as g g1. Then, by the algorithm presented in Fig. 6, we have:

 = d b_d ;  = g d ;    (20)
 m = g ;  m = b d ;  n = g 1 b_b ;  n = g 1 b_b :    (21)

Further, the records denoted by m or m will be used to form perfect memory-loads. The in-core computation can be determined by finding the permutation which permutes m to m. This can be easily determined as L 8. Since | | ≤ M, a one-pass program, as shown in Fig. 7, can be synthesized by using the information determined above and the code generation algorithm presented in the previous subsection.

The procedure of computing L 36 using the synthesized program is illustrated in Fig. 8 and Fig. 9. Fig. 8 shows the input vector when viewed as a matrix and its initial data distribution on two disks. It also shows the first two intermediate subtransposition steps. Fig. 9 illustrates the two subsequent intermediate steps and the final outputs. Each of the intermediate subtransposition steps reads a block of the matrix, transposes the block in the internal memory, and then writes the block onto disks. For clarity, we assume that the outputs are written on a different set of disks.

Fig. 8. Example matrix transposition. (a) Inputs when viewed as an 8 two-dimensional array. (b) Input data distribution on two disks. (c) Load physical tracks T0, T , in-core permutation, and write to physical tracks T0, T . (d) Load physical tracks T1, T5, in-core permutation, and write to physical tracks T , T6.

6 SYNTHESIZING PROGRAMS FOR TENSOR PRODUCTS

In this section, we first present an algorithm to determine efficient loop bases for a tensor product under a given data distribution cyclic(B). Based on these loop bases and data distribution bases, we can determine memory-loads and the operations on each memory-load. In other words, the augmented tensor basis can be obtained. Therefore, a program can be generated by using the procedure discussed in Section 51. We also show that the cost of the synthesized program can be obtained from the algorithm.

Since the computation of the tensor product I_R ⊗ A_V ⊗ I_C does not change the order of the inputs (or it can be computed in-place), we will use the same input and output data distribution bases
for the input and output data and also the same input and output loop bases for programs synthesized in this subsection. Therefore, we will only consider input, input distribution, and input loop bases. We summarize our results as a theorem and then present a constructive proof which constructs the augmented tensor basis.

Before we present the theorem, we first introduce the concept of desired records and discuss several properties of the possible locations in which the desired records may reside on disks. For the tensor product I_R ⊗ A_V ⊗ I_C, the major computational matrix A_V is applied to V input records and these V records have a stride C in the input vector. We call each of

these V records for the first A_V computation a desired record. More specifically, the V desired records can be denoted as {x[iC] | 0 ≤ i ≤ V − 1}. Note that all of the other A_V computations will have a similar data access pattern. For example, the second A_V computation is applied to the V inputs beginning from the second record with the same stride C.

Fig. 9. Example matrix transposition. (a) Load physical tracks T , T6, in-core permutation, and write to physical tracks T1, T3. (b) Load physical tracks T3, T7, in-core permutation, and write to physical tracks T5, T7. (c) Outputs.

We now discuss several properties of the possible locations in which the desired records may reside on disks.

1. The consecutive desired records will first be stored in a logical block, and then the successive desired records will be stored in other logical blocks on other disks. Thus, for example, when C > B_d and VC < B, the number of physical tracks which hold the V desired records is C/B_d rather than C/(B_d D).
2. If the desired records are stored on several disks, then each of these disks will contain the same number of desired records and the desired records in each of these disks are stored in the same relative locations.
3. If the desired records are stored on several logical tracks, then all of the logical tracks which contain the desired records will have the same number of desired records and the desired records in each logical track are stored in the same relative locations.

The correctness of these properties follows from the definition of data distribution, the regular data access pattern of each computational matrix in the input tensor product, and the assumption that all of the parameters in the machine model and the input tensor product are powers of two. For example, the correctness of the first property can be explained as follows. Since VC < B, all of the desired records are stored in the first logical block. The distance between the physical blocks which contain the desired records is C/B_d. Therefore, the number of physical tracks which hold the V desired records is C/B_d. These properties will be used in the proof of the following theorem.

Theorem 6
Let the input data be distributed according to cyclic(B). Let t denote the number of physical tracks where the records for an A_V computation are stored. Then for the tensor product I_R ⊗ A_V ⊗ I_C, where RVC = N and V ≤ M, if t ≤ M/(B_d D), a program can be synthesized with 2N/(B_d D) parallel I/O operations; otherwise, a program can be synthesized with 3Nt/M parallel I/O operations.

The above theorem can also be stated in terms of tensor bases as follows: Let be the input data distribution basis. Let / . Further assume that 1 denotes a subset of and 1 = is moved into the memory basis. Then for the tensor product I_R ⊗ A_V ⊗ I_C, where RVC = N and V ≤ M, if 1 = , a program can be synthesized with 2N/(B_d D) parallel I/O operations; otherwise, a program can be synthesized with | | · 3N/(B_d D) parallel I/O operations. In the following proof of the theorem, we will show how to construct 1 and . We will also prove that | | = (B_d D/M) t.

Proof.

1. Determine input loop basis. If the desired records for an A_V computation are stored in t physical tracks and t ≤ M/(B_d D), then we can simply load the t physical tracks each time and, therefore, a one-pass program can be generated. However, when t > M/(B_d D), we cannot keep all of the records in t physical tracks in the main memory. We take the following simple approach:

We construct M/V sets of desired records by loading each physical track and retaining in the main memory only those records which fall in these sets. Each physical track needs to be reloaded to perform computation on the remaining records.

Fig. 10. Algorithm for determining input loop bases and the value of Z for a tensor product.

In terms of tensor bases, we need to do nothing more than factor and permute the input data distribution basis to reflect this data access pattern. More specifically, we begin with = and n m = , where has the same initial value as defined in Section 5. For a one-pass program, we factor and permute n m to change the order of accessing physical tracks. However, for a multipass program, we need to factor and permute all of the s, since we need to keep part of the loaded records in the main memory and discard the other records. As we discussed before, the part of the records to be kept or discarded can be denoted by a subset of the vector bases in the physical-track-element basis. In order to factor and permute a tensor basis to a desired form, we need to examine the relative values of the parameters in the targeted I/O model, the tensor product, and the size B of the data distribution. We summarize the above ideas as an algorithm in Fig. 10, which is further explained as follows:

Initialization. This step initializes the values of n m, , and several temporary variables. For example, R_b denotes the maximum number of the desired records for an A_V computation in a physical block. R_t is the number of the desired records in a physical track. R_d is the number of disks where the desired records for an A_V are stored. S is the distance between two consecutive physical tracks which contain the desired records. Since the stride of two desired records is C, R_b can be determined as ⌈B_d/C⌉. The correctness of R_t and t can be similarly verified.

Compute. This step invokes a procedure to compute values such as R_d and S. Fig. 11a and Fig. 11b show the details of how to determine those two values. The correctness of the algorithm in Fig. 11a for computing R_d can be proven as follows: When C ≤ B_d, the successive
disks may contain the same number of the desired records if the desired records cannot be stored in one logical block. The number of these successive disks is dependent on the value of V. Further, since there are R_b B_b desired records per logical block and R_b B_b = B/C (since R_b = B_d/C in this case), the number of disks which contain the desired records is equal to the smaller of V/(B/C) and D. This results in the first case of the algorithm. Similarly, when B_d < C ≤ B, the successive disks may contain the desired records. Since in this case each logical block contains B/C desired records, the number of disks which contain the desired records is again equal to the smaller of V/(B/C) and D. For the third case, any two disks which contain two consecutive desired records have a stride C/B. Therefore, R_d = D/(C/B). The last case is trivial. Similarly, we can prove the correctness of the algorithm in Fig. 11b.

One-pass program. This step determines how to access physical tracks. The idea is straightforward. It determines the decompositions and permutations for n m based on the stride between two consecutive physical tracks which contain the desired records. The result from this step may also be needed for the next step to determine the final loop basis for synthesizing multipass programs.

Multipass program. If the number of physical tracks which hold the records for an A_V computation is larger than the number of physical tracks which the main memory can hold, then a multipass program needs to be synthesized. More specifically, we need to determine which portions of the records in a physical track should be kept for each pass of computation. The basic idea of keeping

records for the current memory-load can be described as follows:

First, for each desired record, we want to take X − 1 successive records and keep these X − 1 records with the corresponding desired record as the current memory-load. One approach to determining X is to take X as large as possible. However, X needs to satisfy the following three conditions. First, X must be less than the gap between any two consecutive desired records in a physical block. Second, X must be less than the size of a physical block. Third, all of the desired records with their X − 1 successive records should be able to fit into the main memory, which means that X R_t t ≤ M, or XV ≤ M. These three conditions can be expressed as X = min{C, B_d, M/(R_t t)}.

Fig. 11. (a) Algorithm for computing R_d and (b) algorithm for computing S.

Fig. 12 shows an example of how to construct memory-loads by taking portions of the records from a physical block, where we assume that there are four desired records in a physical block, and C = 2X. The example can be interpreted as follows. The physical block is first broken into eight subblocks. Then we take the records in the odd-numbered subblocks to construct one memory-load and take the records in the even-numbered subblocks to construct another memory-load. In terms of tensor bases, we first decompose Bb b b as b d3 bd X bd1. Then, we permute the resulting tensor basis as b d bd3 X bd1.

Second, we apply a similar idea for disks. For each disk which contains the desired record, we take Y − 1 successive disks and we keep the records at the same relative locations as in the original disk in each successive disk for the current memory-load. We want to take the largest possible value of Y given the condition that the number of the records kept must fit into the main memory. We consider the following two cases. First, X = min{B_d, C}. In this case, either all of the records between any two desired records or all of the records in a physical block are chosen to be kept for the current memory-load. However, if all of the records between any two desired records are chosen, all of the records in a physical block will be covered. Thus, it is identical to
the case that all of the records in a physical block are chosen to be kept. Further, R_d disks contain desired records. Therefore, R_d B_d records are chosen from each physical track. In order not to overflow the main memory, we need R_d Y B_d t ≤ M.

Fig. 12. Constructing portions of memory-loads from a physical block.

Second, X = M/(R_t t). In this case,

we do not choose all of the records between two desired records. However, since we have already chosen the largest possible value for X, the main memory has been filled up in this case. Therefore, we cannot add any more records from successive disks with this approach. In other words, Y = 1.

An example, which is similar to the example shown in Fig. 12, can be constructed for disks. More specifically, if we view the records in a physical block as disks, X as Y, and R_b as R_d, then we have an example for disks. Further, in terms of tensor bases, we can interpret this idea as follows: We first decompose D D d as R d Rd Y d 3 d Y d1. Then, we permute it as D R d Y d R d d 3 Y d1. The resulting tensor basis allows us to access the odd-numbered subset of disks first and then the even-numbered subset of disks.

We now consider an example which contains both disks and records in physical blocks. More specifically, we consider the example in which data can be represented by combining the factored D d and B b b b. Assume that we want to access the records first in the odd-numbered disk subblocks and then in the even-numbered disk subblocks. Further, for each physical block we want to access the records first in the odd-numbered subblocks and then in the even-numbered subblocks. To achieve this data access pattern, we move D R d Y and B d R b X b d from their current locations in D d B b b b to the beginning of D d B b b b. In the algorithm presented in Fig. 10, we have denoted D R d Y R d B d b X b d as . Therefore, to construct each memory-load, we can simply move into n m and put them anywhere in n.^7

For the following analysis, we assume that we have found the subsets of , namely 1 and , by the above algorithm. is moved into the memory basis and will generate loop nests for data access. The other portions of the algorithm, which are used for computing the value of Z, will be discussed in Step 3.

2. Determine memory-load. For a one-pass program, we can simply factor as n m and take | m | = M/(B_d D). For a multiple-pass program, we factor 1 to be n m such that | m | = M/| 1 | and all of the vector bases in appear in
n.

7. More specifically, the initial n m should be modified to ' n ' m, where ' m contains the last factors of n m and | ' m | = M/| 1 |, and ' n contains n m ' m and d.

Moreover, for the multiple-pass program, as discussed in Section 5, we use to determine which records should be kept for the current memory-load.

3. Determine operations for a memory-load. The original tensor product can be regarded as R parallel applications of A_V to the inputs with a stride C. When data are distributed among disks and loaded in units of physical tracks, the net effect is to possibly reduce the stride of the records which each A_V will access in main memory. The operations on a memory-load have the general form I_{M/(VZ)} ⊗ A_V ⊗ I_Z. However, the value of Z will depend on the relative values of the parameters in the target machine model and the input tensor product. The algorithm presented in Fig. 10 can be used to determine this value. The correctness of the value of Z obtained from the algorithm can be proven as follows: For one-pass programs, when C ≤ B_d, we do not change the stride for subcomputations. Therefore, Z = C. Otherwise, the stride will be reduced to be equal to the distance between two consecutive desired records in a physical track, which is equal to D B_d/R_t. For multipass programs, when X = C, we choose all of the records between any two desired records for the current memory-load, so the stride of in-core computation does not change. When X = M/(R_t t), we reduce the stride of in-core computation from C to X. When X = B_d, the next desired record is not in the same physical block. Since we keep Y disks as a subset of disks, we reduce the stride from C to Y B_d.

4. I/O cost of synthesized programs. For a one-pass program which does not move any vector bases in , the number of parallel I/Os is simply equal to 2N/(B_d D). In other words, the synthesized program is optimal in terms of the number of I/Os. For a multipass program, we need to read the inputs | | times. Therefore, the number of parallel I/O operations is | | · 3N/(B_d D). From the algorithm presented in Fig. 10, we can determine that | | = (D B_d/M) t. We therefore can attain the performance presented in the theorem. The constant 3 can be explained as follows: When we store a physical track,
we need to read that physical track into main memory again, since portions of the records in that physical track have been discarded. By reloading this physical track, we can reassemble the physical track with the part of updated records and then write it out in parallel. Otherwise, part of the records to be written out in that physical track may not be correct. Further, "reassembling" the physical track needs to use the tensor basis (notice that is equal to ) to put the updated records into the correct locations on the physical track. This is similar to using to take subblocks out from a loaded physical track for the current memory-load.
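The cost accounting in Step 4 can be checked numerically with a short sketch. It assumes the costs are read as 2N/(B_d D) parallel I/Os for a one-pass program and (B_d D t/M) · 3N/(B_d D) = 3Nt/M for a multipass program, with all parameters powers of two as the text assumes. The function name and the sample values are illustrative and do not come from the paper.

```python
def tensor_product_parallel_ios(N, M, B_d, D, t):
    """Parallel I/O count for I_R (x) A_V (x) I_C, as read from the theorem.

    N: input length, M: main-memory size in records,
    B_d: physical block size, D: number of disks,
    t: physical tracks holding the records of one A_V computation.
    All arguments are assumed to be powers of two.
    """
    track = B_d * D                  # records in one physical track
    tracks_per_load = M // track     # tracks that fit in main memory
    if t <= tracks_per_load:
        # One pass: read every track once and write every track once.
        return 2 * (N // track)
    # Multipass: each track is read, re-read for reassembly, and written,
    # and the whole input is traversed (B_d * D * t) / M times.
    reloads = (track * t) // M
    return 3 * reloads * (N // track)


# Hypothetical parameters, chosen only to exercise both branches.
print(tensor_product_parallel_ios(2**20, 2**10, 2**3, 2**2, 16))
print(tensor_product_parallel_ios(2**20, 2**10, 2**3, 2**2, 64))
```

In the multipass branch the result equals 3Nt/M, so the cost depends on the distribution only through t, which is why t alone suffices to estimate performance.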

Fig. 13. Algorithm for computing the efficient size of data distributions.

Now, a program with the performance discussed above can be synthesized by using the procedure listed in Fig . However, to be accurate, when synthesizing a multipass program, we need to incorporate the idea of "reassembling" a physical track into the write-out part of the procedure listed in Fig , which, as we discussed above, is nothing more than using the linearization of to put subblocks in the current memory-load into the correct locations of the reloaded physical track. □

Note that the value of t can be determined at the initialization step. Therefore, the performance of the synthesized program for a tensor product can be determined without generating the whole augmented tensor basis. This result is used in the first phase of transforming tensor product formulas, where we need the performance value for each tensor product to determine efficient transformations.

6.1 Determining Efficient Data Distributions

In the previous sections, we presented approaches for synthesizing efficient I/O programs for a given data distribution. We now present an algorithm to determine a data distribution which optimizes the performance of the synthesized program. The idea of the algorithm is as follows: We begin with the physical track distribution cyclic(B_d), i.e., initially B = B_d. If a one-pass program can be synthesized under this data distribution, then B_d is the desired block size for the data distribution. Otherwise, we double the value of B. If the performance of the synthesized program under this distribution increases, we continue this procedure. Otherwise, the algorithm stops and the current block size is the desired size of data distributions. We formalize this idea in Fig. 13.

6.2 Transforming Tensor Product Formulas

In this section, we discuss techniques of program synthesis for tensor product formulas. There are several strategies for developing I/O-efficient programs, such as exploiting locality and exploiting parallelism in accessing the data. Similar ideas have been discussed in [15], where they use factor grouping to exploit locality and data rearrangement
to reduce the cost of I/O operations. We have also presented a greedy method which uses factor grouping to improve the performance of block recursive algorithms for Vitter and Shriver's striped two-level memory model with a fixed block size of data distribution [10].

Factor grouping combines contiguous tensor products in a tensor product formula and therefore reduces the number of passes to access secondary storage. Consider the core Cooley-Tukey FFT computation, which does not contain the initial bit-reversal operation and the twiddle factor computation. For i = 2 and i = 3, we have the tensor products I n F I and I n 3 F I, respectively. Assuming that each of these tensor products can be implemented optimally, the number of parallel I/O operations required to implement these two steps individually is 4N/(D B_d). However, they are contiguous tensor products in ( ). Hence, by using the properties of tensor products, such as Properties 1 and listed in Section 3, they can be combined into one tensor product,

I n F I I n 3 F I = I n 3 I F I I n 3 F I I
 = I n 3 I F I I n 3 F I I
 = I n 3 F F I ;

which may also be implementable optimally by using only 2N/(D B_d) parallel I/O operations.

Data rearrangement uses the properties of tensor products to change data access patterns. For example, the tensor product I_R ⊗ A_V ⊗ I_C can be transformed into the equivalent form (I_R ⊗ L^{VC}_V)(I_{RC} ⊗ A_V)(I_R ⊗ L^{VC}_C). In the best case, the number of parallel I/Os required is 6N/(D B_d) after using this transformation, since at least three passes are needed for the transformed form. Because of the extra passes introduced by this transformation, it is not profitable to use it for our targeted machine model. Further, the first and the last terms in the transformed formula may not be implementable optimally. Therefore, we have not incorporated this transformation into our current optimization procedures.

6.2.1 Minimizing I/O Cost by Dynamic Programming

Since factor grouping (as shown above) and the size of the data distribution (as will be shown in the next section) have a large influence on the performance of synthesized programs, we take the following approach for determining an optimal manner in which a tensor product formula can be
implemented. We use the algorithm for determining the optimal data distribution presented in Fig. 13 as a main routine. However, for each cyclic(B) data distribution, we use a dynamic programming algorithm to determine the optimal factor grouping. Hence, we also call this method a multi-step dynamic programming method.

TABLE 1 Number of I/O Passes for Stride Permutation L^{PQ}_Q (D = , B_d = , M = 6, and N = PQ = 08)
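The two levels of this method can be sketched as follows: an outer search that doubles the block size B while the cost keeps improving, and an inner interval dynamic program over contiguous factor groupings. This is a minimal sketch under stated assumptions, not the algorithm of Fig. 13 itself. The callbacks `group_cost` and `cost_for_block_size` and the bound `B_max` are hypothetical stand-ins for the cost results derived earlier, and the sketch returns the last block size that improved the cost, which is one reading of "the current block size".

```python
import math
from functools import lru_cache


def min_passes(num_factors, group_cost):
    """Fewest I/O passes for factors 0..num_factors-1 (interval DP).

    group_cost(i, j) is a hypothetical callback giving the passes needed
    when factors i..j are grouped into one tensor product, or math.inf
    when the grouped operation would not fit in main memory.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        cost = group_cost(i, j)          # group everything from i to j
        for k in range(i, j):            # or split between k and k + 1
            cost = min(cost, best(i, k) + best(k + 1, j))
        return cost

    return best(0, num_factors - 1)


def choose_block_size(B_d, B_max, cost_for_block_size):
    """Double B, starting from B_d, while the cost strictly improves."""
    B = B_d
    best_cost = cost_for_block_size(B)
    while 2 * B <= B_max:
        cost = cost_for_block_size(2 * B)
        if cost >= best_cost:            # no improvement: stop searching
            break
        B, best_cost = 2 * B, cost
    return B, best_cost


# Toy cost model: at most two factors can be grouped into one pass.
toy_group_cost = lambda i, j: 1 if j - i + 1 <= 2 else math.inf
print(min_passes(5, toy_group_cost))

# Toy per-distribution costs; in the multi-step method each entry would
# itself come from min_passes evaluated under that block size.
toy_costs = {8: 5, 16: 3, 32: 2, 64: 4, 128: 1}
print(choose_block_size(8, 128, toy_costs.__getitem__))
```

Because the outer loop stops at the first block size that fails to improve, it can miss a better size further on (128 in the toy data), which matches the heuristic character of the doubling procedure.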

TABLE 2 Number of I/O Passes for Stride Permutation L^{PQ}_Q (D = 16, B_d = 51, M = , and N = PQ = 50)

Let C[i, j] be the optimal cost (the minimum number of I/O passes required to access the out-of-core data) for computing j − i tensor factors from the ith factor to the jth factor in a tensor product formula. Then C[i, j] can be computed as follows:

C[i, j] = C_0, if i = j;
C[i, j] = min_{i ≤ k ≤ j} { C[i, k] + C[k + 1, j] }, if i < j.

In the above formula, C_0 denotes the cost for computing a tensor product. The method of determining the cost of a tensor product has been discussed in Section 53. The values of C_0 can be computed using the results in Theorem 6 and the algorithm presented in Fig. 11a to compute t. The special case of k = j needs to be further explained. When k = j, we assume that C[j + 1, j] = 0 and we use C[i, k] to represent the cost of grouping all the tensor product factors from i to j together. Because the grouped tensor product is a simple tensor product, the value of C[i, k] in this case can also be determined by using the results in Theorem 6 and the algorithm presented in Fig. 11a to compute t. However, in this case, if k − i > M, or the size of the grouped operations is larger than the size of the main memory, we do not want to group all of the k − i factors together. We assign a large value, such as ∞, to C[k, j] to prevent it from being selected.

M_t = M/(B_d D) is the maximum number of physical tracks in a memory-load. We can verify that the results presented here are more comprehensive than the results presented in [10]. In most cases, using the approach presented in Section 53, we can actually synthesize programs with better performance. For example, when VC > M, M < V D B_d, and M > V B_d, from [10], a program with V B_d D/M passes will be synthesized. However, for these conditions, we have that C > B_d and VC > M. If we further assume that C < B_d D, then from the results in Table 3 and Table 4, we can synthesize a program with VC/M passes, which is less than V B_d D/M.

We now show that by using an appropriate cyclic(B) data distribution, a better-performing program can be synthesized for most of the cases. Several
typical examples are shown in Table 6. We notice that when we increase B, we can reduce the number of passes of data access for most of the cases and the decrease in the number of passes can be as large as eight times. The values in the table also suggest that we can use the algorithm presented in Fig. 13 to find an efficient size of data distributions for a given tensor product. We also notice that for some cases, such as C ≤ B_d, we cannot improve the performance. The reason is that the stride required by A_V is less than the size of the physical block and we cannot reduce it further by redistribution.

7 PERFORMANCE RESULTS OF SYNTHESIZED PROGRAMS

7.1 Matrix Transposition

Given the flexibility of choosing different data distributions, we can synthesize programs with better performance than those obtained using fixed-size data distributions for stride permutations. We present a set of experimental results for the number of I/O operations required by the cyclic(B_d) distribution and the cyclic(B) distribution, where the size B of the distribution varies. These results are summarized in Table 1 and Table 2. From the tables, we can see that the number of passes is not a monotonically increasing or decreasing function. However, it normally decreases and then increases as B is increased. Therefore, it is likely that the algorithm in Fig. 13 will find an efficient size of data distributions.

7.3 Tensor Product Formulas

We show the effectiveness of the multistep dynamic programming method by comparing the programs synthesized by it with the programs synthesized by the greedy method and the dynamic programming method (applied to a data distribution of fixed size), respectively. The example we use is the core Cooley-Tukey FFT computation. The results for several typical sizes of inputs are shown in Table 7. We find that using dynamic programming for a fixed-size cyclic(B_d) distribution normally cannot improve performance over the greedy method. However, by using the multistep dynamic programming method, we can reduce

TABLE 3 Number of I/O Passes for the Tensor Product I_R ⊗ A_V ⊗ I_C

7.2 Tensor Products

The numbers of I/O passes required by the synthesized programs are summarized in Table 3, Table 4, and Table 5 by going through various
cases of t. In those tables,

TABLE 4 Number of I/O Passes for the Tensor Product I_R ⊗ A_V ⊗ I_C

TABLE 5 Number of I/O Passes for the Tensor Product I_R ⊗ A_V ⊗ I_C

TABLE 6 Number of I/O Passes for the Tensor Product I_R ⊗ A_V ⊗ I_C with Various Data Distributions (D = 16, B_d = 51, M = , and N = RVC)

TABLE 7 Number of I/O Passes for the Synthesized Programs Using Greedy, Dynamic Programming (DP), and Multiple-Step Dynamic Programming (MDP) Methods (D = 16, B_d = 51, and M = )

the number of passes for the synthesized programs by at least 1 if N is very large. Because the input size is large, the performance gain from eliminating even one pass to access out-of-core data is significant.

8 CONCLUSIONS

We have presented a novel framework for synthesizing out-of-core programs for block recursive algorithms using the algebraic properties of tensor products. We used the striped Vitter and Shriver's two-level memory model as our target machine model. However, instead of using the simpler physical track distribution normally used by this model, we used various block-cyclic distributions supported by High Performance Fortran to organize data on disks. Moreover, we use tensor bases as a tool to capture the semantics of data distributions and data access patterns. We showed that by using the algebraic properties of tensor products, we can decompose computations and arrange data access patterns to generate out-of-core programs automatically.

We demonstrated the importance of choosing the appropriate data distribution for efficient out-of-core implementations through a set of experiments. The experimental results also showed that our simple algorithm for choosing the efficient data distribution is very effective. From the observations about the importance of data distributions and factor grouping for tensor products, we proposed a dynamic programming approach to determine the efficient data distribution and the factor grouping. For an example FFT computation, this dynamic programming approach reduced the number of I/O passes by at least one compared to the simpler greedy algorithm.

ACKNOWLEDGMENTS

Supported by US National Science Foundation Grant NSF- IRI ,
Rome Labs Contracts F C-0037, ARPA/SISTO contracts 0001-91-J-1985, and 0001-9- C-018 under subcontract KI-9-01-018.

REFERENCES

[23] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[1] G.E. Blelloch, Vector Models for Data-Parallel Computing. The MIT Press, 1990.
[2] A. Choudhary, I. Foster, G. Fox, K. Kennedy, C. Kesselman, C. Koelbel, J. Saltz, and M. Snir, "Languages, Compilers, and Runtime Systems Support for Parallel Input-Output," Technical Report CCSF-39, Scalable I/O Initiative, Caltech Concurrent Supercomputing Facilities, Caltech, 199 .
[3] T.H. Cormen, "Virtual Memory for Data-Parallel Computing," PhD thesis, Dept. of Electrical Eng. and Computer Science, Massachusetts Inst. of Technology, 199 .
[4] T.H. Cormen and D. Kotz, "Integrating Theory and Practice in Parallel File Systems," Technical Report PCS-TR93-188, Dept. of Math and Computer Science, Dartmouth College, Mar. 1993.
[5] D.L. Dai, S.K.S. Gupta, S.D. Kaushik, J.H. Lu, R.V. Singh, C.-H. Huang, P. Sadayappan, and R.W. Johnson, "EXTENT: A Portable Programming Environment for Designing and Implementing High Performance Block Recursive Algorithms," Proc. Supercomputing '9 , pp. 9–58, 199 .
[6] J. Eklundh, "A Fast Computer Method for Matrix Transposing," IEEE Trans. Computers, vol. 0, no. 7, pp. 801–803, July 197 .
[7] D.G. Feitelson, P.F. Corbett, Y. Hsu, and J.-P. Prost, "Parallel I/O Systems and Interfaces for Parallel Computers," Multiprocessor Systems - Design and Integration, C.-L. Wu, ed., World Scientific, 1995.
[8] J. Granata, M. Conner, and R. Tolimieri, "Recursive Fast Algorithms and the Role of the Tensor Product," IEEE Trans. Signal Processing, vol. 0, no. 1 , pp. ,91– ,930, Dec. 199 .
[9] S.K.S. Gupta, "Synthesizing Communication-Efficient Distributed-Memory Parallel Programs for Block Recursive Algorithms," PhD thesis, The Ohio State Univ., Mar. 1995.
[10] S.K.S. Gupta, Z. Li, and J.H. Reif, "Generating Efficient Programs for Two-Level Memories from Tensor Products," Proc. Seventh IASTED/ISMM Int'l Conf. Parallel and Distributed Computing and Systems, pp. 510–513, Washington D.C., Oct. 1995.
[11] C.-H. Huang, J.R. Johnson, and R.W. Johnson,
"Generating Parallel Programs from Tensor Product Formulas: A Case Study of Strassen's Matrix Multiplication Algorithm," Proc. Int'l Conf. Parallel Processing 199 , pp. 10 –108, Aug. 199 .
[12] J.R. Johnson, R.W. Johnson, D. Rodriguez, and R. Tolimieri, "A Methodology for Designing, Modifying and Implementing Fourier Transform Algorithms on Various Architectures," Circuits Systems and Signal Processing, vol. 9, no. , pp. 50–500, 1990.
[13] R.W. Johnson, C.-H. Huang, and J.R. Johnson, "Multilinear Algebra and Parallel Programming," J. Supercomputing, vol. 5, pp. 189–18, 1991.
[14] S.D. Kaushik, C.-H. Huang, J.R. Johnson, R.W. Johnson, and P. Sadayappan, "Efficient Transposition Algorithms for Large Matrices," Proc. Supercomputing '93, Nov. 1993.
[15] S.D. Kaushik, C.-H. Huang, R.W. Johnson, and P. Sadayappan, "A Methodology for Generating Efficient Disk-Based Algorithms from Tensor Product Formulas," Proc. Sixth Ann. Workshop Languages and Compilers for Parallel Computing, pp. 358–338, Aug. 1993.
[16] B. Kumar, C.-H. Huang, P. Sadayappan, and R.W. Johnson, "An Algebraic Approach to Cache Memory Characterization for Block Recursive Algorithms," Proc. 199 Int'l Computer Symp., pp. 336–3 , 199 .
[17] Z. Li, "Computational Model and Program Synthesis for Parallel Out-of-Core Computation," PhD thesis, Duke Univ., 1996.
[18] C.V. Loan, Computational Frameworks for the Fast Fourier Transform. SIAM, 199 .
[19] R. Paige, J.H. Reif, and R. Wachter, Parallel Algorithm Derivation and Program Transformation. Kluwer Academic, 1993.
[20] H.S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Trans. Computers, vol. 0, no. , pp. 153–161, Feb. 1971.
[21] R. Thakur, R. Bordawekar, and A. Choudhary, "Compilation of Out-of-Core Data Parallel Programs for Distributed Memory Machines," Proc. IPPS '9 Workshop Input/Output in Parallel Computer Systems, pp. 5–7, Apr. 199 . Also appeared in Computer Architecture News, vol. , no. .
[22] J.S. Vitter and E.A.M. Shriver, "Algorithms for Parallel Memory I: Two-Level Memories," Algorithmica, vol. 1 , nos. -3, pp. 110–17, 199 .

Zhiyong Li received the BS and MS degrees in computer science and engineering from Huazhong University of Science and Technology, People's
Republic of China, in 198 and 1987, respectively, and the PhD degree in computer science from Duke University in 1996. From 1987 to 199, he was an assistant professor in the Department of Computer Science and Engineering at Huazhong University of Science and Technology. He worked at the Performance Lab of Sun Microsystems in 1996 and was one of the main designers for standard Java benchmarks. Since 1997, he has been with the IBM Network Computing Software Division at Research Triangle Park, currently working on Internet electronic commerce. Dr. Li has published more than 0 papers in refereed journals and conferences in the area of programming languages, parallel and distributed computing, and artificial intelligence. He has applied for three US patents related to Java and object-oriented technologies.

John H. Reif received the BS (magna cum laude) degree in applied math and computer science in 1973 from Tufts University, and the MS and PhD degrees in applied mathematics from Harvard University, in 1975 and 1977, respectively. He is currently a professor in the Department of Computer Science, Duke University, Durham, North Carolina. He has worked for many years on the development and analysis of parallel and randomized algorithms for various fundamental problems, including solutions of large sparse systems, sorting, and graph problems. He is the author of more than 10 publications to date. His research combines theory and practice. Although primarily a theoretical computer scientist, Prof. Reif also has made a number of contributions to practical areas of computer science, including parallel architectures, robotics, data compression, molecular simulations, and optical computing. He has done a number of implementations of sophisticated parallel algorithms, such as parallel nested dissection on massively parallel machines, as well as implementations of parallel data compression algorithms into special purpose chips. He has focused particularly on emerging new areas, such as biomolecular computing. He is director of the Consortium of Biomolecular Computing and Applications, which consists of most of the major US research groups in biomolecular computing. Dr. Reif has recently had two
books published on parallel algorithms and implementations for which he was editor: Synthesis of Parallel Algorithms (Kluwer Academic Publishers, 1993) and Parallel Algorithm Derivation and Program Transformation (co-edited with R. Paige and R. Wachter). Dr. Reif is a fellow of the ACM (1996), a fellow of the IEEE (1993), and a fellow of the Institute of Combinatorics (1991).

Sandeep K.S. Gupta received the BTech degree in computer science and engineering from the Institute of Technology, Banaras Hindu University, Varanasi, India, 1987, the MTech degree in computer science and engineering from the Indian Institute of Technology, Kanpur, 1989, and the MS and PhD degrees in computer and information science from The Ohio State University, Columbus, Ohio, in 1991 and 1995, respectively. He is currently an assistant professor in the Department of Computer Science at Colorado State University, Colorado. Prior to joining Colorado State University, he held research and teaching positions at Duke University and Ohio University. His research interests include parallel and distributed computing, compilers, and mobile computing. Dr. Gupta is a member of the ACM and the IEEE.
