Three hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Wednesday 16th January 2013 Time: 09:45-12:45

Parallel Programs and their Performance

Please answer any TWO Questions from the FOUR Questions provided.

This is an OPEN book examination.

The use of electronic calculators is permitted provided they are not programmable and do not store text.

Question 1

Consider the following fragments of code that perform some simple numerical linear algebra computations (a vector axpy operation (a linked vector addition and vector scaling), and the multiplication of two lower triangular matrices):

\[ a = \alpha x + y, \qquad B = LM, \]

where $\alpha$ is a scalar, $a$, $x$, $y$ are vectors of length $n$ ($n$ can be assumed to be large) and $B$, $L$, $M$ are $n \times n$ lower triangular matrices. (A lower triangular matrix is one in which all the elements above the diagonal are zero: $L_{ij} = 0$ for $j > i$.)

a) The following FORTRAN code initialises the vectors x, y and implements the vector axpy operation.

    C     vector initialisation
          DO i=1,n
            x(i) = rand()
            y(i) = rand()
          END DO
    C     vector axpy
          DO i=1,n
            a(i) = alpha*x(i) + y(i)
          END DO

Identify, without reference to any particular parallel architecture, the nature of any parallel work in the loops above. (3 marks)

b) One implementation (implementation A) parallelises the second loop (the axpy operation) by including the OMP pragma

    C$omp parallel do schedule(static)

immediately before the second DO statement, and a second implementation (implementation B) parallelises both loops by including the same pragma before each of the DO statements. The execution times (in seconds) of these two implementations on 1-8 cores of chronos (a 16-core AMD Opteron-based server) are as follows (these timings exclude the initialisation loop in each case):

    No. of Cores    Implementation A    Implementation B
         1               0.4594              0.4597
         2               0.5907              0.2308
         3               0.4054              0.1554
         4               0.3638              0.1182
         6               0.2724              0.08048
         8               0.2451              0.06411

Explain these results in terms of parallel overheads. (7 marks)

c) The following FORTRAN code fragment calculates the lower triangular matrix product:

    C     Matrix multiplication
          DO j = 1,n
            DO i = j,n
              B(i,j) = 0.0
              DO k = j,i
                B(i,j) = B(i,j)+L(i,k)*M(k,j)
              END DO
            END DO
          END DO

Identify the nature (and limitations) of any parallel work in the above calculation as it is written. Suggest a parallel implementation of the above calculation suitable for a quad quad-core platform such as chronos; clearly identify all the potential overheads and include consideration of the initialisation of the arrays L and M. (10 marks)
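For reference only, the serial calculation in part c) can be sketched in Python as below. This is a minimal illustration of the loop bounds, assuming 0-based indexing and plain nested lists; the names `lower_triangular_product` and `dense_product` are hypothetical and are not part of the question.

```python
def lower_triangular_product(L, M):
    """Serial reference for B = L*M with L, M lower triangular (0-based indices)."""
    n = len(L)
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # column loop, as in DO j = 1,n
        for i in range(j, n):           # rows on or below the diagonal, as in DO i = j,n
            total = 0.0
            for k in range(j, i + 1):   # only k = j..i can contribute, as in DO k = j,i
                total += L[i][k] * M[k][j]
            B[i][j] = total
    return B


def dense_product(L, M):
    """Full product over all k, used only to check the triangular version."""
    n = len(L)
    return [[sum(L[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Small illustrative matrices (assumed data, chosen so the arithmetic is exact).
L = [[1.0, 0.0, 0.0], [2.0, 3.0, 0.0], [4.0, 5.0, 6.0]]
M = [[7.0, 0.0, 0.0], [8.0, 9.0, 0.0], [1.0, 2.0, 3.0]]
B = lower_triangular_product(L, M)
```

Comparing `B` with `dense_product(L, M)` on a small case is one way to confirm that restricting k to the range j..i loses no terms.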

Question 2

a) Explain what is meant by the execution time overheads of a parallel program (you should clearly identify each different kind of overhead you might expect to occur). Describe how these overheads affect execution of the parallel program. (5 marks)

The following OpenMP/Fortran-like pseudocode (for emphasis, the OpenMP directives are on the left and the Fortran code on the right) implements a parallel divide-and-conquer algorithm using a shared stack to hold the outstanding jobs that need to be computed. The subroutine POP returns the special value NULL if it is executed when the stack is empty.

    DO PARALLEL                          DO WHILE (.NOT. TERMINATED)
    SHARED STACK, OUTPUT, TERMINATED
    PRIVATE JOB, JOB1, JOB2, RESULT
    CRITICAL (STACK)                       POP(TOP OF STACK INTO JOB)
    END CRITICAL
                                           IF (JOB .NE. NULL)
                                             IF (JOB IS LARGE)
                                               CREATE 2 SUBJOBS, JOB1 & JOB2
    CRITICAL (STACK)                           PUSH(JOB1 ONTO STACK)
                                               PUSH(JOB2 ONTO STACK)
    END CRITICAL
                                             ELSE
                                               COMPUTE RESULT (OF JOB)
    CRITICAL (OUTPUT)                          ADD RESULT TO OUTPUT
    END CRITICAL
                                             END IF
                                           ELSE
                                           END IF
    CRITICAL (TERMINATED)                  COMPUTE TERMINATED
    END CRITICAL
                                         END DO WHILE

b) Explain what needs to be done when the shared termination condition TERMINATED is computed. Briefly describe a strategy for implementing this. (2 marks)

c) Explain clearly what you expect to be the main source of parallel execution time overhead for this code. State your assumptions about the behaviour of each part of the algorithm, and make it clear what you expect to happen as the time to compute RESULT increases from being relatively short to being relatively long, compared with the rest of the necessary work. (5 marks)

d) A programmer on your team suggests the following change to the above pseudocode: instead of pushing both new subjobs onto the stack, push only one of them and then execute the other in the existing thread. Give new pseudocode (in the same style as above) that achieves this. What effect do you expect this change to have on the execution time overheads you identified earlier? (4 marks)

e) Discuss the difficulties that would need to be addressed if P stacks (as opposed to a single shared stack) were used in a P-fold parallel implementation of this algorithm. What effect do you expect such a change to have on the execution time overheads you identified earlier? (4 marks)
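For orientation, the control flow of the stack-driven algorithm in Question 2 can be sketched serially in Python. This is an illustration under stated assumptions, not a model answer: a job is assumed to be an index range `(lo, hi)`, the result of a small job is assumed to be the sum of the integers in that range, and `run_jobs` and `threshold` are hypothetical names. The serial termination test (an empty stack) is itself an assumption and does not carry over to the parallel pseudocode.

```python
def run_jobs(initial_job, threshold=4):
    """Serial stand-in for the stack-driven loop; a job is a (lo, hi) index range."""
    stack = [initial_job]
    output = 0
    while stack:                          # serial stand-in for .NOT. TERMINATED
        lo, hi = stack.pop()              # POP(TOP OF STACK INTO JOB)
        if hi - lo > threshold:           # JOB IS LARGE
            mid = (lo + hi) // 2          # CREATE 2 SUBJOBS, JOB1 & JOB2
            stack.append((lo, mid))       # PUSH(JOB1 ONTO STACK)
            stack.append((mid, hi))       # PUSH(JOB2 ONTO STACK)
        else:
            result = sum(range(lo, hi))   # COMPUTE RESULT (OF JOB)
            output += result              # ADD RESULT TO OUTPUT
    return output


total = run_jobs((0, 100))
```

Changing `threshold` alters how many jobs pass through the stack relative to the work done per job, which is the ratio that part c) asks about.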

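Question 3

a) Consider the first order linear recurrence

\[ x_1 = d_1, \qquad x_i = a_i x_{i-1} + d_i, \quad i = 2, 3, \ldots, \mathrm{NMAX}. \]

Show that, by iterating the recurrence twice, one can obtain the recurrence

\[ \begin{aligned}
x_1 &= d_1, \quad x_2 = a_2 x_1 + d_2, \quad x_3 = a_3 x_2 + d_3, \quad x_4 = a_4 x_3 + d_4, \\
x_i &= \hat{a}_i x_{i-4} + \hat{d}_i, \quad i = 5, 6, \ldots, \mathrm{NMAX},
\end{aligned} \]

and thereby expose 4-fold parallelism in this computation. You should clearly derive expressions for $\hat{a}_i$ and $\hat{d}_i$. (6 marks)

b) Consider now the tridiagonal system

\[ Ax = y, \tag{4.1} \]

where $A$ is the (symmetric) tridiagonal matrix

\[ A = \begin{pmatrix}
a_1 & b_2 &        &        &     \\
b_2 & a_2 & b_3    &        &     \\
    & b_3 & a_3    & \ddots &     \\
    &     & \ddots & \ddots & b_n \\
    &     &        & b_n    & a_n
\end{pmatrix}. \]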

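i) A cyclic reduction algorithm results from the following: using equations $i-1$, $i+1$ of (4.1) to eliminate $x_{i-1}$, $x_{i+1}$, respectively, from the $i$th equation of (4.1), show that the tridiagonal system (4.1) can be replaced by

\[ A^{(1)} x = y^{(1)}, \tag{4.2} \]

where

\[ A^{(1)} = \begin{pmatrix}
a_1^{(1)} & 0         & b_3^{(1)} &           &               &           \\
0         & a_2^{(1)} & 0         & b_4^{(1)} &               &           \\
b_3^{(1)} & 0         & a_3^{(1)} & 0         & \ddots        &           \\
          & b_4^{(1)} & \ddots    & \ddots    & \ddots        & b_n^{(1)} \\
          &           & \ddots    & 0         & a_{n-1}^{(1)} & 0         \\
          &           &           & b_n^{(1)} & 0             & a_n^{(1)}
\end{pmatrix}, \]

and obtain expressions for the elements of $A^{(1)}$ and $y^{(1)}$. (4 marks)

ii) In a similar way, show that equations $i-2$, $i+2$ of (4.2) can be used to eliminate $x_{i-2}$, $x_{i+2}$, respectively, from the $i$th equation of (4.2) to obtain $A^{(2)} x = y^{(2)}$, where the elements of $A^{(2)}$ and $y^{(2)}$ are suitably defined. (3 marks)

iii) Indicate how this procedure may be continued and show that $N = \log_2 n$ stages will be required to reduce the system of equations to diagonal form. (4 marks)

iv) "This cyclic reduction algorithm is uncompetitive on serial computers, but has become popular for implementation on parallel computers." Explain this statement. (3 marks)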

Question 4

A stellar system is to be modelled using a 3-dimensional, N-body, iterative time-stepping simulation of the effects of gravitational attraction (ignoring collisions). The gravitational force $F_i$ acting on star $s_i$ due to star $s_j$ ($j \neq i$) is given by:

\[ F_i = G \, m_i * m_j / r_{ij}^2, \]

where $G$ is a constant, $m_i$ is the mass of star $s_i$, and $r_{ij}$ is the distance between $s_i$ and $s_j$. Also, the acceleration of star $s_i$ under force $F_i$ is:

\[ a_i = F_i / m_i. \]

The overall nature of the simulation is described in the following pseudo-code, in which the type TRIPLE REAL ARRAY is an array of records of three REAL values. In array positions, the three values represent the x, y, z coordinates of the corresponding star during the current time-step. Similarly, array velocities represents the current u, v, w velocities of the corresponding star, in the x, y, z directions, respectively, and array forces represents the $F_x$, $F_y$, $F_z$ force components currently acting on the corresponding star, in the x, y, z directions, respectively. Subroutine parameters that are updated as a result of a call are underlined in the pseudo-code below; otherwise subroutine parameters are read-only.

    PROGRAM gravitational_n-body_calculation
    REAL ARRAY masses (1:10000)
    TRIPLE REAL ARRAY positions (1:10000)
    TRIPLE REAL ARRAY velocities (1:10000)
    TRIPLE REAL ARRAY forces (1:10000)
    INTEGER step
    REAL t, delta_t
    t=0.0
    READ(delta_t)
    C   delta_t, the time step size, is a program input
    CALL initialise (masses,positions,velocities)
    C   initialise places all 10000 stars in initial positions and gives them
    C   velocities and masses chosen at random from appropriate distributions
    C   repeat time stepping loop 1000000 times
    FOR step=1 TO 1000000 DO
      CALL calculate_forces (masses,positions,forces)
    C   calculate_forces determines the forces on each star on the basis
    C   of the current positions of the stars
      CALL move_stars (masses,forces,positions,velocities)
    C   move_stars calculates new positions and velocities, at time
    C   t+delta_t, for each of the stars, under the calculated forces
      t=t+delta_t
    C   update the time and repeat
    END DO
    END PROGRAM

Pseudo-code for a straightforward implementation of the subroutine calculate_forces is given below:

    SUBROUTINE calculate_forces(m,p,f)
    REAL ARRAY m(1:10000)
    TRIPLE REAL ARRAY p(1:10000)
    TRIPLE REAL ARRAY f(1:10000)
    INTEGER i, j
    FOR i=1 TO 10000 DO
      Zero_The_3_Components_Of_f(i)
      FOR j=1 TO 10000 DO
        IF j.ne.i THEN
          Calculate_the_3_force_Components_ &
            At_Star_s(i)_Due_To_Star_s(j)
          Add_The_Calculated_Components_Into_f(i)
        END IF
      END DO
    END DO
    END SUBROUTINE

a) Give pseudo-code for the subroutines initialise and move_stars. (6 marks)

b) Comment on the nature of potential parallel executions of the three main subroutines, initialise, calculate_forces and move_stars, stating any assumptions you make. Hence, suggest a general strategy for parallelising the whole program. (6 marks)

c) The given subroutine for calculate_forces is inefficient. Since gravity is a symmetrical force, the forces(i) components, acting on star $s_i$, that are due to star $s_j$ are equal in value, but opposite in sense, to the forces(j) components, acting on star $s_j$, that are due to star $s_i$. Hence, these values, which are computed twice during each cycle in the given code, could, in principle, be calculated just once per cycle. Give alternative pseudo-code for the subroutine calculate_forces that implements this optimisation. (4 marks)

d) Comment on the nature of potential parallel executions of your code for part c), and explain how you would attempt to organise any actual parallel execution so as to achieve high performance. (4 marks)

END OF EXAMINATION