COMP 633 - Parallel Computing
Lecture 8, September 14, 2017
SMM (3) OpenMP Case Study: The Barnes-Hut N-body Algorithm
Topics

Case study: the Barnes-Hut algorithm
  Study an important algorithm in scientific computing
    » n-body simulation with long-range forces
  Investigate parallelization and implementation in a shared memory multiprocessor
    » expression and management of parallelism
    » memory hierarchy tuning

N-body simulations: self-gravitating systems

The n-body simulation problem

Simulate the evolution of a system of n bodies over time
  Pairwise interaction of bodies
    » force f(i,j) on body i due to body j
    » total force f(i) on body i due to all bodies
    » acceleration of body i via f = ma
  Numerical integration of body velocities and positions
    » timestep $\Delta t$
Non-negligible long-range forces
  for uniformly distributed bodies in 3D, the total force due to all bodies at a given distance is constant
    » cannot ignore the contribution of distant bodies
Examples
  astrophysics (gravity)
  molecular dynamics (electrostatics)

Ex: Gravitation
  $r_{ij} = p_i - p_j$
  $f(i,j) = -G \frac{m_i m_j}{|r_{ij}|^2} \cdot \frac{r_{ij}}{|r_{ij}|}$
  $f(i) = \sum_{j \ne i} f(i,j)$

The basic simulation algorithm:
  while (t < t_final) do
    forall 1 ≤ i ≤ n do
      compute force f(i) on body i
    end
    update velocity and position of all bodies
    t = t + Δt
  end

Direct approach: O(n²) interactions per time-step
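
The direct approach can be sketched in a few lines of Python. This is only an illustration of the O(n²) loop structure above, not the course implementation: the names `direct_forces` and `step`, the SI value of G, and the simple Euler-style update are assumptions made for the example.

```python
import math

G = 6.674e-11  # gravitational constant in SI units (assumed for this sketch)

def direct_forces(pos, mass):
    """Total force on every body, summing all n(n-1)/2 pairwise interactions."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # r_ij = p_i - p_j points from body j to body i
            r = [pos[i][k] - pos[j][k] for k in range(3)]
            dist = math.sqrt(sum(c * c for c in r))
            scale = G * mass[i] * mass[j] / dist**3
            for k in range(3):
                forces[i][k] -= scale * r[k]  # body i is pulled toward body j
                forces[j][k] += scale * r[k]  # equal and opposite force on j
    return forces

def step(pos, vel, mass, dt):
    """Advance all bodies by one time step dt (velocity first, then position)."""
    forces = direct_forces(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += dt * forces[i][k] / mass[i]
            pos[i][k] += dt * vel[i][k]
```

Every pair is visited exactly once, so doubling n roughly quadruples the work per time step; this is the cost the rest of the lecture tries to avoid.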

Reducing the number of interactions

Exploit combined effect of distant bodies
  (figure: Earth at distance d from the Andromeda galaxy, which has center of mass c and total mass M)
Formally
  Monopole approximation of the force on the earth due to interaction with all masses in the Andromeda galaxy:
    $f(b_{earth}) \approx -G \frac{m_{earth} M}{d^3} (p_{earth} - c)$, where $d = |p_{earth} - c|$
Monopole approximation saves work if it can be reused with multiple bodies
  (figure: Vulcan, a second body at a similar distance d)
  apply this idea recursively:
    » determines control structure
    » requires hierarchical decomposition of space
Accuracy of approximation improves with
  increasing distance d
  decreasing size of the approximated region
  order of the approximation
    » monopole, dipole, quadrupole, ...
  uniformity of body distribution
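
A small numerical check makes the accuracy claim concrete. The sketch below compares the exact sum of pulls from a four-body cluster with the single pull from its center of mass. The cluster, the masses, G = 1, and all function names are made up for illustration.

```python
import math

def center_of_mass(pos, mass):
    """Total mass and center of mass of a set of bodies."""
    total = sum(mass)
    com = tuple(sum(m * p[k] for p, m in zip(pos, mass)) / total for k in range(3))
    return total, com

def pull(target, source, m, G=1.0):
    """Acceleration at `target` caused by a point mass m located at `source`."""
    r = [target[k] - source[k] for k in range(3)]
    d = math.sqrt(sum(c * c for c in r))
    return [-G * m * c / d**3 for c in r]

def monopole_relative_error(pos, mass, target):
    """Relative difference between the exact sum and the monopole approximation."""
    exact = [0.0, 0.0, 0.0]
    for p, m in zip(pos, mass):
        exact = [e + c for e, c in zip(exact, pull(target, p, m))]
    total, com = center_of_mass(pos, mass)
    approx = pull(target, com, total)
    err = math.sqrt(sum((e - a) ** 2 for e, a in zip(exact, approx)))
    return err / math.sqrt(sum(e * e for e in exact))

# Hypothetical cluster of extent ~1, observed from distance 5 and distance 100
cluster = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
masses = [1.0, 2.0, 3.0, 4.0]
near = monopole_relative_error(cluster, masses, (5, 0, 0))
far = monopole_relative_error(cluster, masses, (100, 0, 0))
```

The error at the far target is much smaller than at the near one, which is why a single interaction with (M, c) can stand in for many body-body interactions once d is large relative to the cluster.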

Hierarchical decomposition of space

(figures: a quadtree; an octree decomposition; an adaptive quadtree)

The Barnes-Hut algorithm

  stepsystem():   // P(i) is coordinates and mass of body i
    T := maketree(P(1:n))
    forall 1 ≤ i ≤ n do
      f(i) = gravcalc(P(i), T)
    update velocities and positions

  function gravcalc(body p, treenode q)
    if ( q is a leaf ) then
      return body-body interaction (p,q)
    else
      if ( p is distant enough from q ) then
        return body-cell interaction (p,q)
      else
        forall q' ∈ nonemptychildren(q) do
          accumulate gravcalc(p,q')
        return accumulated interaction
      end if
    end if

Interaction in the case of gravitation:
  $F_{pq} = -G \frac{m_p m_q}{r_{pq}^2} \left( \frac{x_p - x_q}{r_{pq}}, \frac{y_p - y_q}{r_{pq}}, \frac{z_p - z_q}{r_{pq}} \right)$
  $r_{pq} = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2 + (z_p - z_q)^2}$

body-body interaction: use masses of bodies and distance between them.
body-cell interaction: use mass of body and mass of cell and distance between body and center of mass of cell.
force is additive; individual contributions can be accumulated.
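
The recursion in gravcalc can be tried out on a tiny hand-built tree. The sketch below is 2D with G = 1; the `Node` class, the `size < theta * d` acceptance test, and the zero-distance check used to skip self-interaction are assumptions for the example, not taken from the slides.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    mass: float
    com: tuple                      # center of mass (x, y); a body's own position at a leaf
    size: float = 0.0               # side length of the cell (unused for leaves)
    children: list = field(default_factory=list)  # empty list means leaf

def pair_force(p_pos, p_mass, q_pos, q_mass, G=1.0):
    """Force on p due to a point mass q; zero if the two positions coincide."""
    dx, dy = p_pos[0] - q_pos[0], p_pos[1] - q_pos[1]
    r = math.hypot(dx, dy)
    if r == 0.0:                    # assumed convention: skip self-interaction
        return (0.0, 0.0)
    s = -G * p_mass * q_mass / r**3
    return (s * dx, s * dy)

def gravcalc(p_pos, p_mass, q, theta=0.7):
    if not q.children:                          # leaf: body-body interaction
        return pair_force(p_pos, p_mass, q.com, q.mass)
    d = math.hypot(p_pos[0] - q.com[0], p_pos[1] - q.com[1])
    if q.size < theta * d:                      # distant enough: body-cell interaction
        return pair_force(p_pos, p_mass, q.com, q.mass)
    fx = fy = 0.0
    for child in q.children:                    # otherwise descend and accumulate
        cx, cy = gravcalc(p_pos, p_mass, child, theta)
        fx, fy = fx + cx, fy + cy
    return (fx, fy)

# One cell of side 1 holding two unit-mass bodies at (0,0) and (1,0)
root = Node(mass=2.0, com=(0.5, 0.0), size=1.0,
            children=[Node(1.0, (0.0, 0.0)), Node(1.0, (1.0, 0.0))])
approx = gravcalc((10.0, 0.0), 1.0, root, theta=0.7)  # one body-cell interaction
exact = gravcalc((10.0, 0.0), 1.0, root, theta=0.0)   # theta = 0 forces full descent
```

With theta = 0 no cell is ever accepted, so the walk degenerates to the direct sum; this gives a convenient reference for checking the approximation.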

The Barnes-Hut algorithm - Performance issues

  stepsystem(P(1:n))   -- P(1:n) is sequence of bodies
    T := maketree(P(1:n))
    forall 1 ≤ i ≤ n do
      f(i) := gravcalc(P(i), T)
    update velocities and positions

  function gravcalc(p,q)
    if ( q is a leaf ) then
      return body-body interaction
    else
      if ( p is distant enough from q ) then
        return body-cell interaction
      else
        forall q' ∈ nonemptychildren(q) do
          accumulate gravcalc(p,q')
        return accumulated interaction
      end if
    end if

Parallelism
  nested parallelism
    » over bodies
    » over recursively divided cells
  load balance
    » different number of interactions for different bodies
Locality
  nearby bodies interact with a similar set of nodes in the tree

Constructing the tree

Small fraction f of the total work
  but sequential tree construction can limit overall speedup
    » Amdahl's law: $S_P < 1/f$

  function maketree( P(1:n) )
    for i := 1 to n do
      T := insert(P(i), T)
    compute monopole approximation at each node

  function insert(p, T)
    if empty(T) then
      return p as singleton tree
    else
      determine child S of T in which p belongs
      S' := insert(p, S)
      return T with S replaced by S'
    endif

Computing monopole approximation for each cell
  Post-order traversal of tree
    » At leaves, monopole coincides with single body
    » At interior nodes, monopole is weighted sum of all children's monopoles
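
The two phases of maketree (insertion, then a post-order pass for the monopoles) can be sketched for a 2D quadtree as follows. The `Cell` class, the unit-square root cell, and the (x, y, m) body tuples are assumptions for this illustration; the sketch also assumes at least one body and that all positions are distinct (coincident bodies would subdivide forever).

```python
class Cell:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # geometric center and half side length
        self.body = None       # (x, y, m) while the cell is a leaf holding one body
        self.children = None   # list of 4 sub-cells (None where empty) once subdivided
        self.mass = 0.0
        self.com = (0.0, 0.0)

    def insert(self, body):
        if self.children is None and self.body is None:
            self.body = body                 # empty cell becomes a singleton leaf
            return
        if self.children is None:            # occupied leaf: subdivide, push old body down
            old, self.body = self.body, None
            self.children = [None] * 4
            self._push(old)
        self._push(body)

    def _push(self, body):
        # quadrant index: bit 0 set if right half, bit 1 set if upper half
        q = (1 if body[0] >= self.cx else 0) + (2 if body[1] >= self.cy else 0)
        if self.children[q] is None:
            h = self.half / 2
            self.children[q] = Cell(self.cx + (h if q & 1 else -h),
                                    self.cy + (h if q & 2 else -h), h)
        self.children[q].insert(body)

def compute_monopoles(cell):
    """Post-order traversal: children first, then the mass-weighted sum at the parent."""
    if cell.children is None:
        x, y, m = cell.body
        cell.mass, cell.com = m, (x, y)
        return
    mx = my = 0.0
    cell.mass = 0.0
    for c in cell.children:
        if c is not None:
            compute_monopoles(c)
            cell.mass += c.mass
            mx += c.mass * c.com[0]
            my += c.mass * c.com[1]
    cell.com = (mx / cell.mass, my / cell.mass)

def maketree(bodies, cx=0.5, cy=0.5, half=0.5):
    root = Cell(cx, cy, half)
    for b in bodies:
        root.insert(b)
    compute_monopoles(root)
    return root
```

Each insertion walks from the root down to a leaf, so its cost grows with the depth at which the body ends up; clustered inputs therefore build deeper, more expensive trees.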
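
The acceptance criterion

When is a cell distant enough?
  original criterion used by Barnes-Hut:
    $\ell / d < \theta$, where usually $0.7 \le \theta \le 1.0$
    ($\ell$ is the side length of the cell, d the distance from the body to the cell's center of mass)
    (figure: Earth at distance d from the center of mass of Andromeda, a cell of size $\ell$)
  problem: "detonating galaxy" anomaly
    (figure: primary galaxy and secondary galaxy; the cell's center of mass lies near one corner of the cell)
    a body near the opposite corner has $d \approx \sqrt{2}\,\ell$ (2D) or $d \approx \sqrt{3}\,\ell$ (3D)
    so in 2D $\ell / d \approx 1/\sqrt{2} \approx 0.7$, and the cell can be accepted although the body lies inside it
  (one) solution:
    add the distance between the center of mass (cm) and the geometric center of the cell (c)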
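
The difference between the two criteria shows up in a single test case. The sketch below assumes the modified test has the form d > ℓ/θ + δ with δ = |cm − c|; the slide only says that this distance is added, so the exact formula, like the function names, is an assumption.

```python
import math

def accept_original(cell_size, body, com, theta=0.7):
    """Original test: accept the cell if l / d < theta."""
    d = math.dist(body, com)
    return cell_size < theta * d

def accept_with_offset(cell_size, body, com, center, theta=0.7):
    """Assumed modified test: d must also exceed the cm-to-center offset delta."""
    d = math.dist(body, com)
    delta = math.dist(com, center)
    return d > cell_size / theta + delta

# Unit cell centered at (0.5, 0.5) whose mass sits almost entirely in one corner
com, center = (0.01, 0.01), (0.5, 0.5)
inside = (0.99, 0.99)    # a body in the opposite corner of the same cell
distant = (10.0, 10.0)   # a body far away from the cell
```

With θ = 0.8 the original test accepts the cell for the body at `inside`, even though that body lies within the cell, while the offset version rejects it; both accept the cell for the body at `distant`.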

Effects of acceptance criterion on runtime

(figure) Source: L. Hernquist. Performance characteristics of tree codes. Astrophysical Journal Supplement Series, Vol. 64, Pages 715-734, 1987.

Effects of acceptance criterion on accuracy

(figure) Source: L. Hernquist. Performance characteristics of tree codes. Astrophysical Journal Supplement Series, Vol. 64, Pages 715-734, 1987.

1% accuracy is sufficient for most astrophysical simulations.
Different techniques with better error control are necessary for other systems (fast multipole methods).

Effect of body distribution on total work

(figures: uniform distribution; Plummer distribution)

For fixed n
  uniform distributions generate high interaction work (shallow trees)
  non-uniform distributions generate higher tree construction work and lower interaction work

Complexity of Barnes-Hut

Tree building
  cost of tree construction depends on particle distribution
    » cost of body insertion ∝ distance to root
    » for a uniform distribution of n particles, sequential construction of the tree is O(n log n) time
  In a simulation, the tree could be maintained rather than reconstructed each time step
Force calculation (uniform distribution of bodies in 2D)
  consider computing the force acting on a body in the lower right corner
  if θ = 1.0, the 3 undivided top-level squares will satisfy the acceptance criterion
  the remaining square does not satisfy the criterion, hence we descend into the next level
  each level of the tree incurs a constant amount of work while descending along the path to the lower right corner
  for a uniform distribution of n bodies, the length of the path is $O(\log_4 n)$
  computing the forces on n bodies is O(n log n) work
  non-uniform distributions are more difficult to analyze
Accuracy and complexity are difficult to control

Implementation issues - parallelization

parallelization of the force computation loop:

  SUBROUTINE stepsystem()
    CALL maketree()
  !$OMP PARALLEL DO SCHEDULE(GUIDED,4)
    DO i = 1, n
      CALL gravcalc(i,root)
    END DO
  !$OMP END PARALLEL DO
  !$OMP PARALLEL DO
    integrate velocities and positions
  !$OMP END PARALLEL DO
  END SUBROUTINE stepsystem

observations:
  force computation scales reasonably up to 16 processors
  dynamic scheduling important
  single processor performance not impressive

Results on O2000 (evans) for 1M particles (times in sec)

  Processors                 1         2         4         8        16
  tree construction     25.759    27.444    29.028    24.334    26.066
  force computation   1568.854   809.294   416.174   196.997   120.664
  speedup                 1.00      1.94      3.77      7.96     13.00
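
Why dynamic scheduling matters can be seen in a toy model of the loop. The sketch below is a simplified simulation, not OpenMP's GUIDED schedule: it compares a static block partition with handing out one iteration at a time to whichever worker is free first. The per-body costs are invented, and the function names are hypothetical.

```python
import heapq

def static_makespan(costs, workers):
    """Each worker gets one contiguous block of iterations, fixed in advance."""
    block = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[w * block:(w + 1) * block]) for w in range(workers))

def dynamic_makespan(costs, workers):
    """Each iteration goes to the worker that becomes idle first."""
    loads = [(0.0, w) for w in range(workers)]
    heapq.heapify(loads)
    for c in costs:
        load, w = heapq.heappop(loads)       # least-loaded worker so far
        heapq.heappush(loads, (load + c, w))
    return max(load for load, _ in loads)

# Invented costs: later bodies need more interactions than earlier ones
costs = list(range(1, 17))
static_time = static_makespan(costs, 4)
dynamic_time = dynamic_makespan(costs, 4)
```

In this example the static partition leaves one worker with the 58 most expensive units of work while the others sit idle, whereas the dynamic assignment finishes at 40, close to the ideal 136 / 4 = 34.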

Implementation issues - tuning of gravcalc (1)

performance analysis of gravcalc shows
  poor cache reuse (90% L1 and 88% L2)
  poor use of floating point units
  poor reuse of subexpressions
  compiler can't generate good code?
manual tuning of gravcalc
  inline computation of acceptance criterion
  inline computation of interaction
  reuse distance vector (body-cell)
  fuse loops
significant performance improvement!

  RECURSIVE SUBROUTINE gravcalc(p,q)
    IF ( q is a body ) THEN
      compute body-body interaction; accumulate
    ELSE
      IF ( p is distant enough from q ) THEN
        compute body-cell interaction; accumulate
      ELSE
        DO q' ∈ nonemptychildren(q)
          CALL gravcalc(p,q')
        END DO
      END IF
    END IF
  END SUBROUTINE gravcalc

observations:
  2.5 times faster
  good scaling
  better use of FPUs and better prediction
  cache reuse (93% L1 and 94% L2) still bad

Results on O2000 (evans) for 1M particles (times in sec)

  Processors                1         2         4         8        16
  tree construction    19.066    17.878    19.527    15.323    13.686
  force computation   639.961   315.785   164.764    79.049    44.678
  speedup                1.00      2.03      3.88      8.10     14.32

Implementation issues - tuning of gravcalc (2a)

how can we improve cache reuse?
  neighboring bodies in space will most likely interact with the same cells and bodies!
  sort bodies according to some spatial order:
    » precompute a spatial order such as Morton order or Peano-Hilbert order
    » or simply order bodies as they are encountered during a depth-first treewalk of T
Sorted bodies may also speed up subsequent tree rebuilding

(figures: Morton order; Peano-Hilbert order; tree order)
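
Morton order is obtained by interleaving the bits of the integer grid coordinates, which is easy to sketch. The functions below are a 2D illustration under assumed conventions (x in the even bit positions, y in the odd ones, coordinates normalized to [0, 1)); the names are hypothetical.

```python
def morton2d(ix, iy, bits=16):
    """Interleave the low `bits` bits of ix and iy into one Morton key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)       # x bits go to even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return key

def sort_bodies_morton(bodies, bits=16):
    """Sort (x, y) positions in [0, 1) x [0, 1) along the Morton curve."""
    scale = 1 << bits
    return sorted(bodies,
                  key=lambda p: morton2d(int(p[0] * scale), int(p[1] * scale), bits))

ordered = sort_bodies_morton([(0.9, 0.9), (0.1, 0.1), (0.9, 0.1), (0.1, 0.9)])
```

Bodies that share a quadtree cell share the leading bits of their keys, so after sorting they sit next to each other in memory and consecutive loop iterations touch largely the same tree nodes.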

Implementation issues - tuning of gravcalc (2b)

  stepsystem(P(1:n))
    T := maketree(P(1:n))
    re-order P(1:n) according to T
    forall 1 ≤ i ≤ n do
      f(i) := gravcalc(P(i), T)
    update velocities and positions

observations:
  30-40% increase in performance
  very good scaling
  L2 reuse now up at 99.8%
  L1 still at 93%

Results on O2000 (evans) for 1M particles (times in sec)

  Processors                1         2         4         8        16
  tree construction    19.161     14.51    18.524    18.564    19.873
  force computation   495.355    247.89   125.225    62.741    31.281
  speedup                1.00      2.00      3.96      7.90     15.84

Implementation issues - tuning of gravcalc (3)

How can we improve L1 reuse?
  interact a group of bodies with a cell or body!
  walk the tree and compute forces for a set of neighboring bodies

  RECURSIVE SUBROUTINE gravcalc(set P, node q)
    IF ( q is a body ) THEN
      DO p ∈ P
        compute body-body interaction; accumulate
      END DO
    ELSE
      P' = ∅
      DO p ∈ P
        IF ( p is distant enough from q ) THEN
          compute body-cell interaction; accumulate
        ELSE
          P' = P' ∪ {p}
        END IF
      END DO
      IF ( P' .NE. ∅ ) THEN
        DO q' ∈ nonemptychildren(q)
          CALL gravcalc(P',q')
        END DO
      END IF
    END IF
  END SUBROUTINE gravcalc

observations:
  20-40% increase in performance
  L1 reuse now at 99.7% (32 bodies per group)
  L2 down slightly at 96%
  ordered particles essential

Results on O2000 (evans) for 1M particles (times in sec)

  Processors                1         2         4         8        16
  tree construction    20.041    19.471    19.824    18.605    13.716
  force computation   421.391   205.309   104.438    51.828    25.805
  speedup                1.00      2.05      4.03      8.13     16.33

Implementation issues - tuning of gravcalc (4)

Another technique to improve L1 reuse
  allow leaf-cells to contain more than 1 body
  compute the body-body interactions in a doubly nested loop

  RECURSIVE SUBROUTINE gravcalc(set P, node q)
    P' = ∅
    DO p ∈ P
      IF ( p is distant enough from q ) THEN
        compute body-cell interaction; accumulate
      ELSE
        IF ( q is a leaf ) THEN
          DO p ∈ P, q' ∈ q
            compute body-body interaction; accumulate
          END DO
        ELSE
          P' = P' ∪ {p}
        END IF
      END IF
    END DO
    IF ( P' .NE. ∅ ) THEN
      DO q' ∈ nonemptychildren(q)
        CALL gravcalc(P',q')
      END DO
    END IF
  END SUBROUTINE gravcalc

observations:
  10% increase in performance
  this algorithm will perform strictly more work than the previous versions!
    » more particles per leaf potentially causes more body-body interactions and fewer body-cell interactions to be computed

Results on O2000 (evans) for 1M particles (times in sec)

  Processors                1         2         4         8        16
  tree construction    13.179    12.494    13.362    12.682     9.536
  force computation   378.345   189.231    94.996    47.866    23.809
  speedup                1.00      2.00      3.98      7.90     15.89

Implementation issues - summary

Shared memory model
  enables relatively simple parallelization of the basic algorithm using OpenMP
  shared memory model is critical in dynamic load balancing
Performance tuning
  overall, these optimizations lead to 4-5 times faster single-processor performance
  linear or superlinear parallel speedup to 16 processors
  optimizing serial performance is essential for obtaining good parallel performance
  the last two optimizations are instances of exposing parallelism to improve serial performance
Observations
  the better the performance of gravcalc, the more seriously the serial tree construction affects the overall speedup
    » when maketree time is included in the speedup:
      speedup drops from 13.00 to 10.8 for p = 16 in the first version
      speedup drops from 15.89 to 11.74 for p = 16 in the last version
  parallel tree construction algorithms!
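
The overall speedups quoted above follow directly from the timing tables: add the serial tree construction time to the force computation time before taking the ratio. A short check using the table values for p = 1 and p = 16 (the helper name `overall_speedup` is made up for this example):

```python
def overall_speedup(tree_1, force_1, tree_p, force_p):
    """Speedup of a whole time step when tree construction time is included."""
    return (tree_1 + force_1) / (tree_p + force_p)

# First version (parallelization slide): p = 1 versus p = 16
first = overall_speedup(25.759, 1568.854, 26.066, 120.664)

# Last version (tuning of gravcalc (4)): p = 1 versus p = 16
last = overall_speedup(13.179, 378.345, 9.536, 23.809)

print(round(first, 2), round(last, 2))  # prints 10.87 11.74
```

In the last version the tree construction accounts for 9.536 of the 33.345 seconds at p = 16, so speeding up the force computation any further helps less and less; this is the Amdahl's law effect from the tree construction slide.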