COMP Parallel Computing SMM (3) OpenMP Case Study: The Barnes-Hut N-body Algorithm


COMP 633 - Parallel Computing, Lecture 8, September 14, 2017. SMM (3) OpenMP Case Study: The Barnes-Hut N-body Algorithm

Topics
- Case study: the Barnes-Hut algorithm
- Study an important algorithm in scientific computing
  » n-body simulation with long-range forces
- Investigate parallelization and implementation on a shared-memory multiprocessor
  » expression and management of parallelism
  » memory hierarchy tuning

N-body simulations: self-gravitating systems

The n-body simulation problem
- Simulate the evolution of a system of n bodies over time
- Pairwise interaction of bodies
  » force f(i,j) on body i due to body j
  » total force f(i) on body i due to all bodies
  » acceleration of body i via f = ma
- Numerical integration of body velocities and positions
  » timestep Δt
- Non-negligible long-range forces
  » for uniformly distributed bodies in 3D, the total force due to all bodies at a given distance is constant
  » cannot ignore the contribution of distant bodies
- Examples
  » astrophysics (gravity)
  » molecular dynamics (electrostatics)
- Ex: Gravitation, with body positions $p_i$ and masses $m_i$
    $r_{ij} = \lVert p_i - p_j \rVert$
    $f(i,j) = G \, \frac{m_i m_j}{r_{ij}^2} \cdot \frac{p_j - p_i}{r_{ij}}$
    $f(i) = \sum_{j \ne i} f(i,j)$
- The basic simulation algorithm:
    while (t < t_final) do
      forall 1 ≤ i ≤ n do
        compute force f(i) on body i
      end
      update velocity and position of all bodies
      t = t + Δt
    end
- Direct approach: O(n²) interactions per time-step
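The direct approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's code: the function names `direct_forces` and `step`, the list-of-lists data layout, and the first-order Euler update are all choices made for this example.

```python
import math

G = 6.674e-11  # gravitational constant in N m^2 kg^-2

def direct_forces(pos, mass):
    """Total force on each body, summing all n*(n-1) pairwise terms."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # vector from body i to body j, and its length r_ij
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            s = G * mass[i] * mass[j] / r**3
            for k in range(3):
                forces[i][k] += s * d[k]
    return forces

def step(pos, vel, mass, dt):
    """Advance all bodies by one timestep (simple Euler update, an assumption)."""
    f = direct_forces(pos, mass)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += dt * f[i][k] / mass[i]   # a = f / m
            pos[i][k] += dt * vel[i][k]
```

The doubly nested loop in `direct_forces` is the O(n²) cost that the rest of the lecture works to avoid.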

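Reducing the number of interactions
- Exploit the combined effect of distant bodies
  [Figure: Earth at distance d from the Andromeda galaxy, which has center of mass c and total mass M]
- Formally: monopole approximation of the force on the Earth due to interaction with all masses in the Andromeda galaxy
    $f(b_{earth}) \approx G \, m_{earth} \, M \, \frac{c - p_{earth}}{d^3}$, with $d = \lVert p_{earth} - c \rVert$
- The monopole approximation saves work if it can be reused with multiple bodies
  [Figure: a second body, Vulcan, at distance d from the same galaxy]
- Apply this idea recursively
  » determines the control structure
  » requires a hierarchical decomposition of space
- Accuracy of the approximation improves with
  » increasing d
  » decreasing r (the extent of the approximated cluster)
  » order of the approximation: monopole, dipole, quadrupole, ...
  » uniformity of the body distribution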
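A small numerical check of the claim that accuracy improves with increasing d. This is a sketch, not part of the lecture: the three-body cluster, the unit gravitational constant, and all function names are assumptions chosen for illustration.

```python
import math

def center_of_mass(cluster):
    """cluster: list of (x, y, z, m). Returns (cx, cy, cz, M)."""
    M = sum(b[3] for b in cluster)
    c = [sum(b[k] * b[3] for b in cluster) / M for k in range(3)]
    return c[0], c[1], c[2], M

def exact_force(p, m, cluster, G=1.0):
    """Force on a body of mass m at p, summing over every cluster member."""
    f = [0.0, 0.0, 0.0]
    for x, y, z, mj in cluster:
        d = (x - p[0], y - p[1], z - p[2])
        r = math.sqrt(d[0]**2 + d[1]**2 + d[2]**2)
        for k in range(3):
            f[k] += G * m * mj * d[k] / r**3
    return f

def monopole_force(p, m, cluster, G=1.0):
    """Same force, with the cluster replaced by total mass M at its center of mass c."""
    cx, cy, cz, M = center_of_mass(cluster)
    d = (cx - p[0], cy - p[1], cz - p[2])
    r = math.sqrt(d[0]**2 + d[1]**2 + d[2]**2)
    return [G * m * M * d[k] / r**3 for k in range(3)]

def relative_error(p, m, cluster):
    fe = exact_force(p, m, cluster)
    fm = monopole_force(p, m, cluster)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(fe, fm)))
    return diff / math.sqrt(sum(a * a for a in fe))

# Hypothetical cluster: three bodies of unequal mass near the origin
cluster = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 3.0)]
cx, cy, cz, _ = center_of_mass(cluster)
err_near = relative_error((cx + 5.0, cy, cz), 1.0, cluster)
err_far = relative_error((cx + 50.0, cy, cz), 1.0, cluster)
```

With this cluster, `err_far` comes out smaller than `err_near`, in line with the "increasing d" item above.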

Hierarchical decomposition of space
[Figures: a quadtree; an octree decomposition; an adaptive quadtree]

The Barnes-Hut algorithm

stepsystem():   // P(i) is coordinates and mass of body i
  T := maketree(P(1:n))
  forall 1 ≤ i ≤ n do
    f(i) = gravcalc(P(i), T)
  update velocities and positions

function gravcalc(body p, treenode q)
  if ( q is a leaf ) then
    return body-body interaction (p,q)
  else
    if ( p is distant enough from q ) then
      return body-cell interaction (p,q)
    else
      forall q' ∈ nonemptychildren(q) do
        accumulate gravcalc(p,q')
      return accumulated interaction
    end if
  end if

Interaction in the case of gravitation (force on p due to q):
    $F_{pq} = G \, \frac{m_p m_q}{r_{pq}^2} \left( \frac{x_q - x_p}{r_{pq}}, \; \frac{y_q - y_p}{r_{pq}}, \; \frac{z_q - z_p}{r_{pq}} \right)$
    $r_{pq} = \sqrt{(x_p - x_q)^2 + (y_p - y_q)^2 + (z_p - z_q)^2}$
- body-body interaction: use the masses of the bodies and the distance between them.
- body-cell interaction: use the mass of the body, the mass of the cell, and the distance between the body and the center of mass of the cell.
- force is additive; individual contributions can be accumulated.
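The gravcalc recursion can be sketched in runnable form as below. This is an illustrative 2D version under several assumptions that are not in the slides: bodies are `(x, y, m)` tuples with distinct positions and positive masses, the tree is built by recursive partitioning rather than by repeated insertion, G defaults to 1, and "distant enough" is taken to mean cell size < theta × distance to the cell's center of mass.

```python
import math

class Node:
    """Quadtree cell over bodies (x, y, m); stores its monopole (mass, center of mass)."""
    def __init__(self, cx, cy, half, bodies):
        self.size = 2.0 * half
        self.mass = sum(m for _, _, m in bodies)
        self.comx = sum(x * m for x, _, m in bodies) / self.mass
        self.comy = sum(y * m for _, y, m in bodies) / self.mass
        self.body = bodies[0] if len(bodies) == 1 else None  # a leaf holds one body
        self.children = []
        if self.body is None:
            h = half / 2.0
            for sx in (-1, 1):
                for sy in (-1, 1):
                    sub = [b for b in bodies
                           if (b[0] >= cx) == (sx > 0) and (b[1] >= cy) == (sy > 0)]
                    if sub:  # keep only non-empty children
                        self.children.append(Node(cx + sx * h, cy + sy * h, h, sub))

def pair_force(p, x, y, m, G=1.0):
    """Force on body p from a point mass m located at (x, y)."""
    dx, dy = x - p[0], y - p[1]
    r = math.hypot(dx, dy)
    s = G * p[2] * m / r**3
    return s * dx, s * dy

def gravcalc(p, q, theta=0.7):
    if q.body is not None:                      # q is a leaf
        if q.body is p:
            return 0.0, 0.0                     # no self-interaction
        return pair_force(p, *q.body)           # body-body interaction
    d = math.hypot(q.comx - p[0], q.comy - p[1])
    if q.size < theta * d:                      # p is distant enough from q
        return pair_force(p, q.comx, q.comy, q.mass)   # body-cell interaction
    fx = fy = 0.0
    for child in q.children:                    # otherwise descend and accumulate
        cfx, cfy = gravcalc(p, child, theta)
        fx += cfx
        fy += cfy
    return fx, fy
```

With `theta = 0` no cell is ever accepted, so the recursion degenerates to the direct sum; this makes a convenient correctness check.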

The Barnes-Hut algorithm - Performance issues

stepsystem(P(1:n))   -- P(1:n) is sequence of bodies
  T := maketree(P(1:n))
  forall 1 ≤ i ≤ n do
    f(i) := gravcalc(P(i), T)
  update velocities and positions

function gravcalc(p, q)
  if ( q is a leaf ) then
    return body-body interaction
  else
    if ( p is distant enough from q ) then
      return body-cell interaction
    else
      forall q' ∈ nonemptychildren(q) do
        accumulate gravcalc(p,q')
      return accumulated interaction
    end if
  end if

- Parallelism
  » nested parallelism: over bodies, and over recursively divided cells
  » load balance: different number of interactions for different bodies
- Locality
  » nearby bodies interact with a similar set of nodes in the tree

Constructing the tree
- Small fraction f of the total work
  » but sequential tree construction can limit overall speedup
  » Amdahl's law: $S_P < 1/f$

function maketree( P(1:n) )
  for i := 1 to n do
    T := insert(P(i), T)
  compute monopole approximation at each node

function insert(p, T)
  if empty(T) then
    return p as singleton tree
  else
    determine child S of T in which p belongs
    S' := insert(p, S)
    return T with S replaced by S'
  endif

- Computing the monopole approximation for each cell
  » post-order traversal of the tree
  » at leaves, the monopole coincides with the single body
  » at interior nodes, the monopole is the weighted sum of all children's monopoles
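A possible Python rendering of maketree, insert, and the post-order monopole pass, for a 2D quadtree. It is a sketch under assumptions: the `Cell` class, the in-place (mutating) insertion, and the `(x, y, m)` body tuples are choices made here, and the code spells out one case the pseudocode leaves implicit, namely pushing an existing body down when a new body lands in an occupied leaf. Bodies are assumed to have distinct positions.

```python
class Cell:
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.children = [None] * 4   # one slot per quadrant
        self.body = None             # set only while the cell is a one-body leaf
        self.mass = 0.0
        self.com = (0.0, 0.0)

def _insert_child(cell, b):
    # determine the child quadrant in which b belongs, creating it on demand
    q = (1 if b[0] >= cell.cx else 0) + (2 if b[1] >= cell.cy else 0)
    if cell.children[q] is None:
        h = cell.half / 2.0
        cell.children[q] = Cell(cell.cx + (h if q & 1 else -h),
                                cell.cy + (h if q & 2 else -h), h)
    insert(cell.children[q], b)

def insert(cell, b):
    if cell.body is None and not any(cell.children):
        cell.body = b                # empty cell becomes a singleton leaf
        return
    if cell.body is not None:        # occupied leaf: push the old body down first
        old, cell.body = cell.body, None
        _insert_child(cell, old)
    _insert_child(cell, b)

def compute_monopoles(cell):
    """Post-order traversal: children first, then the mass-weighted sum at the parent."""
    if cell.body is not None:
        x, y, m = cell.body
        cell.mass, cell.com = m, (x, y)
        return
    mass = mx = my = 0.0
    for c in cell.children:
        if c is not None:
            compute_monopoles(c)
            mass += c.mass
            mx += c.mass * c.com[0]
            my += c.mass * c.com[1]
    if mass > 0.0:
        cell.mass, cell.com = mass, (mx / mass, my / mass)

def maketree(bodies, cx=0.5, cy=0.5, half=0.5):
    root = Cell(cx, cy, half)
    for b in bodies:
        insert(root, b)
    compute_monopoles(root)
    return root
```

The cost of each insertion is proportional to the depth at which the body ends up, which is why closely spaced bodies make construction more expensive.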

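The acceptance criterion
- When is a cell "distant enough"?
- Original criterion used by Barnes-Hut:
    $\ell / d < \theta$, where usually $0.7 \le \theta \le 1.0$
  (ℓ is the side length of the cell and d is the distance from the body to the cell's center of mass)
  [Figure: Earth at distance d from the center of mass of the Andromeda cell]
- Problem: the "detonating galaxy" anomaly
  [Figure: primary and secondary galaxy, with the center of mass far from the geometric center of the cell; annotations include d̃, √2 (2D), √3 (3D), and 0.7]
- (One) solution: add the distance δ between the center of mass (cm) and the geometric center of the cell (c)
    $d > \ell / \theta + \delta$, with $\delta = \lVert cm - c \rVert$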
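The two criteria can be compared on a small 2D example. This sketch assumes the readings ℓ/d < θ for the original test and d > ℓ/θ + δ for the modified one, and assumes θ > 0; the function names and the sample coordinates are hypothetical.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def accept_original(size, body, com, theta):
    """Cell is accepted when size / d < theta, d measured to the center of mass."""
    return size < theta * dist(body, com)

def accept_modified(size, body, com, center, theta):
    """Also require d to exceed the offset between center of mass and cell center."""
    delta = dist(com, center)
    return dist(body, com) > size / theta + delta

# Unit cell whose center of mass sits near one corner, and a body
# just outside the opposite corner (so the body is adjacent to the cell).
size, center, com = 1.0, (0.5, 0.5), (0.95, 0.95)
near_body = (-0.1, -0.1)
far_body = (10.0, 10.0)
```

With θ = 1.0, `accept_original` accepts the cell for `near_body` even though the body touches the cell, while `accept_modified` rejects it; both accept it for `far_body`.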

Effects of acceptance criterion on runtime
Source: L. Hernquist. Performance characteristics of tree codes. Astrophysical Journal Supplement Series, Vol. 64, Pages 715-734, 1987.

Effects of acceptance criterion on accuracy
Source: L. Hernquist. Performance characteristics of tree codes. Astrophysical Journal Supplement Series, Vol. 64, Pages 715-734, 1987.
- 1% accuracy is sufficient for most astrophysical simulations.
- Different techniques with better error control are necessary for other systems (fast multipole methods).

Effect of body distribution on total work
[Figures: uniform distribution; Plummer distribution]
- For fixed n
  » uniform distributions generate high interaction work (shallow trees)
  » non-uniform distributions generate higher tree construction work and lower interaction work

Complexity of Barnes-Hut
- Tree building
  » cost of tree construction depends on the particle distribution
  » cost of body insertion ∝ distance to root
  » for a uniform distribution of n particles, sequential construction of the tree is O(n log n) time
  » in a simulation, the tree could be maintained rather than reconstructed each time step
- Force calculation (uniform distribution of bodies in 2D)
  » consider computing the force acting on a body in the lower right corner
  » if θ = 1.0, the 3 undivided top-level squares will satisfy the acceptance criterion
  » the remaining square does not satisfy the criterion, hence we descend into the next level
  » each level of the tree incurs a constant amount of work while descending along the path to the lower right corner
  » for a uniform distribution of n bodies, the length of the path is O(log_4 n)
  » computing the forces on n bodies is O(n log n) work
  » non-uniform distributions are more difficult to analyze
- Accuracy and complexity are difficult to control
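A back-of-the-envelope version of the corner-body argument, as a sketch. The constant of 3 accepted cells per level is taken from the θ = 1.0 example above; treating it as exact for every level, and the function name, are simplifying assumptions.

```python
import math

def corner_body_estimate(n, cells_per_level=3):
    """Rough interaction count for one corner body in a uniform 2D distribution."""
    levels = math.log(n) / math.log(4)   # path length to the leaf, log_4 n
    return cells_per_level * levels

n = 4 ** 10                              # about one million bodies
per_body = corner_body_estimate(n)       # a few dozen interactions
direct = n - 1                           # direct sum needs n - 1 per body
```

For about one million bodies the estimate is on the order of tens of interactions per body, against roughly a million for the direct sum, which is where the O(n log n) versus O(n²) gap comes from.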

Implementation issues - parallelization
- parallelization of the force computation loop:

    SUBROUTINE stepsystem()
      CALL maketree()
    !$OMP PARALLEL DO SCHEDULE(GUIDED,4)
      DO i = 1, n
        CALL gravcalc(i, root)
      END DO
    !$OMP END PARALLEL DO
    !$OMP PARALLEL DO
      integrate velocities and positions
    !$OMP END PARALLEL DO
    END SUBROUTINE stepsystem

- observations:
  » force computation scales reasonably up to 16 processors
  » dynamic scheduling is important
  » single-processor performance is not impressive

Results on O2000 (evans) for 1M particles (times in sec)
[Chart: time in sec vs. number of processors]

Processors                     1          2          4          8         16
tree construction         25.759     27.444     29.028     24.334     26.066
force computation       1568.854    809.294    416.174    196.997    120.664
speedup (force comp.)       1.00       1.94       3.77       7.96      13.00
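Why the schedule matters can be illustrated with a small simulation of loop scheduling, sketched in Python rather than OpenMP. Everything here is an assumption made for illustration: the per-iteration costs are invented, and the guided rule (next chunk = ceil(remaining / workers), never below a minimum chunk) is only an approximation of what an OpenMP runtime may do.

```python
import heapq

def static_makespan(costs, workers):
    """One contiguous, equal-sized block of iterations per worker."""
    n = len(costs)
    block = -(-n // workers)   # ceiling division
    return max(sum(costs[w * block:(w + 1) * block]) for w in range(workers))

def guided_makespan(costs, workers, min_chunk=4):
    """Whichever worker is free first takes the next, progressively smaller, chunk."""
    finish = [0.0] * workers   # time at which each worker becomes free
    heapq.heapify(finish)
    start, n = 0, len(costs)
    while start < n:
        remaining = n - start
        size = min(remaining, max(min_chunk, -(-remaining // workers)))
        t = heapq.heappop(finish)
        heapq.heappush(finish, t + sum(costs[start:start + size]))
        start += size
    return max(finish)

# Hypothetical workload: the last 16 bodies sit in a dense region and
# need ten times as many interactions as the others.
costs = [1] * 48 + [10] * 16
t_static = static_makespan(costs, 4)
t_guided = guided_makespan(costs, 4)
```

In this workload the static split hands all the expensive bodies to one worker, whereas the guided split spreads them out, so `t_guided` is well below `t_static`.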

Implementation issues - tuning of gravcalc (1)
- performance analysis of gravcalc shows
  » poor cache reuse (90% L1 and 88% L2)
  » poor use of floating point units
  » poor reuse of subexpressions
  » compiler can't generate good code?
- manual tuning of gravcalc
  » inline computation of acceptance criterion
  » inline computation of interaction
  » reuse distance vector (body-cell)
  » fuse loops
  » significant performance improvement!
- observations:
  » 2.5 times faster
  » good scaling
  » better use of FPUs and better prediction
  » cache reuse (93% L1 and 94% L2) still bad

    RECURSIVE SUBROUTINE gravcalc(p, q)
      IF ( q is a body ) THEN
        compute body-body interaction; accumulate
      ELSE
        IF ( p is distant enough from q ) THEN
          compute body-cell interaction; accumulate
        ELSE
          DO q' ∈ nonemptychildren(q)
            CALL gravcalc(p, q')
          END DO
        END IF
      END IF
    END SUBROUTINE gravcalc

Results on O2000 (evans) for 1M particles (times in sec)
[Chart: time in sec vs. number of processors]

Processors                     1          2          4          8         16
tree construction         19.066     17.878     19.527     15.323     13.686
force computation        639.961    315.785    164.764     79.049     44.678
speedup (force comp.)       1.00       2.03       3.88       8.10      14.32

Implementation issues - tuning of gravcalc (2a)
- how can we improve cache reuse?
  » neighboring bodies in space will most likely interact with the same cells and bodies!
- sort bodies according to some spatial order:
  » precompute a spatial order such as Morton order or Peano-Hilbert order
  » or simply order bodies as they are encountered during a depth-first treewalk of T
- Sorted bodies may also speed up subsequent tree rebuilding
[Figures: Morton order; Peano-Hilbert order; tree order]
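As an illustration of precomputing a spatial order, a Morton (Z-order) key in 2D can be formed by interleaving the bits of the quantized coordinates. This is a sketch: the function names, the 16-bit quantization, and the assumption that coordinates lie in [0, 1) are choices made for the example.

```python
def morton_key(ix, iy, bits=16):
    """Interleave the low `bits` bits of ix and iy (x in even positions, y in odd)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def morton_sort(bodies, bits=16):
    """Sort bodies (x, y, ...) with coordinates in [0, 1) along the Z-order curve."""
    scale = 1 << bits
    return sorted(bodies,
                  key=lambda b: morton_key(int(b[0] * scale), int(b[1] * scale), bits))
```

Bodies that share a quadtree cell share the leading bits of their keys, so after sorting they occupy a contiguous range of the body array.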

Implementation issues - tuning of gravcalc (2b)
- observations:
  » 30-40% increase in performance
  » very good scaling
  » L2 reuse now up at 99.8%
  » L1 still at 93%

    stepsystem(P(1:n))
      T := maketree(P(1:n))
      re-order P(1:n) according to T
      forall 1 ≤ i ≤ n do
        f(i) := gravcalc(P(i), T)
      update velocities and positions

Results on O2000 (evans) for 1M particles (times in sec)
[Chart: time in sec vs. number of processors]

Processors                     1          2          4          8         16
tree construction         19.161      14.51     18.524     18.564     19.873
force computation        495.355     247.89    125.225     62.741     31.281
speedup (force comp.)       1.00       2.00       3.96       7.90      15.84

Implementation issues - tuning of gravcalc (3)
- How can we improve L1 reuse?
  » interact a group of bodies with a cell or body!
  » walk the tree and compute forces for a set of neighboring bodies

    RECURSIVE SUBROUTINE gravcalc(set P, node q)
      IF ( q is a body ) THEN
        DO p ∈ P
          compute body-body interaction; accumulate
        END DO
      ELSE
        P' = ∅
        DO p ∈ P
          IF ( p is distant enough from q ) THEN
            compute body-cell interaction; accumulate
          ELSE
            P' = P' ∪ {p}
          END IF
        END DO
        IF (P' .NE. ∅) THEN
          DO q' ∈ nonemptychildren(q)
            CALL gravcalc(P', q')
          END DO
        END IF
      END IF
    END SUBROUTINE gravcalc

- observations:
  » 20-40% increase in performance
  » L1 reuse now at 99.7% (32 bodies per group)
  » L2 down slightly at 96%
  » ordered particles essential

Results on O2000 (evans) for 1M particles (times in sec)
[Chart: time in sec vs. number of processors]

Processors                     1          2          4          8         16
tree construction         20.041     19.471     19.824     18.605     13.716
force computation        421.391    205.309    104.438     51.828     25.805
speedup (force comp.)       1.00       2.05       4.03       8.13      16.33
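The grouped traversal can be sketched in Python as follows. As before, this is an illustration under assumptions not found in the slides: a 2D quadtree built by recursive partitioning, bodies stored as `(x, y, m)` tuples with distinct positions and referred to by index, G = 1, and the test cell size < theta × distance as the acceptance criterion.

```python
import math

class Node:
    """Quadtree cell over body indices; stores total mass and center of mass."""
    def __init__(self, cx, cy, half, idx, bodies):
        self.size = 2.0 * half
        self.mass = sum(bodies[i][2] for i in idx)
        self.comx = sum(bodies[i][0] * bodies[i][2] for i in idx) / self.mass
        self.comy = sum(bodies[i][1] * bodies[i][2] for i in idx) / self.mass
        self.body = idx[0] if len(idx) == 1 else None   # leaf: index of its body
        self.children = []
        if self.body is None:
            h = half / 2.0
            for sx in (-1, 1):
                for sy in (-1, 1):
                    sub = [i for i in idx
                           if (bodies[i][0] >= cx) == (sx > 0)
                           and (bodies[i][1] >= cy) == (sy > 0)]
                    if sub:
                        self.children.append(
                            Node(cx + sx * h, cy + sy * h, h, sub, bodies))

def gravcalc_group(P, q, bodies, force, theta):
    """Accumulate into force[i], for every body index i in P, the pull of cell q."""
    def pull(i, x, y, m):
        dx, dy = x - bodies[i][0], y - bodies[i][1]
        r = math.hypot(dx, dy)
        s = bodies[i][2] * m / r**3
        force[i][0] += s * dx
        force[i][1] += s * dy

    if q.body is not None:                 # q is a single body
        for i in P:
            if i != q.body:
                pull(i, *bodies[q.body])
        return
    rest = []                              # P': bodies that must descend further
    for i in P:
        d = math.hypot(q.comx - bodies[i][0], q.comy - bodies[i][1])
        if q.size < theta * d:
            pull(i, q.comx, q.comy, q.mass)
        else:
            rest.append(i)
    if rest:
        for child in q.children:
            gravcalc_group(rest, child, bodies, force, theta)
```

Each visited node is applied to the whole group before the walk moves on, which is the source of the improved L1 reuse; the group should consist of spatially neighboring bodies, hence "ordered particles essential".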

Implementation issues - tuning of gravcalc (4)
- Another technique to improve L1 reuse
  » allow leaf-cells to contain more than 1 body
  » compute the body-body interactions in a doubly nested loop

    RECURSIVE SUBROUTINE gravcalc(set P, node q)
      P' = ∅
      DO p ∈ P
        IF ( p is distant enough from q ) THEN
          compute body-cell interaction; accumulate
        ELSE
          IF ( q is a leaf ) THEN
            DO q' ∈ q
              compute body-body interaction; accumulate
            END DO
          ELSE
            P' = P' ∪ {p}
          END IF
        END IF
      END DO
      IF (P' .NE. ∅) THEN
        DO q' ∈ nonemptychildren(q)
          CALL gravcalc(P', q')
        END DO
      END IF
    END SUBROUTINE gravcalc

- observations:
  » 10% increase in performance
  » this algorithm will perform strictly more work than the previous versions!
  » more particles per leaf potentially causes more body-body interactions and fewer body-cell interactions to be computed

Results on O2000 (evans) for 1M particles (times in sec)
[Chart: time in sec vs. number of processors]

Processors                     1          2          4          8         16
tree construction         13.179     12.494     13.362     12.682      9.536
force computation        378.345    189.231     94.996     47.866     23.809
speedup (force comp.)       1.00       2.00       3.98       7.90      15.89

Implementation issues - summary
- Shared memory model
  » enables relatively simple parallelization of the basic algorithm using OpenMP
  » shared memory model is critical in dynamic load balancing
- Performance tuning
  » overall, these optimizations lead to 4-5 times faster single-processor performance
  » linear or superlinear parallel speedup to 16 processors
  » optimizing serial performance is essential for obtaining good parallel performance
  » the last two optimizations are instances of exposing parallelism to improve serial performance
- Observations
  » the better the performance of gravcalc, the more seriously the serial tree construction affects the overall speedup
  » when maketree time is included in the speedup:
    speedup drops from 13.00 to 10.8 for p = 16 in the first version
    speedup drops from 15.89 to 11.74 for p = 16 in the last version
  » parallel tree construction algorithms!
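The overall speedups quoted above can be recomputed from the timing tables by adding the tree construction time to the force computation time. The sketch below uses the first-version and last-version rows as given in the results tables; the function names are assumptions.

```python
def overall_speedup(tree, force):
    """Speedup of (tree construction + force computation) relative to 1 processor."""
    total = [t + f for t, f in zip(tree, force)]
    return [total[0] / x for x in total]

def amdahl_bound(serial_fraction):
    """Upper bound on speedup when a fraction f of the work stays sequential."""
    return 1.0 / serial_fraction

# Times in seconds for 1, 2, 4, 8, 16 processors
first_tree = [25.759, 27.444, 29.028, 24.334, 26.066]
first_force = [1568.854, 809.294, 416.174, 196.997, 120.664]
last_tree = [13.179, 12.494, 13.362, 12.682, 9.536]
last_force = [378.345, 189.231, 94.996, 47.866, 23.809]

first = overall_speedup(first_tree, first_force)
last = overall_speedup(last_tree, last_force)

# Sequential fraction f of the first version on one processor, and its bound 1/f
f_first = first_tree[0] / (first_tree[0] + first_force[0])
bound_first = amdahl_bound(f_first)
```

The last entries of `first` and `last` correspond to the p = 16 figures in the summary, and `bound_first` gives the Amdahl limit implied by leaving tree construction sequential.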