Lecture 4: Writing parallel programs with MPI; Measuring performance (Scott B. Baden, CSE 260, Fall 2006)


Announcements
- Wednesday's office hour moved to 1:30
- A new version of Ring (Ring_new) that handles linear sequences of message lengths
- Project proposals due on Tuesday (10/10)

Microbenchmark to determine communication cost model parameters
- Communication cost model: transfer time = α + βn
  - α = message startup time
  - β = inverse peak bandwidth
- [Diagram: sender and receiver exchanging a message of length n]
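
To make the model concrete, here is a minimal ping-pong sketch (a simpler stand-in for the Ring benchmark, with illustrative message sizes and trip counts) that measures one-way transfer time as a function of n; α is roughly the time at small n and β the slope at large n. It assumes at least two ranks.

/* Ping-pong sketch for estimating alpha (startup) and beta (inverse
   bandwidth). Message sizes and trip counts are illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int trips = 1000;
    for (int len = 1; len <= (1 << 20); len *= 2) {
        char *buf = malloc(len);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < trips; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - start) / (2.0 * trips);  /* one-way time */
        if (rank == 0)
            printf("len %8d bytes   time %g s\n", len, t);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}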

Communication Bandwidth (off node) on Blue Horizon
- [Plot of bandwidth vs. message length: peak of about 390 MB/sec reached near N = 4 MB; half-peak bandwidth at n_1/2 ~ 96 KB]

Communication times
- [Plot of measured communication times]

The Ring program
- We configure the processors in a logical ring and pass messages around the ring multiple times
- Assume there are p processors
- Neighbors of processor k are (k + 1) mod p and (k + p - 1) mod p
- [Diagram: four processors p0, p1, p2, p3 connected in a ring]

Measurement technique with Ring

for (int len = 1, l = 0; len <= maxsize; len *= 2, l++) {
    if (myid == 0) {
        // (WARM UP CODE)
        const double start = MPI_Wtime();
        for (int i = 0; i < trips; i++) {
            // PROCESSOR 0 CODE
        }
        const double delta = MPI_Wtime() - start;
        Bandwidth = (long)((trips * len * nodes) / delta / 1000.0);
    } else {           // myid != 0
        // (WARM UP CODE)
        for (int i = 0; i < trips; i++) {
            // ALL OTHER PROCESSORS
        }
    }
}

The Ring program

Processor 0:

    MPI_Request req;
    MPI_Irecv(buffer, len, MPI_CHAR, (rank + p - 1) % p, tag, MPI_COMM_WORLD, &req);
    MPI_Send(buffer, len, MPI_CHAR, (rank + 1) % p, tag, MPI_COMM_WORLD);
    MPI_Status status;
    MPI_Wait(&req, &status);

All others:

    MPI_Status status1;
    MPI_Recv(buffer, len, MPI_CHAR, (rank + p - 1) % p, tag, MPI_COMM_WORLD, &status1);
    MPI_Send(buffer, len, MPI_CHAR, (rank + 1) % p, tag, MPI_COMM_WORLD);

Reporting and Displaying Performance
- Give the viewer sufficient information to
  - Draw their own conclusions
  - Reproduce your results
- Tabulate and display the results fairly
- Avoid misleading techniques
- See the Bailey paper for examples of how not to display and report performance data: http://crd.lbl.gov/~dhbailey/dhbpapers/mislead.pdf

Challenges to measuring performance
- Reproducibility
  - Transient system operating conditions
  - Differing systems or program configuration
- Measurements are imprecise
  - Heisenberg uncertainty principle: the measurement technique may affect performance
  - Overheads and inaccuracy
- Explain anomalous behavior, but ignore anomalies that are not significant

Complications
- Cost of measuring a full run is prohibitive
  - Ignore startup code if you plan to run for a much longer time in production
- Transient behavior
  - Repeat your measurements
  - Warm up the code before collecting measurements
  - Ignore outliers unless their behavior is important to you
- Average time, maximum time, minimum time?

Measurement collection
- Report the best timings
  - Repeat runs 3 to 5 times, until at least 2 measurements agree to within 5% or 10%
  - Report the minimum time
- Also report outliers
  - A scatter plot or error bars can be useful
- [Scatter plot: compute and communicate times across repeated runs]
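
A minimal sketch of this discipline, using a stand-in kernel() as the code under test and an arbitrary repetition count of 5; it warms up once, then reports the minimum and the spread.

/* Repeated-timing sketch: kernel() is a hypothetical stand-in for the
   code being measured; the warm-up pass and NREPS follow the guidelines. */
#include <mpi.h>
#include <stdio.h>

static void kernel(void) {                 /* illustrative workload */
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += i * 1e-9;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    kernel();                              /* warm up before measuring */
    const int NREPS = 5;
    double tmin = 1e30, tmax = 0.0;
    for (int r = 0; r < NREPS; r++) {
        MPI_Barrier(MPI_COMM_WORLD);       /* start all ranks together */
        double t0 = MPI_Wtime();
        kernel();
        double t = MPI_Wtime() - t0;
        if (t < tmin) tmin = t;
        if (t > tmax) tmax = t;
    }
    if (rank == 0)
        printf("min %g s  max %g s  spread %.1f%%\n",
               tmin, tmax, 100.0 * (tmax - tmin) / tmin);

    MPI_Finalize();
    return 0;
}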

Measures of time
- Timing collection
  - Elapsed, or wall clock, time
  - CPU time = system + user time
  - Overhead, resolution, and quantization effects
- Measurement tools
  - Unix time command does a reasonable job for long-running programs
  - Hardware performance monitors
  - System clocks: often platform dependent, especially library routines
  - MPI provides MPI_Wtime( ), elapsed or wall clock time

Qualifying measurements
- Specify version and options
  - Compiler
  - Operating system
  - Numerical libraries
- Establish appropriate operating conditions
  - Program inputs
  - System environment variables
  - Dedicated system access
- Example:

    uname -a
    Linux valkyrie.ucsd.edu 2.6.9-5.0.5.ELsmp #1 SMP Wed Apr 20 00:16:40 BST 2005 i686 i686 i386 GNU/Linux
    g++ -v
    Reading specs from /usr/lib/gcc/i386-redhat-linux/3.4.5/specs
    Configured with: ... host=i386-redhat-linux
    Thread model: posix
    gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)

A first application
- Compute a numerical approximation to the definite integral ∫_a^b f(x) dx using the trapezoidal rule

How the trapezoidal rule works
- Divide the interval [a,b] into n segments of size h = (b-a)/n
- Approximate the area under each segment using a trapezoid
- Area under the i-th trapezoid ≈ (f(a+i*h) + f(a+(i+1)*h)) * h/2
- Area under the entire curve ≈ sum of all the trapezoids
- [Figure: one trapezoid of width h spanning a+i*h to a+(i+1)*h]

Reference material
- For a discussion of the trapezoidal rule: http://metric.ma.ic.ac.uk/integration/techniques/definite/numericalmethods/trapezoidal-rule
- An applet to carry out integration: http://www.csse.monash.edu.au/~lloyd/tildealgds/numerical/integration
- Code (from the Pacheco hard copy text), with PUB = /export/home/cs260x-public
  - Serial code: PUB/Pacheco/ppmpi_c/chap04/serial.c
  - Parallel code: PUB/Pacheco/ppmpi_c/chap04/trap.c

Serial code (following Pacheco)

float f(float x) { return x*x; }      // Function we're integrating

main() {
    // a and b: endpoints, n = # of trapezoids
    float h = (b-a)/n;                // h = trapezoid base width
    float integral = (f(a) + f(b))/2.0;
    float x;
    int i;
    for (i = 1, x = a; i <= n-1; i++) {
        x += h;
        integral = integral + f(x);
    }
    integral = integral*h;
}

The parallel algorithm
- Decompose the integration interval into subintervals, one per processor
- Each processor computes the integral on its local subdomain
- Processors combine their local integrals into a global one

First version of the parallel code

local_n = n/p;    // Number of trapezoids; assume p divides n evenly
float local_a  = a + my_rank*local_n*h,
      local_b  = local_a + local_n*h,
      integral = Trap(local_a, local_b, local_n, h);

if (my_rank == 0) {
    // Sum the integrals calculated by all the processes
    total = integral;
    for (source = 1; source < p; source++) {
        MPI_Recv(&integral, 1, MPI_FLOAT, source, tag, WORLD, &status);
        total += integral;
    }
} else {
    MPI_Send(&integral, 1, MPI_FLOAT, dest, tag, WORLD);
}

Improvements
- The result does not depend on the order in which the sums are taken
- We use a linear time algorithm to accumulate contributions, but there are other orderings

for (source = 1; source < p; source++) {
    MPI_Recv(&integral, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, WORLD, &status);
    total += integral;
}

Collective communication
- We can often improve performance by taking advantage of global knowledge about communication
- Instead of using point-to-point communication operations to accumulate the sum, use collective communication

Collective communication in MPI
- Collective operations are called by all processes within a communicator
- Broadcast: distribute data from a designated root process to all the others
  MPI_Bcast(in, count, type, root, comm)
- Reduce: combine data from all processes and return the result to a designated root process
  MPI_Reduce(in, out, count, type, op, root, comm)
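
For concreteness, here is a self-contained sketch (not Pacheco's trap.c) of the trapezoid program written with these two collectives: rank 0's inputs are broadcast to all ranks, and a reduce combines the partial integrals. The endpoints and trapezoid count are illustrative values.

/* Sketch: broadcast inputs, compute local sums, reduce to rank 0. */
#include <mpi.h>
#include <stdio.h>

float f(float x) { return x * x; }        /* function we're integrating */

/* Local trapezoidal sum over [local_a, local_b] with local_n trapezoids. */
float Trap(float local_a, float local_b, int local_n, float h) {
    float integral = (f(local_a) + f(local_b)) / 2.0f;
    float x = local_a;
    for (int i = 1; i <= local_n - 1; i++) {
        x += h;
        integral += f(x);
    }
    return integral * h;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int my_rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    float a = 0.0f, b = 1.0f;             /* endpoints (illustrative)   */
    int n = 1024;                         /* trapezoids; assume p | n   */

    /* Rank 0 owns the inputs; give everyone a copy. */
    MPI_Bcast(&a, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&b, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&n, 1, MPI_INT,   0, MPI_COMM_WORLD);

    float h       = (b - a) / n;
    int   local_n = n / p;
    float local_a = a + my_rank * local_n * h;
    float local_b = local_a + local_n * h;
    float integral = Trap(local_a, local_b, local_n, h);

    /* Combine partial integrals; only rank 0 receives the total. */
    float total = 0.0f;
    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_rank == 0)
        printf("integral of x^2 on [%g,%g] ~ %g\n", a, b, total);

    MPI_Finalize();
    return 0;
}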

Broadcast
- The root process transmits m pieces of data to all the p-1 other processors
- With the linear ring algorithm this processor performs p-1 sends of length m
  - Cost is (p-1)(α + βm)
- Another approach is to use the hypercube algorithm, which has a logarithmic running time

What is a hypercube?
- A hypercube is a d-dimensional graph with 2^d nodes
- A 0-cube is a single node, a 1-cube is a line connecting two points, a 2-cube is a square, etc.
- Each node has d neighbors

Properties of hypercubes
- A hypercube with p nodes has lg(p) dimensions
- Inductive construction: we may construct a d-cube from two (d-1)-dimensional cubes
- Diameter: what is the maximum distance between any 2 nodes?
- Bisection bandwidth: how many cut edges (mincut)?
- [Figure: a 2-cube with nodes labeled 00, 01, 10, 11]

Bookkeeping
- Label nodes with a binary reflected Gray code: http://www.nist.gov/dads/html/graycode.html
- Neighboring labels differ in exactly one bit position, e.g. 001 = 101 ⊕ e_2, where e_2 = 100

Hypercube broadcast algorithm with p=4
- Processor 0 is the root; it sends its data to its hypercube buddy on processor 2 (10)
- Processors 0 and 2 then send the data to their respective buddies (1 and 3)
- [Figure: three snapshots of the 2-cube (nodes 00, 01, 10, 11) showing the data spreading]
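
A minimal sketch of this dimension-by-dimension scheme, assuming the number of processes is a power of two and rank 0 is the root (in practice one would simply call MPI_Bcast):

/* Hypercube (binomial tree) broadcast sketch: p must be a power of two.
   At step mask, every rank that is a multiple of 2*mask already holds
   the data and sends it to its buddy rank ^ mask.                      */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int data = (rank == 0) ? 42 : -1;     /* root's payload (illustrative) */

    for (int mask = p >> 1; mask > 0; mask >>= 1) {
        if (rank % (mask << 1) == 0) {
            /* Already have the data: pass it to the buddy in this dimension. */
            MPI_Send(&data, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD);
        } else if (rank % (mask << 1) == mask) {
            /* Receive the data for the first time. */
            MPI_Recv(&data, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    printf("rank %d has data %d\n", rank, data);
    MPI_Finalize();
    return 0;
}

With p = 4 this reproduces the steps above: 0 sends to 2, then 0 and 2 send to 1 and 3, for lg(p) communication steps in total.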

Reduction
- We may use the hypercube algorithm to perform reductions as well as broadcasts
- Use a variant of reduction, Allreduce( )
  - Everyone obtains a copy of the reduced result
  - Equivalent to a Reduce( ) followed by a Bcast( )
  - A clever algorithm performs an Allreduce in one phase rather than having to perform separate reduce and broadcast phases

Improved parallel code

int local_n = n/p;
float local_a  = a + my_rank*local_n*h,
      local_b  = local_a + local_n*h,
      integral = Trap(local_a, local_b, local_n, h);

MPI_Allreduce(&integral, &total, 1 /* count */, MPI_FLOAT, MPI_SUM, WORLD);

Iterative methods for solving systems of equations

Mesh based methods
- Many physical problems are simulated on a uniform mesh in 1, 2 or 3 dimensions
- Values are defined on a discrete set of points
  - A mapping from ordered pairs to physical observables like temperature and pressure
- One application: differential equations

Differential equations
- A differential equation is a set of equations involving derivatives of a function (or functions), and specifies a solution to be determined under certain constraints
- Constraints often specify boundary conditions or initial values that the solution must satisfy
- When the functions have multiple variables we have a Partial Differential Equation (PDE)
  - ∂²u/∂x² + ∂²u/∂y² = 0 within a square box, x, y ∈ [0,1]
  - u(x,y) = sin(x)*sin(y) on ∂Ω, the perimeter of the box
- When the functions have a single variable we have an Ordinary Differential Equation (ODE)
  - -u''(x) = f(x), x ∈ [0,1], u(0) = a, u(1) = b

Solving an ODE with a discrete approximation
- Solve the ODE -u''(x) = f(x), x ∈ [0,1]
- Define u_i = u(i*h) at the points x = i*h, with h = 1/(n-1)
- Approximate the derivative: u''(x) ≈ (u(x+h) - 2u(x) + u(x-h))/h²
- Obtain the system of equations -(u_{i-1} - 2u_i + u_{i+1})/h² = f_i, i ∈ 1..n-2
- [Figure: stencil showing u_{i-1}, u_i, u_{i+1} spaced h apart]

Iterative solution
- Rewrite the system of equations: (-u_{i-1} + 2u_i - u_{i+1})/h² = f_i, i ∈ 1..n-1
- It can be shown that the following Gauss-Seidel algorithm will arrive at the solution, assuming an initial guess for the u_i

Repeat until the result is satisfactory
    for i = 1 : N-1
        u_i = (u_{i+1} + u_{i-1} + h²·f_i)/2
    end for
end Repeat
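
A compact serial sketch of the sweep, assuming u and f are arrays of length N+1 with the boundary values held in u[0] and u[N]; the tolerance and iteration cap are illustrative choices, not values prescribed in the lecture.

/* Serial Gauss-Seidel sketch for -u'' = f on [0,1]. Updated values are
   used immediately within the sweep, which is what makes it Gauss-Seidel. */
#include <math.h>

void gauss_seidel(double *u, const double *f, int N, double h) {
    const double tol = 1e-6;          /* illustrative stopping tolerance */
    const int max_iter = 100000;
    for (int iter = 0; iter < max_iter; iter++) {
        double maxdiff = 0.0;
        for (int i = 1; i <= N - 1; i++) {
            double unew = (u[i+1] + u[i-1] + h*h*f[i]) / 2.0;
            double d = fabs(unew - u[i]);
            if (d > maxdiff) maxdiff = d;
            u[i] = unew;              /* updated value used immediately  */
        }
        if (maxdiff < tol) break;     /* "result is satisfactory"        */
    }
}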

Convergence
- Convergence is slow
- We reach the desired precision in O(N²) iterations

Estimating the error
- How do we know when the answer is good enough?
  - The computed solution has reached a reasonable approximation to the exact solution
- We validate the computed solution in the field, i.e. wet lab experimentation
- But we often don't know the exact solution, and must estimate the error

Using the residual to estimate the error
- Recall the equations: (-u_{i-1} + 2u_i - u_{i+1})/h² = f_i, i ∈ 1..n-1   [Au = f]
- Define the residual r_i   [r = Au - f]:
  r_i = (-u_{i-1} + 2u_i - u_{i+1})/h² - f_i, i ∈ 1..n-1
- Thus, our computed solution is correct when r_i = 0
- We can obtain a good estimate of the error by finding the maximum of |r_i| over i
- Another possibility is to take the root mean square (L2 norm), √(Σ_i r_i²)

Parallel implementation
- The first step is to partition the data
- We partition the data into intervals, assigning each to a unique processor
- Many-to-one mappings may also be useful
- [Figure: the 1D mesh split into four intervals owned by P0, P1, P2, P3]

Data dependences
- Each interval depends on two endpoint values found on neighboring processes
- We add overlap or ghost cells to hold a copy of the off-processor values
- [Figure: intervals on P0, P1, P2, P3 with ghost cells at their boundaries]
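
A minimal sketch of the ghost-cell exchange, assuming each rank stores its interior values in u[1..local_n] with ghost cells at u[0] and u[local_n+1]; the function name and array layout are assumptions for illustration.

#include <mpi.h>

/* Exchange ghost cells for a 1D decomposition. Ranks at the ends of the
   global interval use MPI_PROC_NULL, which turns those transfers into
   no-ops, so no special-case code is needed there.                     */
void exchange_ghosts(double *u, int local_n, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send leftmost interior value left, receive right ghost. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send rightmost interior value right, receive left ghost. */
    MPI_Sendrecv(&u[local_n], 1, MPI_DOUBLE, right, 1,
                 &u[0],       1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}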

Dependences
- Our attempt to parallelize the algorithm fails, since there are loop carried dependences
- The value of u[i] computed in iteration i depends on u[i-1], which was computed in iteration i-1

for i = 1 : N-1
    u[i] = (u[i-1] + u[i+1] + h*h*f[i])/2
end for

Parallel implementation
- We avoid the dependences by renaming the LHS of the assignment
- Two arrays, u and unew
- Update unew:

for i = 1 : N-1
    unew[i] = (u[i-1] + u[i+1] + h*h*f[i])/2
end for

- Then swap u and unew
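
In C the swap can be done by exchanging pointers rather than copying arrays; here is a sketch of one sweep under that convention (the function name and argument layout are illustrative).

/* One Jacobi-style sweep: read from u, write to unew, then swap the
   pointers so the roles reverse on the next sweep.                  */
void jacobi_sweep(double **u_ptr, double **unew_ptr,
                  const double *f, int N, double h) {
    double *u = *u_ptr, *unew = *unew_ptr;
    for (int i = 1; i <= N - 1; i++)
        unew[i] = (u[i-1] + u[i+1] + h*h*f[i]) / 2.0;
    unew[0] = u[0];            /* carry the boundary values along */
    unew[N] = u[N];
    *u_ptr = unew;             /* swap roles for the next sweep   */
    *unew_ptr = u;
}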

Tradeoffs
- We can now parallelize the algorithm, since there are no longer any loop carried dependences within the i loop
- However, this change reduces the convergence rate by about a factor of two
  - We have doubled the amount of work we need to perform in exchange for being able to realize parallel speedups
- This kind of tradeoff is common, but we can do better

Convergence check
- Each process computes the error for its assigned part of the problem
- We need a global error so that we compute a result that is consistent with the single processor implementation
- This requires collective communication: a reduction
- In MPI we use Allreduce, which leaves a copy of the result in the memory of all processes
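
A sketch of that global check, assuming each rank has already computed its local error (the helper name and tolerance are hypothetical):

/* Combine per-rank error estimates into a single global value that every
   rank sees, so all ranks make the same stopping decision.              */
double local_err = compute_local_error();   /* hypothetical helper: max |r_i|
                                               over this rank's interval   */
double tol = 1e-6;                          /* illustrative tolerance       */
double global_err;
MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

int converged = (global_err < tol);         /* identical on every rank      */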