Optimization Techniques for Parallel Code 1. Parallel programming models

Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017

Goals of the course How to design parallel algorithms? How to implement them on parallel architectures (e.g. GPU)? How to get good performance? Will focus on the NVIDIA CUDA environment Similar concepts in OpenCL Non-goals: we will not cover Graphics programming (OpenGL, Direct3D) High-level languages that target GPUs (OpenACC, OpenMP 4) Parallel programming for multi-core and clusters (OpenMP, MPI...) Programming for SIMD extensions (SSE, AVX...) 2

Why graphics processing unit (GPU)? Graphics rendering accelerator for computer games Mass market: low unit price, amortized R&D Increasing programmability and flexibility Inexpensive, high-performance parallel processor GPUs are everywhere, from cell phones to supercomputers 3

GPUs in high-performance computing 2002: General-Purpose computation on GPU (GPGPU) Scientists run non-graphics computations on GPUs 2010s: the fastest supercomputers use GPUs/accelerators Today: Heterogeneous multi-core processors influenced by GPUs #1 Sunway TaihuLight (China): 40,960 SW26010 processors (4 big + 256 small cores) #2 Tianhe-2 (China): 16,000 nodes (2× 12-core Xeon + 3× 57-core Xeon Phi), being upgraded to Matrix-2000 coprocessors 4

Outline of the course Today: Introduction to parallel programming models PRAM BSP Multi-BSP GPU / SIMD accelerator programming The software side Programming model Performance optimization Memory access optimization Compute optimization Advanced features Warp-synchronous programming, thread groups Unified memory Legal disclaimer: Non-binding information, subject to change. 5

Introduction to parallel prog. models Today's course Goals: Be aware of the major parallel models and their characteristics Write platform-independent parallel algorithms Incomplete: just an overview, not an actual programming model course Just cover what we need for the rest of the course Informal: no formal definition, no proofs of theorems Proofs by handwaving 6

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm example: parallel prefix 7

Why programming models? Abstract away the complexity of the machine Abstract away implementation differences (languages, architectures, runtime...) Focus on the algorithmic choices First design the algorithm, then implement it Define the asymptotic algorithmic complexity of algorithms Compare algorithms independently of the implementation / language / computer / operating system... 8

Serial model: RAM / Von Neumann / Turing Abstract machine: program instructions execute in sequence (compute, jump, read, write to memory) One unlimited memory, constant access time Basis for classical algorithmic complexity theory Complexity = number of instructions executed as a function of problem size 9

Serial algorithms Can you cite examples of serial algorithms and their complexity? 10

Serial algorithms Can you cite examples of serial algorithms and their complexity? Binary search: O(log n) Quick sort: O(n log n) Factorization of an n-digit number: O(e^n) Multiplication of n-digit numbers: O(n log n) (conjectured lower bound) < O(n log n · 2^O(log* n)) (best known algorithm, Fürer 2007) < O(n log n log log n) (previously best known algorithm, Schönhage-Strassen 1971) Still an active area of research! But for parallel machines, we need parallel models 11

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm example: parallel prefix 12

Parallel model: PRAM 1 memory, constant access time N processors Synchronous steps All processors run an instruction simultaneously Usually the same instruction Each processor knows its own number Processors P0, P1, ..., Pn execute program instructions in sequence and read/write a single shared memory S. Fortune and J. Wyllie. Parallelism in random access machines. Proc. ACM symposium on Theory of computing. 1978. 13

PRAM example: SAXPY Given vectors X and Y of size n, compute the vector R ← a·X + Y with n processors Processor Pi: xi ← X[i] yi ← Y[i] ri ← a·xi + yi R[i] ← ri Complexity: O(1) with n processors, O(n) with 1 processor 17
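
As a preview of the CUDA environment used later in the course, here is a minimal sketch of how this PRAM SAXPY maps to a kernel (the kernel name saxpy and device pointers are assumptions for illustration): each CUDA thread plays the role of one PRAM processor Pi.

#include <cuda_runtime.h>

// One thread per element: thread i plays the role of PRAM processor Pi.
__global__ void saxpy(float a, const float *X, const float *Y, float *R, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the grid may be larger than n
        R[i] = a * X[i] + Y[i];   // xi <- X[i]; yi <- Y[i]; R[i] <- a*xi + yi
}

// Launch example: saxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, d_r, n);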

Access conflicts: PRAM machine types Can multiple processors read the same memory location? Exclusive Read: any 2 processors have to read different locations Concurrent Read: 2 processors can access the same location Can multiple processors write the same location? Exclusive Write: any 2 processors have to write different locations Concurrent Write: allow reduction operators: And, Or, Sum... Types of PRAM machines: EREW, CREW, CRCW... We assume CREW in this lecture 18

Reduction example Given a vector a of size n, compute the sum of its elements with n/2 processors Sequential algorithm: for i = 0 to n-1: sum ← sum + a[i] Complexity: O(n) with 1 processor All operations are dependent: no parallelism Need to use associativity to reorder the sum 19

Parallel reduction: dependency graph We compute r = (...((a0 ⊕ a1) ⊕ (a2 ⊕ a3)) ⊕ ... ⊕ an-1): first pairs, then groups of 4, and so on up to the root r 20

Parallel reduction: PRAM algorithm First step Processor Pi: a[i] = a[2*i] + a[2*i+1] (processors P0..P3 halve the array: each computes a[i] ← a[2i] + a[2i+1]) 21

Parallel reduction: PRAM algorithm Second step Reduction on n/2 processors Other processors stay idle Processor Pi: n' = n/2 if i < n' then a[i] = a[2*i] + a[2*i+1] end if 22

Parallel reduction: PRAM algorithm Complete loop Processor Pi: while n > 1 do n = n / 2 if i < n then a[i] = a[2*i] + a[2*i+1] end if end while Complexity: O(log(n)) with O(n) processors 23
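
For a concrete mapping, here is a minimal single-block CUDA sketch of this PRAM reduction (the kernel name reduce_pram is an assumption; n is assumed to be a power of two with n <= blockDim.x). PRAM assumes all reads of a step complete before its writes; on a real GPU we emulate that with an extra barrier.

__global__ void reduce_pram(const float *a, float *result, int n)
{
    extern __shared__ float s[];          // n floats of shared memory
    int i = threadIdx.x;
    if (i < n) s[i] = a[i];
    __syncthreads();

    for (int m = n / 2; m >= 1; m /= 2) { // m = number of active "processors"
        float v = 0.0f;
        if (i < m) v = s[2 * i] + s[2 * i + 1];  // read phase of the PRAM step
        __syncthreads();                          // PRAM: reads complete before writes
        if (i < m) s[i] = v;                      // write phase
        __syncthreads();
    }
    if (i == 0) *result = s[0];
}

// Launch example: reduce_pram<<<1, n, n * sizeof(float)>>>(d_a, d_result, n);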

Less than n processors A computer that grows with problem size is not very realistic Assume p processors for n elements, p < n Back to the R ← a·X + Y example Solution 1: group by blocks, assign a contiguous range to each processor Processor Pi: for j = i*(n/p) to (i+1)*(n/p) - 1: xj ← X[j] yj ← Y[j] rj ← a·xj + yj R[j] ← rj 24

Solution 2: slice and interleave Processor Pi: for j = i to n-1 step p: xj ← X[j] yj ← Y[j] rj ← a·xj + yj R[j] ← rj The two solutions are equivalent in the PRAM model, but not in actual programming! 27
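
The difference shows up as soon as we leave the PRAM model. A minimal CUDA sketch of the two access patterns (kernel names saxpy_blocked and saxpy_interleaved are assumptions): on a GPU the interleaved version, known as a grid-stride loop, is the one that gives coalesced memory accesses.

// Solution 1: each thread handles a contiguous block of elements
__global__ void saxpy_blocked(float a, const float *X, const float *Y, float *R,
                              int n, int p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // "processor" index
    int chunk = (n + p - 1) / p;                     // block size per thread
    for (int j = i * chunk; j < min((i + 1) * chunk, n); ++j)
        R[j] = a * X[j] + Y[j];
}

// Solution 2: elements are interleaved across threads (grid-stride loop)
__global__ void saxpy_interleaved(float a, const float *X, const float *Y,
                                  float *R, int n)
{
    int p = gridDim.x * blockDim.x;                  // total number of threads
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += p)
        R[j] = a * X[j] + Y[j];
}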

Brent's theorem Generalization for any PRAM algorithm of complexity C(n) with n processors Derive an algorithm for n elements with p processors only Idea: each of the p processors emulates n/p virtual processors Each emulated step takes n/p actual steps Complexity: O(n/p * C(n)) 28

Interpretation of Brent's theorem We can design algorithms for a variable (unlimited) number of processors They can run on any machine with fewer processors with known complexity There may exist a more efficient algorithm with p processors only 29
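
A quick worked example of this last point, using the reduction above: naive emulation via Brent's theorem turns the O(log n) algorithm into O((n/p) log n) steps on p processors, whereas a p-processor algorithm that first lets each processor sum its own block of n/p elements sequentially and then tree-reduces the p partial sums runs in O(n/p + log p), which is asymptotically better.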

PRAM: wrap-up Unlimited parallelism Unlimited shared memory Only count operations: zero communication cost Pros: simple model, high-level abstraction; can model many processors or even hardware Cons: simple model, far from implementation 30

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm example: parallel prefix 31

PRAM limitations PRAM model proposed in 1978 Inspired by SIMD machines of the time Assumptions All processors synchronized every instruction Negligible communication latency Useful as a theoretical model, but far from modern computers ILLIAC-IV, an early SIMD machine 32

Modern architectures Modern supercomputers are clusters of computers Global synchronization costs millions of cycles Memory is distributed Inside each node: multi-core CPUs, GPUs Non-uniform memory access (NUMA) Synchronization cost at all levels Mare Nostrum, a modern distributed-memory machine 33

Bulk-Synchronous Parallel (BSP) model Assumes distributed memory But also works with shared memory Good fit for GPUs too, with a few adaptations Processors execute instructions independently Communications between processors are explicit Processors need to synchronize with each other (processors P0..Pn, each with its own local memory, connected by a communication network with synchronization) L. Valiant. A bridging model for parallel computation. Comm. ACM 1990. 34

Superstep A program is a sequence of supersteps Superstep: each processor computes, sends results, receives data Barrier: wait until all processors have finished their superstep Next superstep: can use the data received in the previous step 35

Example: reduction in BSP Start from dependency graph again a 0 a 1 a 2 a 3 a 4 a 5 a 6 a 7 r 36

Reduction: BSP Adding barriers: at each step, processor Pi computes a[i] ← a[2i] + a[2i+1], then all processors communicate and wait at a barrier before the next step 37

Reducing communication Data placement matters in BSP Optimization: keep the left-side operand local: step 1: a[2i] ← a[2i] + a[2i+1], step 2: a[4i] ← a[4i] + a[4i+2], step 3: a[8i] ← a[8i] + a[8i+4], with communication and a barrier between steps 38
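
To make the superstep structure concrete, here is a minimal CUDA sketch (kernel name reduce_superstep is an assumption; n is assumed to be a power of two) in which each kernel launch is one superstep: writes to global memory are the communication, and the implicit synchronization between launches plays the role of the barrier. It keeps the left operand local, as on the slide.

__global__ void reduce_superstep(float *a, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int left = 2 * i * stride;            // a[2i*stride] += a[2i*stride + stride]
    if (left + stride < n)
        a[left] += a[left + stride];
}

// Host: one launch per superstep; the final result ends up in a[0]
// for (int stride = 1; stride < n; stride *= 2) {
//     int active = n / (2 * stride);
//     reduce_superstep<<<(active + 255) / 256, 256>>>(d_a, n, stride);
// }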

Wrap-up BSP takes synchronization cost into account Explicit communications Homogeneous communication cost: it does not depend on which processors communicate 39

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm examples 40

The world is not flat It is hierarchical! 41

Multi-BSP model Multi-BSP: BSP generalization with groups of processors in multiple nested levels Level 2 superstep Level 1 superstep Level 1 Barrier Level 2 Barrier Higher level: more expensive synchronization Arbitrary number of levels L. Valiant. A bridging model for multi-core computing. Algorithms-ESA 2008. 42

Multi-BSP and multi-core Minimize communication cost on hierarchical platforms Make the parallel program hierarchical too Take thread affinity into account Map supersteps to the memory hierarchy: threads on a core share the L1 cache (L1 superstep), cores share the L2 and L3 caches (L2 and L3 supersteps) On clusters (MPI): add more levels up On GPUs (CUDA): add more levels down 43

Reduction: Multi-BSP Break into 2 levels: level 1 reduction inside each group of processors (P0-P3 and P4-P7), separated by cheap L1 barriers; level 2 reduction across the groups after the expensive L2 barrier, producing r 44
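
In CUDA terms, the two levels map naturally onto blocks and the grid. A minimal sketch in the spirit of Multi-BSP (kernel name reduce_block and the choice of 256 threads per block are assumptions): level 1 is the reduction within a block, synchronized by the cheap __syncthreads() barrier; level 2 is the reduction of the per-block partial sums, which needs the more expensive grid-wide synchronization of a kernel boundary.

__global__ void reduce_block(const float *a, float *partial, int n)
{
    __shared__ float s[256];              // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int gi  = blockIdx.x * blockDim.x + tid;

    s[tid] = (gi < n) ? a[gi] : 0.0f;
    __syncthreads();

    for (int m = blockDim.x / 2; m >= 1; m /= 2) {   // level-1 reduction
        if (tid < m) s[tid] += s[tid + m];
        __syncthreads();                             // cheap L1 barrier
    }
    if (tid == 0) partial[blockIdx.x] = s[0];        // one partial sum per block
}

// Host (level 2): run the same kernel again on the per-block partial sums,
// or reduce the small partial array on the CPU.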

Recap PRAM: single shared memory; many processors in lockstep. BSP: distributed memory, message passing; synchronization with barriers. Multi-BSP: BSP with multiple scales. From the most abstract (PRAM) to the closest to the implementation (Multi-BSP). 45

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm examples Parallel prefix Stencil 46

Work efficiency Work: number of operations Work efficiency: ratio between work of sequential algorithm and work of parallel algorithm Example: reduction Sequential sum: n-1 operations Parallel reduction: n-1 operations Work efficiency: 1 This is the best case! 47

Parallel prefix Problem: on a given vector a, compute for each element the prefix sum s_i = Σ_{k=0}^{i} a_k (the exclusive variant sums only the preceding elements) Result: vector s = (s_0, s_1, ..., s_{n-1}) For example, a = (3, 1, 7, 2) gives the inclusive prefix sums s = (3, 4, 11, 13) The sum can be any associative operator Applications: stream compaction, graph algorithms, addition in hardware Mark Harris, Shubhabrata Sengupta, and John D. Owens. "Parallel prefix sum (scan) with CUDA." GPU Gems 3, 2007 48

Sequential prefix: graph Sequential algorithm: sum ← 0 for i = 0 to n-1 do s[i] ← sum sum ← sum + a[i] end for Can we parallelize it? Yes: add more computations to reduce the depth of the dependency graph 50

Parallel prefix Optimize for graph depth Idea: Make a reduction tree for each output a 0 a 1 a 2 a 3 Reuse common intermediate results s 0 s 1 s 2 s 3 51

A simple parallel prefix algorithm Also known as Kogge-Stone in hardware Notation: Σi-j is the sum Σ_{k=i}^{j} a_k Initialization: s[i] ← a[i] Step 0: if i ≥ 1 then s[i] ← s[i-1] + s[i] Step 1: if i ≥ 2 then s[i] ← s[i-2] + s[i] Step 2: if i ≥ 4 then s[i] ← s[i-4] + s[i] In general, step d: if i ≥ 2^d then s[i] ← s[i-2^d] + s[i] Hillis, W. Daniel, and Guy L. Steele Jr. "Data parallel algorithms." Communications of the ACM 29.12 (1986): 1170-1183. 52
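
A minimal single-block CUDA sketch of this algorithm (kernel name scan_hillis_steele is an assumption; n is assumed to be a power of two with n <= blockDim.x). It double-buffers the shared array so that each step reads the values produced by the previous step; it is not the optimized implementation from the GPU Gems 3 chapter cited earlier.

__global__ void scan_hillis_steele(const float *a, float *s, int n)
{
    extern __shared__ float buf[];      // 2*n floats: double buffer
    float *in = buf, *out = buf + n;
    int i = threadIdx.x;

    if (i < n) in[i] = a[i];
    __syncthreads();

    for (int d = 1; d < n; d *= 2) {    // step d adds the element 2^d to the left
        if (i < n)
            out[i] = (i >= d) ? in[i - d] + in[i] : in[i];
        __syncthreads();
        float *tmp = in; in = out; out = tmp;   // swap buffers (same on all threads)
    }
    if (i < n) s[i] = in[i];            // inclusive prefix sums
}

// Launch example: scan_hillis_steele<<<1, n, 2 * n * sizeof(float)>>>(d_a, d_s, n);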

Analysis Complexity log 2 (n) steps with n processors Work efficiency O(n log 2 (n)) operations 1/log 2 (n) work efficiency All communications are global Bad fit for Multi-BSP model 53

Another parallel prefix algorithm Two passes First, upsweep: parallel reduction Compute partial sums Keep intermediate (non-root) results for d = 0 to log2(n)-1 do if i mod 2^(d+1) = 2^(d+1) - 1 then s[i] ← s[i-2^d] + s[i] end if end for Guy E. Blelloch, Prefix sums and their applications, Tech. Report, 1990. 54

Second pass: downsweep Set root to zero Go down the tree Left child: propagate parent Right child: compute parent + left child In parallel s[n-1] ← 0 for d = log2(n)-1 down to 0 do if i mod 2^(d+1) = 2^(d+1) - 1 then x ← s[i] s[i] ← s[i-2^d] + x s[i-2^d] ← x end if end for Hopefully nobody will notice the off-by-one difference... 55
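
For completeness, a minimal single-block CUDA sketch of this work-efficient scan (kernel name scan_blelloch is an assumption; n is assumed to be a power of two with n <= blockDim.x). It indexes each subtree by thread number rather than by i mod 2^(d+1), but performs the same up- and downsweep; it omits the bank-conflict avoidance of the GPU Gems 3 chapter.

__global__ void scan_blelloch(const float *a, float *s, int n)
{
    extern __shared__ float t[];       // n floats
    int i = threadIdx.x;
    if (i < n) t[i] = a[i];
    __syncthreads();

    // Upsweep: build partial sums in place
    for (int d = 1; d < n; d *= 2) {
        int j = (i + 1) * 2 * d - 1;   // right child of thread i's subtree
        if (j < n) t[j] += t[j - d];
        __syncthreads();
    }

    // Downsweep: set the root to zero, then propagate down the tree
    if (i == 0) t[n - 1] = 0.0f;
    __syncthreads();
    for (int d = n / 2; d >= 1; d /= 2) {
        int j = (i + 1) * 2 * d - 1;
        if (j < n) {
            float x = t[j];
            t[j] = t[j - d] + x;       // right child: parent + left subtree sum
            t[j - d] = x;              // left child: parent value
        }
        __syncthreads();
    }
    if (i < n) s[i] = t[i];            // exclusive prefix sums
}

// Launch example: scan_blelloch<<<1, n, n * sizeof(float)>>>(d_a, d_s, n);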

Analysis Complexity 2 log 2 (n) steps: O(log(n)) Work efficiency Performs 2 n operations Work efficiency = 1/2: O(1) Communications are local away from the root Efficient algorithm in Multi-BSP model 56

Another variation: Brent-Kung Same reduction in the upsweep phase, simpler computation in the downsweep phase: if i mod 2^(d+1) = 2^(d+1) - 2^d then s[i] ← s[i-2^d] + s[i] end if Many different topologies: Kogge-Stone, Ladner-Fischer, Han-Carlson... Different tradeoffs, mostly relevant for hardware implementation 57

Outline of the lecture Why programming models? PRAM BSP Model Example Brent's theorem Motivation Description Example Multi-BSP Algorithm examples Parallel prefix Stencil 58

Stencil Given an n-D grid, compute each element as a function of its neighbors All updates are done in parallel: no dependences Commonly, the stencil operation is repeated in a loop Applications: image processing, fluid flow simulation 59

Stencil: naïve parallel algorithm 1-D stencil of size n, nearest neighbors, m successive steps Processor i: for t = 0 to m-1 do s[i] ← f(s[i-1], s[i], s[i+1]) end for Needs global synchronization at each step Can you do better? 60
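
A minimal CUDA sketch of the naïve scheme (kernel name stencil_step and the 3-point weighting used for f are assumptions): each time step is one kernel launch, so global synchronization comes from the kernel boundary, and double buffering avoids reading values that are being overwritten. Boundary elements are simply left untouched here.

__global__ void stencil_step(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)                                   // interior points only
        out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];  // example f
}

// Host loop: one launch per time step, then swap the buffers
// for (int t = 0; t < m; ++t) {
//     stencil_step<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
//     std::swap(d_in, d_out);
// }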

Ghost zone optimization Idea: trade more computation for less global communication Assign tiles to groups of processors, with overlap: e.g. group P0-P8 holds s0..s8 and group P9-P17 holds s5..s13; the overlapping elements form the ghost zone 61

Ghost zone optimization Initialize all elements, including the ghost zones Perform local stencil computations: at each time step the region of valid elements shrinks inward from the ghost zone Global synchronization only needed periodically, to refresh the ghost zones (communication + synchronization) 62
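
As a starting point, here is a minimal single-step CUDA sketch of tiling with ghost cells (kernel name stencil_tiled, the 256-thread tile size, the zero boundary values and the 3-point weights are all assumptions): each block loads its tile plus one ghost cell on each side into shared memory. The optimization on the slide goes further by widening the ghost zone so that several time steps can be computed per tile before the next global synchronization.

#define TILE 256   // assumes blockDim.x == TILE

__global__ void stencil_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];                    // +2 ghost cells
    int gi = blockIdx.x * blockDim.x + threadIdx.x;     // global index
    int li = threadIdx.x + 1;                           // local index (skip left ghost)

    if (gi < n) tile[li] = in[gi];
    if (threadIdx.x == 0)
        tile[0] = (gi > 0) ? in[gi - 1] : 0.0f;                 // left ghost cell
    if (threadIdx.x == blockDim.x - 1)
        tile[TILE + 1] = (gi + 1 < n) ? in[gi + 1] : 0.0f;      // right ghost cell
    __syncthreads();

    if (gi > 0 && gi < n - 1)
        out[gi] = 0.25f * tile[li - 1] + 0.5f * tile[li] + 0.25f * tile[li + 1];
}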

Wrap-up Computation is cheap Focus on communication and global memory access Between communication/synchronization steps, this is familiar sequential programming Local communication is cheap Focus on global communication Next time: CUDA programming 63