Algorithms for Collective Communication. Design and Analysis of Parallel Algorithms

Source: A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Chapter 4, 2003.

Outline
- One-to-all broadcast
- All-to-one reduction
- All-to-all broadcast
- All-to-all reduction
- All-reduce
- Prefix sum
- Scatter
- Gather
- All-to-all personalized
- Improved one-to-all broadcast
- Improved all-to-one reduction
- Improved all-reduce

Corresponding MPI functions

Operation                  MPI function(s)
One-to-all broadcast       MPI_Bcast
All-to-one reduction       MPI_Reduce
All-to-all broadcast       MPI_Allgather[v]
All-to-all reduction       MPI_Reduce_scatter[_block]
All-reduce                 MPI_Allreduce
Prefix sum                 MPI_Scan / MPI_Exscan
Scatter                    MPI_Scatter[v]
Gather                     MPI_Gather[v]
All-to-all personalized    MPI_Alltoall[v|w]

Topologies
[Figures: an 8-node ring, a 3 × 3 mesh, and a 3-dimensional hypercube, with numbered nodes]

Linear model of communication overhead
A point-to-point message takes time t_s + t_w m, where
- t_s is the latency,
- t_w is the per-word transfer time (inverse bandwidth),
- m is the message size in number of words.
(m and t_w must use compatible units.)
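
As a worked example, a tiny C helper (names and sample values are illustrative, not measurements) that evaluates the model:

    /* Sketch: evaluate the linear cost model t_s + t_w * m of a point-to-point message. */
    #include <stdio.h>

    double ptp_time(double t_s, double t_w, double m) {
        return t_s + t_w * m;   /* latency + per-word time * message size in words */
    }

    int main(void) {
        /* illustrative values: 10 us latency, 1 ns per word, 10^6-word message */
        printf("estimated time: %g s\n", ptp_time(10e-6, 1e-9, 1e6));
        return 0;
    }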

Contention
Assuming bi-directional links: each node can send and receive simultaneously.
Contention arises if a link is used by more than one message at the same time.
k-way contention means that k messages share the link's bandwidth, i.e., the effective per-word time for each message becomes k t_w.

One-to-all broadcast
Input: the message M, stored locally on the root
Output: the message M, stored locally on all processes

One-to-all broadcast: ring
[Figure: 8-node ring; edge labels give the step (1, 2, 3) in which M is forwarded]
Recursive doubling: double the number of active processes (processes that have M) in each step.

One-to-all broadcast: mesh
Use the ring algorithm on the root's mesh row, then use the ring algorithm on all mesh columns in parallel.
[Figure: 4 × 4 mesh with nodes 0-f; edge labels give the step (1-4) in which M is forwarded]

One-to-all broadcast: hypercube
Generalize the mesh algorithm to d dimensions.
[Figure: 3-dimensional hypercube with nodes 0-7; edge labels give the step (1-3) in which M is forwarded]

One-to-all broadcast: algorithm
The algorithms described above are identical on all three topologies.
1: Assume that p = 2^d
2: mask ← 2^d - 1 (set all bits)
3: for k = d-1, d-2, ..., 0 do
4:   mask ← mask XOR 2^k (clear bit k)
5:   if me AND mask = 0 then
6:     (lower k bits of me are 0)
7:     partner ← me XOR 2^k (partner has opposite bit k)
8:     if me AND 2^k = 0 then
9:       Send M to partner
10:      else
11:        Receive M from partner
12:      end if
13:    end if
14: end for
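
Below is a minimal C/MPI sketch of this recursive-doubling broadcast, assuming p = 2^d, root 0, and a one-int message (the payload value is illustrative); the generalizations on the next slide are not handled:

    /* Sketch: one-to-all broadcast by recursive doubling (assumes p = 2^d, root 0). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int M = (me == 0) ? 42 : -1;        /* the "message": only the root holds it */

        int d = 0;
        while ((1 << d) < p) d++;           /* d = log_2 p, assuming p is a power of two */

        int mask = (1 << d) - 1;            /* set all d bits */
        for (int k = d - 1; k >= 0; k--) {
            mask ^= (1 << k);               /* clear bit k */
            if ((me & mask) == 0) {         /* lower k bits of me are 0: active this step */
                int partner = me ^ (1 << k);
                if ((me & (1 << k)) == 0)
                    MPI_Send(&M, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
                else
                    MPI_Recv(&M, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }

        printf("process %d has M = %d\n", me, M);
        MPI_Finalize();
        return 0;
    }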

One-to-all broadcast: generalizations
The algorithm given above is not fully general.
What if p ≠ 2^d? Set d = ⌈log_2 p⌉ and do not communicate if partner ≥ p.
What if the root is not process 0? Relabel the processes: me ← me XOR root.

One-to-all broadcast: time
Number of steps: d = log_2 p
Time per step: t_s + t_w m
Total time: (t_s + t_w m) log_2 p
In particular, note that broadcasting to p^2 processes is only twice as expensive as broadcasting to p processes (log_2 p^2 = 2 log_2 p).

All-to-one reduction
M := M_0 ⊕ M_1 ⊕ M_2 ⊕ M_3
Input: the p messages M_k for k = 0, 1, ..., p-1; the message M_k is stored locally on process k; an associative reduction operator ⊕ (e.g., +, ×, max, min)
Output: the sum M := M_0 ⊕ M_1 ⊕ ... ⊕ M_{p-1}, stored locally on the root

All-to-one reduction: algorithm
Analogous to the one-to-all broadcast algorithm:
- reverse the order of the communications
- reverse the direction of the communications
- combine the incoming message with the local message using ⊕
Analogous time (plus the time to compute a ⊕ b).
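
A matching sketch of the hypercube all-to-one reduction, obtained by reversing the broadcast sketch above; p = 2^d, root 0, one int per process, and ⊕ = + are assumptions:

    /* Sketch: all-to-one reduction (sum) by reversing the recursive-doubling broadcast.
       Assumes p = 2^d and root 0; each process contributes its own rank. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int M = me;                          /* local contribution M_me */
        int d = 0;
        while ((1 << d) < p) d++;

        int mask = 0;                        /* reversed order: start with no bits set */
        for (int k = 0; k < d; k++) {
            if ((me & mask) == 0) {          /* lower k bits of me are 0 */
                int partner = me ^ (1 << k);
                if ((me & (1 << k)) != 0) {  /* reversed direction: now this side sends */
                    MPI_Send(&M, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
                } else {
                    int recv;
                    MPI_Recv(&recv, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    M += recv;               /* combine incoming with local message */
                }
            }
            mask |= (1 << k);                /* set bit k for the next step */
        }

        if (me == 0) printf("sum of ranks = %d (expected %d)\n", M, p * (p - 1) / 2);
        MPI_Finalize();
        return 0;
    }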

All-to-all broadcast
Input: the p messages M_k for k = 0, 1, ..., p-1; the message M_k is stored locally on process k
Output: all p messages M_k, k = 0, 1, ..., p-1, stored locally on all processes

All-to-all broadcast: ring
[Figure: steps 1 and 2 on an 8-node ring; in each step every process sends the most recently received message to its right neighbour and receives a new message from its left neighbour, and so on for p - 1 steps]

All-to-all broadcast: ring algorithm
1: left ← (me - 1) mod p
2: right ← (me + 1) mod p
3: result ← M_me
4: M ← result
5: for k = 1, 2, ..., p-1 do
6:   Send M to right
7:   Receive M from left
8:   result ← result ∪ M
9: end for
The send is assumed to be non-blocking. Lines 6-7 can be implemented via MPI_Sendrecv.
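
As a concrete illustration, a minimal C/MPI sketch of the ring algorithm with one int per process (buffer names and values are illustrative); lines 6-7 become a single MPI_Sendrecv:

    /* Sketch: all-to-all broadcast on a ring using MPI_Sendrecv (one int per process). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int left  = (me - 1 + p) % p;
        int right = (me + 1) % p;

        int *result = malloc(p * sizeof(int));   /* will hold M_0, ..., M_{p-1} */
        result[me] = 100 + me;                   /* this process's own message M_me */

        int M = result[me];
        for (int k = 1; k < p; k++) {
            int incoming;
            /* send the most recently received message to the right,
               receive a new one from the left */
            MPI_Sendrecv(&M, 1, MPI_INT, right, 0,
                         &incoming, 1, MPI_INT, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            M = incoming;
            result[(me - k + p) % p] = M;        /* this message originated k hops to the left */
        }

        printf("process %d now has all %d messages\n", me, p);
        free(result);
        MPI_Finalize();
        return 0;
    }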

All-to-all broadcast: time of the ring algorithm
Number of steps: p - 1
Time per step: t_s + t_w m
Total time: (p - 1)(t_s + t_w m)

All-to-all broadcast: mesh algorithm
The mesh algorithm is based on the ring algorithm:
1. Apply the ring algorithm to all mesh rows in parallel.
2. Apply the ring algorithm to all mesh columns in parallel.

All-to-all broadcast: time of the mesh algorithm
(Assuming a √p × √p mesh for simplicity.)
Apply the ring algorithm to all mesh rows in parallel:
- Number of steps: √p - 1
- Time per step: t_s + t_w m
- Total time: (√p - 1)(t_s + t_w m)
Apply the ring algorithm to all mesh columns in parallel:
- Number of steps: √p - 1
- Time per step: t_s + t_w √p m
- Total time: (√p - 1)(t_s + t_w √p m)
Total time: 2(√p - 1) t_s + (p - 1) t_w m

All-to-all broadcast: hypercube algorithm
The hypercube algorithm is also based on the ring algorithm:
For each dimension of the hypercube in sequence, apply the ring algorithm to the 2^(d-1) links in the current dimension in parallel.
[Figure: the three phases on a 3-dimensional hypercube; in each phase the four links of one dimension are used in parallel]

All-to-all broadcast: time of the hypercube algorithm
Number of steps: d = log_2 p
Time for step k = 0, 1, ..., d-1: t_s + t_w 2^k m
Total time: Σ_{k=0}^{d-1} (t_s + t_w 2^k m) = t_s log_2 p + t_w (p - 1) m

All-to-all broadcast: summary

Topology     t_s            t_w
Ring         p - 1          (p - 1) m
Mesh         2(√p - 1)      (p - 1) m
Hypercube    log_2 p        (p - 1) m

Same transfer time (t_w term), but the number of messages differs.

All-to-all reduction
M_r := M_{0,r} ⊕ M_{1,r} ⊕ M_{2,r} ⊕ M_{3,r}
Input: the p^2 messages M_{r,k} for r, k = 0, 1, ..., p-1; the message M_{r,k} is stored locally on process r; an associative reduction operator ⊕
Output: the sum M_r := M_{0,r} ⊕ M_{1,r} ⊕ ... ⊕ M_{p-1,r}, stored locally on each process r

All-to-all reduction: algorithm
Analogous to the all-to-all broadcast algorithm:
- reverse the order of the communications
- reverse the direction of the communications
- combine the incoming message with the corresponding part of the local message using ⊕
Analogous time (plus the time for computing a ⊕ b).

All-reduce
M := M_0 ⊕ M_1 ⊕ M_2 ⊕ M_3
Input: the p messages M_k for k = 0, 1, ..., p-1; the message M_k is stored locally on process k; an associative reduction operator ⊕
Output: the sum M := M_0 ⊕ M_1 ⊕ ... ⊕ M_{p-1}, stored locally on all processes

All-reduce: algorithm
Analogous to the all-to-all broadcast algorithm, but combine the incoming message with the local message using ⊕.
Cheaper than all-to-all broadcast since the message size does not grow.
Total time: (t_s + t_w m) log_2 p
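
A minimal sketch of this all-reduce for one int with ⊕ = +, assuming p = 2^d; one pairwise exchange per hypercube dimension (in practice one would simply call MPI_Allreduce):

    /* Sketch: all-reduce (sum of one int) by pairwise exchange along each
       hypercube dimension. Assumes p = 2^d. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int M = me;                              /* local contribution */
        for (int k = 1; k < p; k <<= 1) {        /* one step per hypercube dimension */
            int partner = me ^ k;
            int recv;
            MPI_Sendrecv(&M, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            M += recv;                           /* combine with the local value */
        }
        printf("process %d: all-reduce result = %d (expected %d)\n", me, M, p * (p - 1) / 2);
        MPI_Finalize();
        return 0;
    }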

Prefix sum
M^(k) := M_0 ⊕ M_1 ⊕ ... ⊕ M_k
Input: the p messages M_k for k = 0, 1, ..., p-1; the message M_k is stored locally on process k; an associative reduction operator ⊕
Output: the sum M^(k) := M_0 ⊕ M_1 ⊕ ... ⊕ M_k, stored locally on process k, for all k

Prefix sum: algorithm
Analogous to the all-reduce algorithm, with analogous time.
Each process accumulates into its result only the partial sum corresponding to its own prefix, i.e., the contributions of processes with smaller or equal index.
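
A sketch of the hypercube prefix sum for one int with ⊕ = +, assuming p = 2^d. The buffer msg carries the running block sum (as in all-reduce), while result keeps only the local prefix; in practice this is MPI_Scan:

    /* Sketch: inclusive prefix sum (scan) of one int per process on a hypercube.
       Assumes p = 2^d. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int result = me;                 /* prefix sum M^(me), initially M_me */
        int msg    = me;                 /* running sum of the whole block seen so far */

        for (int k = 1; k < p; k <<= 1) {
            int partner = me ^ k;
            int recv;
            MPI_Sendrecv(&msg, 1, MPI_INT, partner, 0,
                         &recv, 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            msg += recv;                 /* the block sum always grows */
            if (partner < me)
                result += recv;          /* only smaller ranks contribute to the prefix */
        }

        printf("process %d: prefix sum = %d (expected %d)\n", me, result, me * (me + 1) / 2);
        MPI_Finalize();
        return 0;
    }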

Scatter
Input: the p messages M_k for k = 0, 1, ..., p-1, all stored locally on the root
Output: the message M_k, stored locally on process k, for all k

Scatter: algorithm
Analogous to the one-to-all broadcast algorithm: send half of the messages in the first step, one quarter in the second step, and so on.
More expensive since several messages are sent in each step.
Total time: t_s log_2 p + t_w (p - 1) m
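
A sketch of this binomial-tree scatter with one int per destination, assuming p = 2^d and root 0 (the buffer layout and sample values are illustrative; in practice one calls MPI_Scatter):

    /* Sketch: scatter of one int per process, mirroring the broadcast algorithm but
       forwarding only half of the currently held messages in each step.
       Assumes p = 2^d and root 0. buf[j] holds the message destined for rank me + j. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int d = 0;
        while ((1 << d) < p) d++;

        int *buf = malloc(p * sizeof(int));
        int num = 0;                               /* number of messages currently held */
        if (me == 0) {                             /* the root holds all p messages */
            num = p;
            for (int j = 0; j < p; j++) buf[j] = 100 + j;
        }

        int mask = (1 << d) - 1;
        for (int k = d - 1; k >= 0; k--) {
            mask ^= (1 << k);
            if ((me & mask) == 0) {
                int partner = me ^ (1 << k);
                int half = 1 << k;
                if ((me & (1 << k)) == 0) {
                    /* send the messages for ranks [me+half, me+num) to partner */
                    MPI_Send(buf + half, num - half, MPI_INT, partner, 0, MPI_COMM_WORLD);
                } else {
                    MPI_Recv(buf, half, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                }
                num = half;                        /* either way, half of the block remains */
            }
        }

        printf("process %d got message %d\n", me, buf[0]);   /* expect 100 + me */
        free(buf);
        MPI_Finalize();
        return 0;
    }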

Gather
Input: the p messages M_k for k = 0, 1, ..., p-1; the message M_k is stored locally on process k
Output: all p messages M_k, stored locally on the root

Gather: algorithm
Analogous to the scatter algorithm, with analogous time:
- reverse the order of the communications
- reverse the direction of the communications

All-to-all personalized
Input: the p^2 messages M_{r,k} for r, k = 0, 1, ..., p-1; the message M_{r,k} is stored locally on process r
Output: the p messages M_{r,k}, r = 0, 1, ..., p-1, stored locally on process k, for all k

All-to-all personalized: summary

Topology     t_s            t_w
Ring         p - 1          (p - 1) m p / 2
Mesh         2(√p - 1)      p (√p - 1) m
Hypercube    log_2 p        m (p/2) log_2 p

The hypercube algorithm is not optimal with respect to communication volume (the lower bound is t_w m (p - 1)).

All-to-all personalized: an optimal (w.r.t. volume) hypercube algorithm
Idea: let each pair of processes exchange messages directly.
Time: (p - 1)(t_s + t_w m)
Q: In which order do we pair the processes?
A: In step k = 1, 2, ..., p-1, let process me exchange messages with process me XOR k.
This can be done without contention!
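
A sketch of the pairwise-exchange algorithm with one int per (source, destination) pair, assuming p = 2^d so that me XOR k is always a valid rank (in practice, MPI_Alltoall):

    /* Sketch: all-to-all personalized exchange by direct pairwise exchanges.
       In step k = 1, ..., p-1, process me swaps messages with partner me XOR k.
       Assumes p = 2^d and one int per (source, destination) pair. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int *sendbuf = malloc(p * sizeof(int));   /* sendbuf[k] = M_{me,k} */
        int *recvbuf = malloc(p * sizeof(int));   /* recvbuf[r] will hold M_{r,me} */
        for (int k = 0; k < p; k++) sendbuf[k] = 1000 * me + k;
        recvbuf[me] = sendbuf[me];                /* the message to myself */

        for (int k = 1; k < p; k++) {
            int partner = me ^ k;                 /* the pairing is symmetric: partner XOR k == me */
            MPI_Sendrecv(&sendbuf[partner], 1, MPI_INT, partner, 0,
                         &recvbuf[partner], 1, MPI_INT, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* now recvbuf[r] == 1000 * r + me for every r */
        printf("process %d received M_{0,%d} = %d\n", me, me, recvbuf[0]);
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }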

All-to-all personalized: an optimal hypercube algorithm
[Figure: the pairings used in steps k = 1, 2, ..., 7 on a 3-dimensional hypercube]

All-to-all personalized: an optimal hypercube algorithm based on E-cube routing
[Figure: the exchange steps on a 3-dimensional hypercube, with each message routed along the dimensions in a fixed order (E-cube routing)]

All-to-all personalized: E-cube routing
Routing from s to t := s XOR k in step k.
The difference between s and t is s XOR t = s XOR (s XOR k) = k.
The number of links to traverse equals the number of 1s in the binary representation of k (the so-called Hamming distance).
E-cube routing: route through the links according to some fixed (arbitrary) ordering imposed on the dimensions.

All-to-all personalized: why does E-cube routing work?
Write k = k_1 XOR k_2 XOR ... XOR k_n such that
- each k_j has exactly one set bit
- k_i ≠ k_j for all i ≠ j
Step i performs r ← r XOR k_i and hence uses only links in a single dimension. Since every node holds exactly one message and all messages cross the same dimension in a given sub-step, each (bi-directional) link carries at most one message per direction, so there is no congestion.
After all n steps we have, as desired: r ← r XOR k_1 XOR ... XOR k_n = r XOR k

All-to-all personalized: E-cube routing example
Route from s = 100_2 to t = 001_2 = s XOR 101_2, i.e., k = 101_2.
Hamming distance (i.e., number of links): 2
Write k = k_1 XOR k_2 = 001_2 XOR 100_2
E-cube route: s = 100_2 → 101_2 → 001_2 = t
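
A small stand-alone sketch (plain C, no MPI) that prints an E-cube route; it fixes the dimension order from the lowest bit upward, which matches the example above (any fixed order works):

    /* Sketch: print the E-cube route from s to t = s XOR k, traversing the set bits
       of k from the lowest dimension upward. */
    #include <stdio.h>

    void ecube_route(unsigned s, unsigned k) {
        unsigned r = s;
        printf("route: %u", r);
        for (unsigned bit = 1; k != 0; bit <<= 1) {
            if (k & bit) {                /* traverse the link in this dimension */
                r ^= bit;
                k ^= bit;
                printf(" -> %u", r);
            }
        }
        printf("\n");
    }

    int main(void) {
        ecube_route(4u, 5u);              /* s = 100_2, k = 101_2: prints 4 -> 5 -> 1 */
        return 0;
    }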

Summary: hypercube algorithms

Operation                  Time
One-to-all broadcast       (t_s + t_w m) log_2 p
All-to-one reduction       (t_s + t_w m) log_2 p
All-reduce                 (t_s + t_w m) log_2 p
Prefix sum                 (t_s + t_w m) log_2 p
All-to-all broadcast       t_s log_2 p + t_w (p - 1) m
All-to-all reduction       t_s log_2 p + t_w (p - 1) m
Scatter                    t_s log_2 p + t_w (p - 1) m
Gather                     t_s log_2 p + t_w (p - 1) m
All-to-all personalized    (t_s + t_w m)(p - 1)

Improved one-to-all broadcast 1. Scatter 2. All-to-all broadcast

Improved one-to-all broadcast: time analysis
Old algorithm: total time (t_s + t_w m) log_2 p
New algorithm:
- Scatter: t_s log_2 p + t_w (p - 1)(m/p)
- All-to-all broadcast: t_s log_2 p + t_w (p - 1)(m/p)
- Total time: 2 t_s log_2 p + 2 t_w (p - 1)(m/p) ≈ 2 t_s log_2 p + 2 t_w m
Effect:
- t_s term: twice as large
- t_w term: reduced by a factor (log_2 p)/2
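
The improved broadcast can be expressed directly with MPI collectives; a sketch assuming the message length is a multiple of p and root 0 (sizes are illustrative):

    /* Sketch: improved one-to-all broadcast = scatter + all-to-all broadcast (allgather).
       Assumes the message length m is divisible by p and root 0. The end result is the
       same as MPI_Bcast of the whole buffer. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int chunk = 4;                          /* m/p words per process (illustrative) */
        int m = chunk * p;

        double *msg = malloc(m * sizeof(double));     /* full message buffer on every process */
        if (me == 0)                                  /* only the root's content matters */
            for (int i = 0; i < m; i++) msg[i] = (double)i;

        double *piece = malloc(chunk * sizeof(double));

        /* 1. Scatter: each process receives its m/p-word piece of the root's message */
        MPI_Scatter(msg, chunk, MPI_DOUBLE, piece, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* 2. All-to-all broadcast: every process collects all the pieces */
        MPI_Allgather(piece, chunk, MPI_DOUBLE, msg, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

        /* msg[] now holds the root's full message on every process */
        free(piece); free(msg);
        MPI_Finalize();
        return 0;
    }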

Improved all-to-one reduction 1. All-to-all reduction 2. Gather

Improved all-to-one reduction: time analysis
Analogous to the improved one-to-all broadcast:
- t_s term: twice as large
- t_w term: reduced by a factor (log_2 p)/2

Improved all-reduce
All-reduce = all-to-one reduction + one-to-all broadcast:
1. All-to-all reduction
2. Gather
3. Scatter
4. All-to-all broadcast
...but the gather followed by the scatter cancel out!

Improved all-reduce 1. All-to-all reduction 2. All-to-all broadcast

Improved all-reduce: time analysis
Analogous to the improved one-to-all broadcast:
- t_s term: twice as large
- t_w term: reduced by a factor (log_2 p)/2
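
Correspondingly, a sketch of the improved all-reduce expressed with MPI collectives, assuming the vector length is a multiple of p (sizes are illustrative); the result matches MPI_Allreduce with MPI_SUM:

    /* Sketch: improved all-reduce = all-to-all reduction (reduce-scatter) followed by
       all-to-all broadcast (allgather). Assumes the vector length is divisible by p. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int me, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const int chunk = 4;                            /* m/p words per process (illustrative) */
        int m = chunk * p;

        double *local = malloc(m * sizeof(double));     /* this process's input vector */
        double *piece = malloc(chunk * sizeof(double)); /* reduced piece owned by this process */
        double *total = malloc(m * sizeof(double));     /* full reduced vector */
        for (int i = 0; i < m; i++) local[i] = me + i;

        /* 1. All-to-all reduction: piece[j] = sum over all ranks r of local_r[me*chunk + j] */
        MPI_Reduce_scatter_block(local, piece, chunk, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* 2. All-to-all broadcast: collect every reduced piece on every process */
        MPI_Allgather(piece, chunk, MPI_DOUBLE, total, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

        /* total[] now equals the element-wise sum of all local[] vectors */
        free(local); free(piece); free(total);
        MPI_Finalize();
        return 0;
    }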