Program Performance Metrics


Program Performance Metrics

The parallel run time ($T_{par}$) is the time from the moment the computation starts to the moment the last processor finishes its execution.

The speedup ($S$) is defined as the ratio of the time needed to solve the problem on a single processor ($T_{seq}$) to the time required to solve the same problem on a parallel system with $p$ processors ($T_{par}$):

$S = T_{seq} / T_{par}$

- relative: $T_{seq}$ is the execution time of the parallel algorithm executing on one of the processors of the parallel computer
- real: $T_{seq}$ is the execution time of the best-known algorithm using one of the processors of the parallel computer
- absolute: $T_{seq}$ is the execution time of the best-known algorithm on the best-known computer

Program Performance Metrics

The efficiency ($E$) of a parallel program is defined as the ratio of the speedup to the number of processors: $E = S / p$.

The cost ($C$) is usually defined as the product of the parallel run time and the number of processors: $C = p \cdot T_{par}$.

The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.
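As a quick worked example (the numbers are mine, purely illustrative): suppose a problem takes $T_{seq} = 100$ s on one processor and $T_{par} = 8$ s on $p = 16$ processors. Then

$$S = \frac{T_{seq}}{T_{par}} = \frac{100}{8} = 12.5, \qquad E = \frac{S}{p} = \frac{12.5}{16} \approx 0.78, \qquad C = p \cdot T_{par} = 16 \cdot 8 = 128 \text{ s}.$$

Since the cost (128 s) exceeds $T_{seq}$ (100 s), this run is not cost-optimal: the 16 processors together spend more time than a single processor would.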

Communication costs in static interconnection networks

Principal parameters:
- startup time ($t_s$)
- per-hop time ($t_h$)
- per-word transfer time ($t_w$)

Routing techniques:
- store-and-forward routing
- cut-through routing

Communication costs depend on the routing strategy.

Store-and-forward routing - the message is sent between processors, and each intermediate processor stores the whole message in its local memory before passing it on. For a message of $m$ words traversing $l$ links:

$t_{comm} = t_s + (m t_w + t_h) l$

Cut-through routing - the message is divided into parts that are forwarded between processors without waiting for the whole message to arrive:

$t_{comm} = t_s + l t_h + m t_w$
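The two cost models are easy to compare numerically. A minimal Python sketch (the function names and parameter values are mine, purely illustrative):

```python
def t_sf(t_s, t_w, t_h, m, l):
    """Store-and-forward: the whole m-word message is buffered at each of l hops."""
    return t_s + (m * t_w + t_h) * l

def t_ct(t_s, t_w, t_h, m, l):
    """Cut-through: only the header pays the per-hop cost; the payload is pipelined."""
    return t_s + l * t_h + m * t_w

# Illustrative parameters (arbitrary time units). SF cost grows with m*l,
# CT cost with m + l, so the gap widens for long messages over many hops.
t_s, t_w, t_h = 50.0, 1.0, 2.0
for m, l in [(10, 1), (10, 8), (1000, 8)]:
    print(f"m={m:4d} l={l}: SF={t_sf(t_s, t_w, t_h, m, l):8.1f}  "
          f"CT={t_ct(t_s, t_w, t_h, m, l):8.1f}")
```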

Basic communication operations
- simple message transfer between two processors
- one-to-all broadcast
- all-to-all broadcast
- one-to-all personalized communication
- all-to-all personalized communication
- circular shift

[Figure: the four collective operations and their duals, shown as message distributions over processors 0, 1, ..., p-1: one-to-all broadcast is dual to single-node accumulation; all-to-all broadcast is dual to multinode accumulation; one-to-all personalized communication is dual to single-node gather; all-to-all personalized communication is dual to multinode gather.]
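These operations map directly onto MPI collectives. A minimal sketch using mpi4py (assuming mpi4py is installed; run with e.g. `mpiexec -n 4 python demo.py`):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

# One-to-all broadcast: the root's message ends up on every processor.
x = comm.bcast("hello" if rank == 0 else None, root=0)

# All-to-all broadcast: every processor ends up with every processor's message.
all_msgs = comm.allgather(rank)                 # [0, 1, ..., p-1] on every rank

# One-to-all personalized (scatter): the root sends a distinct message to each processor.
mine = comm.scatter([f"msg-{i}" for i in range(p)] if rank == 0 else None, root=0)

# All-to-all personalized: processor i sends a distinct message (i, j) to each processor j.
recvd = comm.alltoall([(rank, j) for j in range(p)])

print(rank, x, all_msgs, mine, recvd)
```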

One-to-all broadcast - SF

[Figure: one-to-all broadcast in (a) a ring with an even number of processors and (b) a ring with an odd number of processors; the message propagates from the source in both directions around the ring.]

$t_{one\text{-}to\text{-}all} = (t_s + t_w m) \lceil p/2 \rceil$

One-to-all broadcast - SF

[Figure: one-to-all broadcast in a 4x4 mesh with wraparound (processors 0-15); the message is first broadcast along the source's row, then along all columns in parallel.]

$t_{one\text{-}to\text{-}all} = 2 (t_s + t_w m) \lceil \sqrt{p}/2 \rceil$

One-to-all broadcast - SF

[Figure: one-to-all broadcast in a three-dimensional hypercube with nodes labeled (000)-(111); in each of the $\log p$ steps the message crosses one new dimension.]

$t_{one\text{-}to\text{-}all} = (t_s + t_w m) \log p$

One-to-all broadcast - SF

procedure ONE_TO_ALL_BC(d, my_id, X);
begin
  mask := 2^d - 1;                       (* set all d bits of mask to 1 *)
  for i := d-1 downto 0 do
  begin
    mask := mask XOR 2^i;                (* clear bit i of mask *)
    if (my_id AND mask) = 0 then         (* the lower i bits of my_id are 0 *)
      if (my_id AND 2^i) = 0 then
      begin
        msg_destination := my_id XOR 2^i;
        send X to msg_destination;
      endif
      else
      begin
        msg_source := my_id XOR 2^i;
        receive X from msg_source;
      endelse;
  endfor;
end ONE_TO_ALL_BC

Code of the one-to-all broadcast operation in a d-dimensional hypercube (the processor with label 0 broadcasts its message).
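The same bitmask logic can be checked with a short, runnable simulation. This Python sketch is my own (it models a "send" as a dictionary write instead of real message passing):

```python
def one_to_all_bc(d, msg):
    """Simulate hypercube one-to-all broadcast from node 0 in a 2^d-node cube."""
    p = 1 << d
    data = {0: msg}                      # only the source holds the message
    mask = p - 1
    for i in range(d - 1, -1, -1):       # one communication step per dimension
        mask ^= 1 << i                   # nodes with (my_id & mask) == 0 take part
        for my_id in range(p):
            if my_id & mask == 0 and my_id & (1 << i) == 0:
                data[my_id ^ (1 << i)] = data[my_id]   # "send" to the partner
    return data

# After d = 3 steps, all 8 nodes hold the source's message.
assert one_to_all_bc(3, "X") == {node: "X" for node in range(8)}
```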

One-to-all broadcast - CT

[Figure: one-to-all broadcast in a 16-processor ring under cut-through routing; in each step messages are sent over successively halved distances.]

$t_{one\text{-}to\text{-}all} = (t_s + t_w m) \log p + t_h (p - 1)$

One-to-all broadcast - CT

[Figure: one-to-all broadcast in a 4x4 mesh with wraparound under cut-through routing.]

$t_{one\text{-}to\text{-}all} = (t_s + t_w m) \log p + 2 t_h (\sqrt{p} - 1)$

One-to-all broadcast - CT

[Figure: one-to-all broadcast in a balanced binary tree with processors 0-7 at the leaves.]

$t_{one\text{-}to\text{-}all} = (t_s + t_w m + t_h (1 + \log p)) \log p$

All-to-all broadcast - SF

[Figure: all-to-all broadcast in an 8-processor ring under store-and-forward routing, shown over communication steps 1 through 7; in each step every processor forwards the most recently received message to its neighbor, so after p-1 = 7 steps every processor has accumulated all messages (0..7).]

All-to-all broadcast - SF

procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result);
begin
  left := (my_id - 1) mod p;
  right := (my_id + 1) mod p;
  result := my_msg;
  msg := result;
  for i := 1 to p-1 do
  begin
    send msg to right;
    receive msg from left;
    result := result ∪ msg;
  endfor;
end ALL_TO_ALL_BC_RING;

$t_{all\text{-}to\text{-}all} = (t_s + t_w m)(p - 1)$
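A runnable Python sketch of this ring algorithm (my own simulation; the simultaneous sends of one step are modeled by collecting all in-flight messages first):

```python
def all_to_all_bc_ring(p):
    """Simulate all-to-all broadcast on a p-node ring.

    Node i starts with message {i}; after p-1 steps every node holds {0..p-1}.
    """
    result = [{i} for i in range(p)]     # accumulated messages per node
    msg = [{i} for i in range(p)]        # message each node forwards next
    for _ in range(p - 1):
        # every node sends msg to its right neighbor at the same time
        incoming = [msg[(i - 1) % p] for i in range(p)]
        for i in range(p):
            msg[i] = incoming[i]
            result[i] |= incoming[i]
    return result

assert all(r == set(range(8)) for r in all_to_all_bc_ring(8))
```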

All-to-all broadcast - SF

[Figure: all-to-all broadcast in a 4x4 mesh with wraparound (processors 0-15); first an all-to-all broadcast is performed along each row, so every node holds its row's four messages (e.g. (0..3), (4..7)), then along each column, after which every node holds all messages (0..15).]

All-to-all broadcast - SF

procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result);
begin
  (* phase 1: all-to-all broadcast along the row *)
  left := {...};
  right := {...};
  result := my_msg;
  msg := result;
  for i := 1 to √p - 1 do
  begin
    send msg to right;
    receive msg from left;
    result := result ∪ msg;
  endfor;
  (* phase 2: all-to-all broadcast along the column *)
  left := {...};
  right := {...};
  msg := result;
  for i := 1 to √p - 1 do
  begin
    send msg to right;
    receive msg from left;
    result := result ∪ msg;
  endfor;
end ALL_TO_ALL_BC_MESH;

$t_{all\text{-}to\text{-}all} = 2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$

All-to-all broadcast - SF

[Figure: all-to-all broadcast in a three-dimensional hypercube: (a) initial distribution of messages; (b) distribution before the second step; (c) distribution before the third step; (d) final distribution of messages. In step i, pairs of nodes exchange their accumulated message sets across dimension i, doubling the set size each step until every node holds (0..7).]

$t_{all\text{-}to\text{-}all} = t_s \log p + t_w m (p - 1)$

All-to-all broadcast with reduction - SF

procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result);
begin
  result := my_msg;
  for i := 0 to d-1 do
  begin
    partner := my_id XOR 2^i;
    send result to partner;
    receive msg from partner;
    result := result ⊕ msg;              (* ⊕ is the associative reduction operator *)
  endfor;
end ALL_TO_ALL_BC_HCUBE;

Because the exchanged values are combined by the reduction operator, the message size stays m in every step:

$t_{all\text{-}to\text{-}all} = (t_s + t_w m) \log p$
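A runnable Python sketch of this exchange pattern (my own simulation, using summation as the reduction operator):

```python
def all_to_all_reduce_hcube(d, values, op=lambda a, b: a + b):
    """Simulate the hypercube exchange: after d steps every node holds
    the reduction (here: the sum) of all 2^d initial values."""
    p = 1 << d
    result = list(values)
    for i in range(d):                       # one pairwise exchange per dimension
        nxt = result[:]
        for my_id in range(p):
            partner = my_id ^ (1 << i)
            nxt[my_id] = op(result[my_id], result[partner])
        result = nxt
    return result

# Sum of 0..7 is 28; every node ends up with it.
assert all_to_all_reduce_hcube(3, list(range(8))) == [28] * 8
```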

One-to-all personalized - SF

[Figure: one-to-all personalized communication (scatter) in an 8-processor ring, shown over communication steps 1 through 7; in each step the source sends out the next distinct message, so after p-1 = 7 steps every processor has received the message addressed to it.]

$t_{one\text{-}to\text{-}all\ pers} = (t_s + t_w m)(p - 1)$

One-to-all personalized - SF

[Figure: one-to-all personalized communication in a 4x4 mesh with wraparound; the source first scatters bundles of messages, one per row ((0..3), (4..7), (8..11), (12..15)), along its column, and then each row scatters its bundle into individual messages.]

$t_{one\text{-}to\text{-}all\ pers} = 2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$

All-to-all personalized - SF

[Figure: all-to-all personalized communication in a 6-processor ring, shown over communication steps 1 through 5. Node i starts with the messages (i,0)..(i,5); in each step every node forwards the bundle of messages not yet delivered, keeping the one addressed to itself, so after p-1 = 5 steps every message (i,j) has reached node j.]
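A runnable Python sketch of a simple unidirectional variant of this scheme (my own simulation rather than the exact slide algorithm):

```python
def all_to_all_pers_ring(p):
    """Simulate all-to-all personalized communication on a p-node ring.

    Node i starts with messages (i, j) for all j; in every step each node
    forwards its pending bundle to the right neighbor, which keeps the
    message addressed to itself and queues the rest for the next step.
    """
    pending = [[(i, j) for j in range(p) if j != i] for i in range(p)]
    delivered = [[(i, i)] for i in range(p)]    # each node keeps its own message
    for _ in range(p - 1):
        moving = list(pending)
        pending = [[] for _ in range(p)]
        for i in range(p):
            for msg in moving[(i - 1) % p]:     # receive from the left neighbor
                if msg[1] == i:
                    delivered[i].append(msg)    # addressed to me: keep it
                else:
                    pending[i].append(msg)      # forward in the next step
    return delivered

res = all_to_all_pers_ring(6)
assert all(sorted(res[j]) == [(i, j) for i in range(6)] for j in range(6))
```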

All-to-all personalized - SF

[Figure: all-to-all personalized communication in a 3x3 mesh (processors 0-8). Node i starts with the messages (i,0)..(i,8), grouped into three bundles by the destination's column. Phase 1 performs an all-to-all personalized exchange of the bundles along each row; phase 2 repeats the exchange along each column, after which node j holds exactly the messages (0,j)..(8,j).]

$t_{all\text{-}to\text{-}all\ pers} = (2 t_s + t_w m p)(\sqrt{p} - 1)$
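To close, a small Python sketch collecting several of the store-and-forward cost formulas above into one cost calculator (my transcription of the formulas; the parameter values are illustrative):

```python
import math

# Store-and-forward collective costs on p processors for m-word messages
# (t_h is omitted, as in the SF analyses above).
costs = {
    "one-to-all bc, ring":      lambda p, m, ts, tw: (ts + tw * m) * math.ceil(p / 2),
    "one-to-all bc, hypercube": lambda p, m, ts, tw: (ts + tw * m) * math.log2(p),
    "all-to-all bc, ring":      lambda p, m, ts, tw: (ts + tw * m) * (p - 1),
    "all-to-all bc, hypercube": lambda p, m, ts, tw: ts * math.log2(p) + tw * m * (p - 1),
    "all-to-all pers, mesh":    lambda p, m, ts, tw: (2 * ts + tw * m * p) * (math.sqrt(p) - 1),
}

for name, cost in costs.items():
    print(f"{name:26s} p=64, m=100: {cost(64, 100, 50.0, 1.0):10.0f}")
```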