High Performance Computing


Master Degree Program in Computer Science and Networking, 2014-15
High Performance Computing, 2nd exam session (appello), February 11, 2015

Write your name, surname, student identification number (numero di matricola), and e-mail. The answers can be written in English or in Italian. Please present the work in a legible and readable form. All the answers must be properly and clearly explained.

1) An LC parallel program is described by the following OR-graph. P0 generates an infinite stream. Each stream element is sent to one of P1, P2, P3, P4 with the same probability.

   [OR-graph: P0 sends to P1, P2, P3, P4; P1, P2, P3, P4 send to P5]

P1, P2, P3, P4 are defined as follows, with M = 128K:

   P_i (i = 1..4) ::
      int A[M], B[M];
      channel in input[i] (5); channel out output[i];
      while (true) do
         receive (input[i], A);
         ∀ j = 0..M-1: B[j] = F_i(A[j]);
         send (output[i], B)

Any function F_i (i = 1..4) executes 20 instructions on average. The ideal service time of both P0 and P5 is 10 Mτ.

The program is executed on the multiprocessor architecture specified below. Each module is mapped onto a distinct CMP. The choice of the CMPs onto which the six modules are mapped is left to the program designer.

a) Study the program behavior in terms of communication support and cache management. According to the base latency, and assuming that the service time per instruction is equal to 2τ, evaluate the effective service time and the relative efficiency of the whole computation and of P0, of a generic P_i, and of P5.

b) Determine the parameters p, T_p, R_Q0, T_s for the evaluation of the under-load latency R_Q, assuming that P1, P2, P3, P4 are the most stressed modules.

c) Optional: determine the value of R_Q and re-evaluate the metrics of point a).

2) Explain whether the following statements are true, false, or true under certain conditions:

a) the instruction STORE Ra, Rb, Rc is used for transmitting a word to the communication unit (UC) of a PE; the instruction fails if UC doesn't contain a local memory component;

b) consider the evaluation of the cache-to-cache base latency to be used in the cost model of a given parallel program on a given multiprocessor architecture; such latency is independent of the parallelism degree of the program.

3) Consider a stencil-based data-parallel implementation of the following module operating on streams:

   Q ::
      int A[M], B[M];
      channel in input_stream (1); channel out output_stream;
      < init B >;
      while (true) do
         receive (input_stream, A);
         ∀ i = 0..M-1: B[i] = F(A[i], B[(i + M - 1) % M]);
         send (output_stream, B)

a) Describe the Virtual Processors version in LC.

b) For the actual implementation, discuss the impact of the stencil on the effective parallelism degree. Assume that: i) communications are overlapped, ii) data distribution/collection is not a bottleneck, and iii) the calculation time of function F is much greater than the instruction service time.

Architectural specifications for question 1

Multiple-CMP NUMA-SMP machine with N = 64 processing elements: 16 4-PE CMPs, each with an internal crossbar interconnect and one MINF; D-RISC CPU with 32K + 32K primary cache, 8-word blocks, and 1M inclusive secondary cache; no communication processor; directory-based cache coherence with home-flushing semantics; external interconnect: wormhole single-buffering binary Fat Tree, 1-word flits and links, T_tr = τ; 64-Gigaword local memory for each CMP.

LC run-time support: Rdy-Ack interprocess communication with I/O-based synchronization, exclusive mapping.

1) Solution

a) The best homing strategy for minimizing both contention (p = 3) and communication latency is: each P_i processing node is the home of the VTG shared structures VTG1 and VTG2, corresponding to channels input[i] and output[i] respectively.

The execution of P0's send flushes the A blocks (VTG1) into the C2 of P_i (C2i) through C2C transfers, and invalidates such blocks locally. Since M = 128K and C2 = 1M, there is ample space in C2i to store VTG1, the computed B, and VTG2.

C1i generates 4M/σ1 faults, all served through C1-C2 transfers: M/σ1 for VTG1 reading, M/σ1 for B writing (write-back), M/σ1 for B reading, and M/σ1 for VTG2 writing (write-back). Thus P_i performs only local accesses; main-memory updating is asynchronous and has a negligible impact on performance.

The execution of P5's receive-compute reads VTG2 through C2C transfers when needed.

Rdy-Ack synchronizations are implemented via I/O-based wait-notify operations, which imply no shared-memory operations.

Base latency analysis

The D-RISC code of P_i is shown below. Since the calculation and communication latencies are O(M), all the actions in receive and set ack, and the setup phase of send, have negligible latency and are not expanded:

   < receive >          // target VTG1 pointed to by Rvtg1; σ1-unfolding, write-back
   < set ack >
            CLEAR Rj
   COMPUTE: LOAD Rvtg1, Rj, Ra
            CALL RF, Rret                        // input and output parameters of F in Ra, Rb respectively
            STORE RB, Rj, Rb, don't_deallocate
            INCR Rj
            IF< Rj, RM, COMPUTE
   < send >             // only the message copy is shown; target variable pointed to by Rvtg2; σ1-unfolding, write-back
            CLEAR Rj
   COPY:    LOAD RB, Rj, Rb
            STORE Rvtg2, Rj, Rb
            INCR Rj
            IF< Rj, RM, COPY
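At the LC level, the loop implemented by this D-RISC code corresponds to the following C-style sketch of the P_i body (a minimal illustration, not part of the original solution; the channel handle and the receive_msg/send_msg primitives are hypothetical stand-ins for the LC run-time support, and F_i is a trivial placeholder):

   /* C-style sketch of a generic worker P_i: receive A, compute B element by
      element, then copy/send B on the output channel (the explicit copy is what
      the COPY loop performs, since there is no communication processor). */
   #include <stddef.h>

   #define M (128 * 1024)

   typedef struct { int id; } channel_t;                                                   /* placeholder handle */
   static void receive_msg(channel_t *c, int *d, size_t n)      { (void)c; (void)d; (void)n; } /* stub */
   static void send_msg(channel_t *c, const int *s, size_t n)   { (void)c; (void)s; (void)n; } /* stub */
   static int  F_i(int x) { return x + 1; }                      /* stands for the ~20-instruction F_i */

   static int A[M], B[M];

   void worker_Pi(channel_t *input, channel_t *output)
   {
       for (;;) {
           receive_msg(input, A, M);                   /* < receive >                     */
           for (size_t j = 0; j < M; j++)              /* COMPUTE loop                    */
               B[j] = F_i(A[j]);
           send_msg(output, B, M);                     /* < send >: explicit message copy */
       }
   }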

The ideal service time of P_i is given by:

   T_i = T_i^0 + T_fault ≈ M (9 T_istr + T_F) + (4M/σ1) L_C1-C2(σ1)

Notice that, since there is no communication processor, the send communication overhead (M/σ1 block readings and M/σ1 block writings for the message copy) is paid entirely.

The C1-C2 block transfer (read or write-back) is evaluated as:

   L_C1-C2(σ1) = (σ1 + 2) τ = 10 τ

Thus:

   T_i = 63 M τ

The P_i's are the bottleneck. According to the multiple-server theorem in the presence of bottlenecks, the effective service time of P0 becomes:

   T_0 = (1/4) T_i = 15.75 M τ

According to the multiple-client theorem, the interarrival time to P5, and hence its effective service time, is:

   T_5 = (1/4) T_i = 15.75 M τ

In the base latency case, the performance measures of the single modules and of the whole computation are:

   module   ideal service time   effective service time   relative efficiency
   P0       10 M τ               15.75 M τ                ≈ 0.63
   P_i      63 M τ               63 M τ                   1
   P5       10 M τ               15.75 M τ                ≈ 0.63

b) Under-load analysis

The contention on the most stressed servers C2i is measured by:

   p = 3

All the requested actions (read, write-flush, write-back) operate on a single block, thus the service time of C2i is:

   T_s = σ1 τ = 8 τ

C2i receives the following requests from C20, C25 and C1i:

   1. M/σ1 C2C write-flush operations, with latency L1 = L_C2C-flush(σ1), from C20;
   2. M/σ1 C2C read operations, with latency L2 = L_C2C-read(σ1), from C25;
   3. 2M/σ1 read and 2M/σ1 write-back operations, with latency L3 = L_C1-C2(σ1) = 10 τ, from C1i.

The respective frequencies are:

   π1 = π2 = 1/6,   π3 = 4/6

The base latency is given by:

   R_Q0 = Σ (i = 1..3) π_i L_i
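As a quick numeric cross-check of the point a) figures (a sketch, not part of the original solution; it simply plugs in σ1 = 8 words, T_istr = 2τ, T_F = 20 T_istr and the ideal service times given in the exam text):

   /* Base-latency service times and relative efficiencies of point a).
      All times are coefficients of M*tau. */
   #include <stdio.h>

   int main(void)
   {
       const double sigma1 = 8.0;              /* cache block size (words)           */
       const double T_istr = 2.0;              /* service time per instruction (tau) */
       const double T_F    = 20.0 * T_istr;    /* F_i executes ~20 instructions      */
       const double L_c1c2 = sigma1 + 2.0;     /* C1-C2 block transfer = 10 tau      */

       double T_i = 9.0 * T_istr + T_F + (4.0 / sigma1) * L_c1c2;  /* 63              */
       double T_0 = T_i / 4.0;                 /* multiple-server theorem: 15.75     */
       double T_5 = T_i / 4.0;                 /* multiple-client theorem: 15.75     */

       const double T0_ideal = 10.0, T5_ideal = 10.0;   /* given in the exam text    */
       printf("T_i = %.2f M tau, efficiency %.2f\n", T_i, 1.0);
       printf("T_0 = %.2f M tau, efficiency %.2f\n", T_0, T0_ideal / T_0);
       printf("T_5 = %.2f M tau, efficiency %.2f\n", T_5, T5_ideal / T_5);
       return 0;
   }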

The C2C synchronous flush operation consists of a request phase with a firmware message of length 11 (header, physical address, block words) and a reply phase with a pure-synchronization firmware message of length 1 (header). The C2 accesses are pipelined word by word.

The C2C read operation consists of a request phase with a firmware message of length 3 (header, physical address) and a reply phase with a firmware message of length 9 (header, block words). The C2 accesses are pipelined word by word.

Latencies L1 and L2 are computed as follows for single-buffering communications:

   L = (2 s_req + 2 s_reply + 2d - 6) T_hop

with T_hop = 2τ as the maximum latency in the pipelined path. In all cases, the path is: C2, W, internal net, MINF, WW, external net, WW, MINF, internal net, W, C2, whose distance is:

   d = 10 + d_external-net

The best mapping on the external network consists in using only one half of the Fat Tree, in order to have a distance close to lg2 N. To evaluate d_external-net accurately, we must study the actual mapping. Since P0 and P5 should be symmetric with respect to the P_i's, a reasonable mapping is:

   [mapping figure: P0, P1, P2, P3, P4, P5 placed on the Fat Tree leaves]

The average distance between P0 or P5 and a generic P_i is 5 (formally 1 + lg2 N). Thus:

   d_external-net = 5,   d = 15

   L1 = L2 = 96 τ

   R_Q0 ≈ 39 τ

The mean time T_p between two requests is given by the inverse of the bandwidth requested of C2i:

   T_p = 1 / B_req

We have:

   B_req = (M/σ1) / (4 T_0) + (4M/σ1) / T_i + (M/σ1) / (4 T_5) = 6M / (σ1 T_i)

Notice that, for each P_i, we must consider the interdeparture time of P0 towards that P_i, which is equal to the service time of P_i; analogously for P5. Thus:

   T_p = 84 τ

c) Optional: solving the client-server model with p = 3, T_p = 84 τ, R_Q0 ≈ 39 τ and T_s = 8 τ, we find ρ = 0.19 and

   R_Q / R_Q0 = 1.02

Thus the evaluation of the whole computation and of the single modules, which has been done for the base-latency case in point a), remains valid with very good approximation.
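The under-load parameters of point b) can be cross-checked in the same way (again a sketch, not part of the original solution; ρ and R_Q of point c) come from the course's client-server queueing model and are not recomputed here):

   /* Under-load parameters of point b): L1, L2, R_Q0 and T_p (values in tau). */
   #include <stdio.h>

   int main(void)
   {
       const double T_hop  = 2.0;                    /* per-hop latency (tau)           */
       const double d      = 10.0 + 5.0;             /* path distance: 10 + d_ext = 15  */
       const double sigma1 = 8.0;
       const double T_i    = 63.0;                   /* P_i service time, in M*tau      */

       /* single-buffering message latency: (2 s_req + 2 s_reply + 2d - 6) T_hop */
       double L1 = (2*11 + 2*1 + 2*d - 6) * T_hop;   /* C2C flush: 96 tau               */
       double L2 = (2*3  + 2*9 + 2*d - 6) * T_hop;   /* C2C read : 96 tau               */
       double L3 = sigma1 + 2.0;                     /* C1-C2 transfer: 10 tau          */

       double R_Q0 = (L1 + L2 + 4.0 * L3) / 6.0;     /* pi1 = pi2 = 1/6, pi3 = 4/6      */
       double T_p  = sigma1 * T_i / 6.0;             /* 1/B_req = sigma1*T_i/(6M) per M */

       printf("L1 = L2 = %.0f tau\n", L1);
       printf("R_Q0 ~= %.1f tau\n", R_Q0);
       printf("T_p  = %.0f tau\n", T_p);
       return 0;
   }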

2) a) False. The STORE instruction is used in Memory Mapped I/O mode: the CPU process sees the I/O unit as a set of memory locations referred to by logical addresses (in our case: R[Ra] + R[Rb]). However, it is not necessary that such logical addresses correspond to physical locations of a memory component. The I/O unit can interpret the received information (write, physical address, data word) in the most efficient way for implementing the provided functionality. In the UC case, each received data word is sent on the fly (i.e. without storing it) to the PE interface unit (W) in a single clock cycle.

b) The base latency depends linearly on the inter-node distance. Thus, the statement is true only for architectures having a constant-distance interconnect: in the case of C2C accesses, the only interconnect with constant distance is the crossbar. For any other interconnect (k-ary n-cubes and their particular cases, e.g. rings and 2-D meshes, as well as Fat Trees in any implementation, including the Generalized Fat Tree) the statement is false: the inter-node distance depends on the parallel program structure, the number of processes and their mapping. An example has been seen in Question 1: the mapping of the 6-module graph onto a 16-node Fat Tree.

3) a) The Virtual Processors version is an array VP[M] with a ring logical interconnect:

   parallel VP[M];
   int A[M], B[M];
   channel ring[M], input_stream[M], output_stream[M];

   VP[i], i = 0..M-1 ::
      int A[i], B[i], x;
      channel in ring[(i + M - 1) % M] (1), input_stream[i] (1);
      channel out ring[(i + 1) % M], output_stream[i];
      while (true) do
         < receive (input_stream[i], A[i]) >;
         send (ring[(i + 1) % M], B[i]);
         receive (ring[(i + M - 1) % M], x);
         B[i] = F(A[i], x);
         < send (output_stream[i], B[i]) >

It is necessary that the ring channels (stencil channels) be asynchronous, in order to avoid deadlock; asynchrony degree k = 1 is exactly what is needed. Despite the asynchronous, overlapped communications, the stencil is synchronous, since each VP must wait for the x value before calculating function F. Notice that the Virtual Processors description doesn't include data distribution and collection; even the communications from/onto the input/output streams (the parts enclosed in < >) are not fundamental in the Virtual Processors description.

b) In the actual implementation with n workers, each worker encapsulates a partition of A and of B of size g = M/n. The input stream value A is scattered onto the workers, and the worker results are gathered into the output stream. In the actual stencil implementation, each worker sends the last element B[g-1] of its B partition to the successor worker in the ring, and the value received from the predecessor worker provides the stencil input (the old B value of the preceding position) for the first element of its B partition. Thus, the stencil channels are of type integer (1 word), as in the Virtual Processors version.

Because of assumptions i), ii) and iii), the optimal parallelism degree is evaluated on the sequential version of Q as follows:

   n_0 = M T_F / T_A,   T(n_0) = T_A

where T_A denotes the interarrival time.
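A minimal MPI-style sketch in C of the n-worker ring just described (an illustration, not part of the original solution: the scatter of A and the gather of B are omitted per assumption ii, compute_F is a placeholder for F, and n is assumed to divide M):

   /* n-worker ring stencil of point 3b.  Each worker owns a partition of A and B
      of size g = M/n; for every stream element it exchanges its last (old) B value
      with its ring neighbours and then updates its partition.  The stencil uses
      the OLD values of B, so the update proceeds from the top down. */
   #include <mpi.h>

   #define M (128 * 1024)

   static int compute_F(int a, int b) { return a + b; }   /* placeholder for F */

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);
       int n, me;
       MPI_Comm_size(MPI_COMM_WORLD, &n);
       MPI_Comm_rank(MPI_COMM_WORLD, &me);

       int g = M / n;                                    /* partition size          */
       int succ = (me + 1) % n, pred = (me + n - 1) % n;
       static int A[M], B[M];                            /* only g entries are used */

       /* One stream element (receive/scatter of A and final gather omitted). */
       int from_pred;
       MPI_Sendrecv(&B[g - 1], 1, MPI_INT, succ, 0,      /* send old last element   */
                    &from_pred, 1, MPI_INT, pred, 0,     /* old B of predecessor    */
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       for (int i = g - 1; i >= 1; i--)                  /* old B[i-1] still intact */
           B[i] = compute_F(A[i], B[i - 1]);
       B[0] = compute_F(A[0], from_pred);                /* boundary element        */

       MPI_Finalize();
       return 0;
   }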

As said, the synchronous stencil feature forces each worker to wait for a communication latency L_com(1). With parallelism degree n_0, the service time becomes:

   T(n_0) = g T_F + L_com(1) = M T_F / n_0 + L_com(1) = T_A + L_com(1) > T_A

The degradation is negligible if L_com(1) << T_A.

In general, the effective optimal parallelism degree n_1, i.e. the degree able to guarantee the ideal bandwidth 1/T_A, is such that:

   M T_F / n_1 + L_com(1) = T_A

   n_1 = M T_F / (T_A - L_com(1))

It must be T_A - L_com(1) > 0, otherwise no parallelization is possible.
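As a worked example of these last formulas (a sketch with arbitrary numeric values for T_F, T_A and L_com(1), not taken from the exam):

   /* Optimal parallelism degree without (n_0) and with (n_1) the one-word
      stencil communication latency; the numeric values are arbitrary examples. */
   #include <stdio.h>
   #include <math.h>

   #define M (128 * 1024)

   int main(void)
   {
       double T_F   = 40.0;          /* calculation time of F, in tau (example)      */
       double T_A   = 700000.0;      /* interarrival time, in tau (example)          */
       double L_com = 1000.0;        /* L_com(1), one-word latency, in tau (example) */

       double n0 = (double)M * T_F / T_A;
       if (T_A <= L_com) {
           printf("T_A <= L_com(1): no parallelization possible\n");
           return 0;
       }
       double n1 = (double)M * T_F / (T_A - L_com);

       printf("n_0 = %.1f (rounded up: %.0f)\n", n0, ceil(n0));
       printf("n_1 = %.1f (rounded up: %.0f)\n", n1, ceil(n1));
       return 0;
   }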