DMP. Deterministic Shared Memory Multiprocessing. Presenter: Wu, Weiyi Yale University

Similar documents
What is the Cost of Determinism? Cedomir Segulja, Tarek S. Abdelrahman University of Toronto

Real Time Operating Systems

1 st Semester 2007/2008

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Operating Systems. VII. Synchronization

CIS 4930/6930: Principles of Cyber-Physical Systems

Rollback-Recovery. Uncoordinated Checkpointing. p!! Easy to understand No synchronization overhead. Flexible. To recover from a crash:

Real Time Operating Systems

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

Distributed Transactions. CS5204 Operating Systems 1

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Systems

TDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II

CEC 450 Real-Time Systems

Vector Lane Threading

CHAPTER 5 - PROCESS SCHEDULING

Andrew Morton University of Waterloo Canada

UC Santa Barbara. Operating Systems. Christopher Kruegel Department of Computer Science UC Santa Barbara

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013

Safety and Liveness. Thread Synchronization: Too Much Milk. Critical Sections. A Really Cool Theorem

Embedded Systems Development

Dense Arithmetic over Finite Fields with CUMODP

/ : Computer Architecture and Design

Lecture Note #6: More on Task Scheduling EECS 571 Principles of Real-Time Embedded Systems Kang G. Shin EECS Department University of Michigan

Shared resources. Sistemi in tempo reale. Giuseppe Lipari. Scuola Superiore Sant Anna Pisa -Italy

Simulation of Process Scheduling Algorithms

Fall 2008 CSE Qualifying Exam. September 13, 2008

LSN 15 Processor Scheduling

Latches. October 13, 2003 Latches 1

Notation. Bounds on Speedup. Parallel Processing. CS575 Parallel Processing

Allocate-On-Use Space Complexity of Shared-Memory Algorithms

CMP N 301 Computer Architecture. Appendix C

Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

CSE 380 Computer Operating Systems

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

Origami: Folding Warps for Energy Efficient GPUs

Transactional Information Systems:

Deadlock Ezio Bartocci Institute for Computer Engineering

Counters. We ll look at different kinds of counters and discuss how to build them

Lab Course: distributed data analytics

ECE 571 Advanced Microprocessor-Based Design Lecture 10

INF 4140: Models of Concurrency Series 3

Parallel Performance Theory - 1

Extending Parallel Scalability of LAMMPS and Multiscale Reactive Molecular Simulations

Deadlock. CSE 2431: Introduction to Operating Systems Reading: Chap. 7, [OSC]

Performance and Scalability. Lars Karlsson

arxiv: v1 [cs.dc] 9 Feb 2015

Module 5: CPU Scheduling

arxiv: v2 [cs.dc] 20 Nov 2017

Abstractions and Decision Procedures for Effective Software Model Checking

Real-Time Scheduling and Resource Management

Logic Design II (17.342) Spring Lecture Outline

Chapter 6: CPU Scheduling

Energy-efficient Mapping of Big Data Workflows under Deadline Constraints

Speculative Parallelism in Cilk++

Loop Parallelization Techniques and dependence analysis

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 17: "Introduction to Cache Coherence Protocols" Invalidation vs.

6. Finite State Machines

Clojure Concurrency Constructs, Part Two. CSCI 5828: Foundations of Software Engineering Lecture 13 10/07/2014

Towards a Practical Secure Concurrent Language

Analytical Modeling of Parallel Programs. S. Oliveira

Variants of Turing Machine (intro)

THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University

CPU Scheduling. Heechul Yun

Module 9: Addendum to Module 6: Shared Memory Multiprocessors Lecture 18: Sharing Patterns and Cache Coherence Protocols. The Lecture Contains:

TDDI04, K. Arvidsson, IDA, Linköpings universitet CPU Scheduling. Overview: CPU Scheduling. [SGG7] Chapter 5. Basic Concepts.

CPU SCHEDULING RONG ZHENG

Nondeterminism. September 7, Nondeterminism

Part I: Definitions and Properties

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Solution of Linear Systems

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

Reducing NVM Writes with Optimized Shadow Paging

Complexity Theory Part I

Marwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation

Highly-scalable branch and bound for maximum monomial agreement

Appendix A. Formal Proofs

CS 6112 (Fall 2011) Foundations of Concurrency

Special Nodes for Interface

Models: Amdahl s Law, PRAM, α-β Tal Ben-Nun

Synchronous Reactive Systems

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks

Scheduling I. Today Introduction to scheduling Classical algorithms. Next Time Advanced topics on scheduling

CSE 548: Analysis of Algorithms. Lecture 12 ( Analyzing Parallel Algorithms )

2/5/07 CSE 30341: Operating Systems Principles

CS/IT OPERATING SYSTEMS

Mark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions

Clock-driven scheduling

How to deal with uncertainties and dynamicity?

Embedded Systems Development

Lightweight Real-Time Synchronization under P-EDF on Symmetric and Asymmetric Multiprocessors

Quantum Cluster Simulations of Low D Systems

Design Patterns and Refactoring

Adders allow computers to add numbers 2-bit ripple-carry adder

Review for the Midterm Exam

Recap from the previous lecture on Analytical Modeling

Scheduling I. Today. Next Time. ! Introduction to scheduling! Classical algorithms. ! Advanced topics on scheduling

Transcription:

DMP Deterministic Shared Memory Multiprocessing 1 Presenter: Wu, Weiyi Yale University

Outline What is determinism? How to make execution deterministic? What s the overhead of determinism? 2

What Is Determinism? The SAME input, the SAME output The same execution sequence (with Linearizability) (strong) The same resource-accessing ordering (weak) 3

(Weak) Determinism ld X ld X ld X ld Y ld Y st S st S ld Y Time Flow st S st T st X st T st T Data Dependency st X st X st Y st Y st Y Concurrent Program Execution 1 Execution 2 4

Non-Determinism ld X ld X ld X ld Y ld Y st S st S ld Y Time Flow st S st T st T st T st X Communications st X st X st Y st Y st Y Concurrent Program Execution 1 Execution 2 5

Non-determinism in Non-Determinism 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1M barnes 2M 3M 4M 5M Non-Determinism 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Practice 1M ocean-contig 2M 3M 4M 5M Time (Insns.) Time (Insns.) Figure 3. Amount of nondeterminism over the execution of barnes and ocean-contig. The x axis is the position in the execution where each sample of 100,000 instructions was taken. The y axis is ND (Eq. 1) computed for each sample in the execution. 6

Is Record n Replay Deterministic? Yes! When the log is replayed deterministically, the replayed execution is definitely deterministic No... It cannot deploy to any machines and still run deterministically for any input 7

How to Execute Deterministically? Eliminating all instruction communications Deterministically arrange all communications Read-after-Write Write-after-Write, Write-after-Read 8

DMP-Serial No concurrency! Make it single-threaded! Programs are divided into quanta All processors act as one Only execute when hold token Deterministically pass token when quanta end 9

Example of DMP-Serial ld X ld X ld Y ld X ld Y st S Time Flow st S st S ld Y st T st X st T st T Token Pass st Y st X st Y st X st Y Quantum Concurrent Program Small Quanta Bigger Quanta 10

Recovering Parallelism DMP-Serial is way too strong Weak determinism is also acceptable Instructions without communication can execute concurrently Deterministically schedule communications 11

DMP-ShTab Break quantum into parallel prefix and serial suffix (dynamically), do prefix concurrently Deterministically control shared memory access Only read from Shared memory position Only write to Private memory position 12

Example of DMP-ShTab ld X ld Y ld X ld X st S st S ld Y st S ld Y Time Flow st T st T st T st X st X st Y Token Pass st Y st X st Y Quantum Concurrent Program DMP-Serial DMP-ShTab 13

Similarity to MESI coherence protocol Probe Write Hit Reset Invalid Read Miss, Exclusive Exclusive Read Hit Read Miss, Shared Probe Write Hit Probe Read Hit Probe Write Hit Write Hit Read Hit Probe Read Hit Shared Probe Read Hit Write Hit Write Miss Modified Read Hit Write Hit 14

Similarity to MESI coherence protocol Probe Write Hit Reset Invalid Read Miss, Exclusive Exclusive Read Hit Read Miss, Shared Probe Write Hit Probe Read Hit Probe Write Hit Write Hit Read Hit Probe Read Hit Shared Probe Read Hit Write Hit Write Miss Modified Read Hit Write Hit Shared Private 14

DMP-TM Atomicity, isolation and deterministic total order of quanta will guarantee determinism Make quantum transaction to explicitly execute concurrently Deterministically commit Abort and re-execute latter quantum when conflicting 15

DMP-TMFwd Read-after-Write will be no conflict if propagating modified values in time Forward writes across transactions Need to abort infected transactions when aborting 16

Example of DMP-TM* N memory operation deterministic token passing deterministic order atomic quantum uncommitted value flow P0 P1 P0 P1 1 ld B st C (squash) st C 1 ld B 2 3 ld T st C 2 3 ld B ld B 4 (a) Pure TM ld B 4 (b) TM w/ uncommitted value flow Figure 7. Recovering parallelism by executing quanta as memory transactions (a). Avoiding unnecessary squashes with un-committed data forwarding (b). 17

How to build quanta? Split code evenly - QB-Count The less waiting time, the better No spinning on lock - QB-SyncFollow (token is enough for synchronization) End quantum when finishing working on shared data others wait for - QB-Sharing 18

Example of QB-SyncFollowing forward immediately if a thread starts spinning on a lock. memory operation deterministic token passing N deterministic order atomic quantum P0 P1 P0 P1 lock L lock L 1 unlock L lock L... spin... (wasted work) 2 1 unlock L lock L (grab) 2 3 grab lock L 3 4 (a) Regular quantum break 4 (b) Better quantum break Figure 8. Example of a situation when better quantum breaking policies leads to better performance. 19

Example of QB-Sharing P0 P1 P0 P1 P0 P1 (a) Parallel execution (b) Deterministic serialized execution at a fine-grain (c) Deterministic serialized execution at a coarse-grain Figure 4. Deterministic serialization of memory operations. Dots are memory operations and dashed arrows are happens-before synchronization. 20

Overhead & Scalability 3.0 10.7 6.2 3.7 3.4 10.2 6.1 3.6 Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial 3.8 3.1 10.3 6.8 3.6 11.0 7.0 3.8 8.4 5.2 3.1 5.6 3.2 6.4 4.3 12.1 7.0 3.8 3.2 4.4 3.3 11.8 6.9 3.8 6.7 4.6 runtime normalized to non-deterministic parallel execution 2.5 2.0 1.5 1.0 4 8 barnes cholesky 4 8 16 4 8 fft(p) 16 4 fmm 8 4 lu-nc 8 ocean-c(p) 4 8 16 4 8 16 radix(p) 4 8 vlrend 16 water-sp 4 8 16 4 8 16 SPLASH 4 8 16 blacksch 4 8 16 bodytr 4 8 fluid 16 4 8 16 strmcl 4 8 16 PARSEC gmean gmean Figure 9. Runtime overheads with 4, 8 and 16 threads. (P) indicates page-level conflict detection; line-level otherwise. up over word-granularity 500% 450% 400% 350% 300% 250% 200% 150% 100% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab 21 eedup over QB-Count 180% 160% 140% 120% 100% 80% 60% 40% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial

Quanta Sensitivity Figure 9. Runtime overheads with 4, 8 and 16 threads. (P) indicates page-level conflict detection; line-level otherwise. however, Hw-DMP-Serial is less affected by quantum size. % speedup over word-granularity 500% 450% 400% 350% 300% 250% 200% 150% 100% 50% 0% -50% -100% % speedup over 1,000-insn quanta 40% 20% 0% -20% -40% -60% -80% -100% fft 2 X C lu-nc lu-nc ocean-c 2 X C vlrend radix SPLASH gmean vlrend 2 X C SPLASH gmean Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-TMFwd Hw-DMP-Serial Hw-DMP-TM Hw-DMP-ShTab 2 X C strmcl 2 X C PARSEC gmean Figure 10. Performance of 2,000 (2), 10,000 (X) and Figure100,000 11. Performance (C) instruction of page-granularity quanta, relative to conflict 1,000 detectiontion relative quanta. to instruc- line-granularity. bodytr fluid PARSEC gmean strmcl % speedup over QB-Count 180% 160% 140% 120% 100% 80% 60% 40% 20% 0% -20% -40% erate progress along the application s critical path. With 10,000-instruction quanta (Figure 13), the effects of the different quantum builders are more pronounced. In general, the quantum builders that take program synchronization into account (QB-SyncFollow and QB-SyncSharing) outperform those that do not. Hw-DMP-Serial and Hw-DMP-ShTab perform better with smarter quantum building, while the TMs sf ss barnes s sf ss radix(p) s sf ss SPLASH gmean Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial s sf ss bodytr s sf ss strmcl s sf ss PARSEC gmean Figure 12. Performance of QB-Sharing (s), QB- SyncFollow (sf) and QB-SyncSharing (ss) quantum builders, relative to QB-Count, with 1,000-insn quanta. eedup over QB-Count 22 180% 160% 140% 120% 100% 80% 60% 40% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial

Software-DMP 8.5 8.0 7.5 7.0 2 threads 4 threads 8 threads Sw-DMP-ShTab runtime normalized to nondeterministic parallel execution 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 barnes fft fmm lu-c radix water-sp SPLASH gmean Figure 14. Runtime of Sw-DMP-ShTab relative to nondeterministic execution. y ability of Swity and energy trade-offs; (2) support for debugging; (3) interaction with operating system and I/O nondeterminism; 23 and

Conclusion (First?) implementation of DMP Demonstration of low-overhead determinism for concurrent programs Not pervasive No deterministic interrupts Can not save nondeterministic hardwares

Thanks~ 25