Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism


Transcription:

Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism
Nima Honarmand, Nathan Dautenhahn, Josep Torrellas and Samuel T. King (UIUC)
Gilles Pokam and Cristiano Pereira (Intel)
iacoma.cs.uiuc.edu

Record-and-Replay (RnR)
- Record the execution of a parallel program or a whole machine
  - Save non-deterministic events in a log
- During replay, use the recorded log to enforce the same execution
  - Each thread follows the same sequence of instructions
- Use cases: debugging, security, high availability

Contribution: the Cyrus RnR System
- Application-level RnR
  - RnR one or more programs in isolation
  - What users typically need
- Fast replay
  - Replay-time parallelism
  - Flexibly trades off parallelism for log size
- Unintrusive hardware
  - No changes to snoopy cache coherence protocols

Capturing Non-determinism
- Sources of non-determinism:
  - Program inputs
  - Memory access interleavings
- How to capture them:
  - An OS kernel extension captures program inputs
  - Hardware support captures memory interleavings (HW-assisted RnR)
- This talk focuses on recording memory interleavings

Recording Interleavings as Chunks
- Inter-processor data dependences manifest as coherence messages
- Capture interleavings as ordered chunks of instructions
[Figure: P0 executes (add, store A, mul, sub) while P1 executes (div, load A, add); P1's load of A triggers a coherence request/response, which splits the execution into ordered chunks.]
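The chunking idea above can be sketched in software. This is a minimal simulation with invented names (`Recorder`, `access`); in Cyrus the equivalent work is done in hardware by observing coherence messages, not by scanning other processors' state:

```python
# Sketch of chunk-based interleaving recording: each processor accumulates
# instructions into its current (open) chunk; when another processor's access
# conflicts with that chunk (which would raise a coherence message), the chunk
# is closed and an ordering edge is logged.

class Recorder:
    def __init__(self, num_procs):
        self.chunks = {p: [] for p in range(num_procs)}   # closed chunks per processor
        self.current = {p: [] for p in range(num_procs)}  # open chunk: list of (op, addr)
        self.log = []                                     # ordering edges (src_chunk_id, dst_proc)

    def access(self, proc, op, addr):
        # A conflict with another processor's open chunk (write vs. any access)
        # models the coherence transaction that terminates the source chunk.
        for other, insts in self.current.items():
            if other == proc:
                continue
            conflict = any(a == addr and (op == "store" or o == "store")
                           for o, a in insts)
            if conflict:
                self._close_chunk(other, successor=proc)
        self.current[proc].append((op, addr))

    def _close_chunk(self, proc, successor):
        chunk_id = (proc, len(self.chunks[proc]))
        self.chunks[proc].append(self.current[proc])
        self.current[proc] = []
        self.log.append((chunk_id, successor))

rec = Recorder(2)
rec.access(0, "store", "A")   # P0 writes A
rec.access(1, "load", "A")    # P1 reads A: RAW dependence closes P0's chunk
print(rec.log)                # [((0, 0), 1)]
```

Note that the log entry names only the source chunk and the destination *processor*, not the destination chunk; resolving that is exactly the backend's job.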

Restriction: Unintrusive Hardware
- Must work with unmodified snoopy protocols
- In some coherence transactions there is no reply message
[Figure: RAW, WAW, and WAR transactions between P0 and P1; only the RAW read returns data, while the WAW and WAR invalidations get no explicit reply.]
- Requirements for HW-assisted RnR:
  - Do not augment or add coherence messages
  - Do not rely on explicit replies
- Only the source of a dependence is always aware of it, so Cyrus uses source-only recording

Challenge 1: Enable Replay Parallelism
- Key to fast replay: overlapped replay of chunks from different threads
- Previous work: DAG-based ordering (Karma [ICS 2011])
  - But it requires explicit replies and augments coherence messages

Challenge 1: Enable Replay Parallelism (cont.)
[Figure: with source-only recording, each dependence's predecessor chunk is known at recording time, but its successor chunk on the other processor is not.]

Challenge 2: Application-Level RnR
- Turn the hardware on only when a recorded application runs
- Four cases for a coherence transaction between a source and a destination processor:
  (1) src monitored, dst monitored
  (2) src monitored, dst not monitored
  (3) src not monitored, dst monitored
  (4) src not monitored, dst not monitored
- Issues with source-only recording:
  - Cannot distinguish between (1) and (2), and (2) may result in a dependence later
  - Nothing is recorded in (3) and (4)

Challenge 2: Application-Level RnR (cont.)
- Treat (2) as an Early Dependence: defer it and assign it to the next chunk of the target processor
- (3) and (4) are superseded by context switches: at a context switch, record a Serialization Dependence to all other processors
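The Early Dependence fix-up can be sketched as follows. The data layout here is an assumption for illustration; the real backend operates on the RRU's log format:

```python
# Sketch of resolving an Early Dependence: a dependence recorded while the
# destination processor was not yet running the monitored application is
# deferred and attached to that processor's *next* chunk.

def resolve_early(dep_ts, dst_chunks):
    # dst_chunks: list of (chunk_id, (ts_lo, ts_hi)) in program order.
    for cid, (lo, hi) in dst_chunks:
        if lo > dep_ts:          # first chunk that begins after the dependence
            return cid
    return None                  # no later chunk: the dependence can be dropped

dst = [("C10", (100, 150)), ("C11", (200, 260))]
print(resolve_early(170, dst))   # 'C11'
```

Deferring to the next chunk is safe because it only over-constrains the replay order; it never misses an ordering that actually occurred.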

Key Idea: On-the-Fly Backend Software Pass
- Recording processors produce a source-only log; an on-the-fly backend transforms it into a DAG of chunks for the replaying processors
- The backend:
  - Transforms the source-only log into a DAG of chunks (for replay parallelism)
  - Fixes the Early and Serialization dependences, to support application-level RnR
  - Can trade replay parallelism for log size

Memory Race Recording Unit (RRU)
- Hardware module that observes coherence transactions and cache evictions
- Tracks the loads and stores of the current chunk in a signature
- Keeps signatures for multiple recent chunks
- Records for each chunk:
  - Number of instructions
  - Timestamp (number of coherence transactions)
  - Dependences for which the chunk is the source
- Dumps recorded chunks into a log in memory
[Figure: the RRU sits next to each processor's cache on the bus, observing memory references, evictions, and snoops.]
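The per-chunk signature is in the spirit of a Bloom filter: a fixed-size bit vector that superset-encodes a chunk's read/write sets, so the RRU can cheaply test whether a snooped address may belong to the chunk. A minimal sketch, with invented hash functions and sizes (the actual RRU signature design may differ):

```python
# Bloom-filter-style address signature: no false negatives, only false
# positives, which at worst add extra (but still safe) dependence edges.

SIG_BITS = 256

def sig_hashes(addr):
    # Two simple hash functions over the address (illustrative only).
    return [(addr * 2654435761) % SIG_BITS, (addr ^ (addr >> 7)) % SIG_BITS]

class Signature:
    def __init__(self):
        self.bits = 0

    def insert(self, addr):
        for h in sig_hashes(addr):
            self.bits |= 1 << h

    def may_contain(self, addr):
        return all(self.bits & (1 << h) for h in sig_hashes(addr))

sig = Signature()
sig.insert(0x1000)
print(sig.may_contain(0x1000))   # True
```

Because the signature is lossy in only one direction, a "hit" may be spurious but a "miss" is definitive, which is exactly the property needed for conservative dependence recording.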

RRUs Record a Source-Only Log
[Figure: timeline example with timestamps 100-300. P0 executes chunks C00 (Rd A, Wr B) and C01; P1 executes C10 (Wr A, Rd D); P2 executes C20 (Wr D, Rd B). The accompanying table shows, for each chunk, its timestamp and a successor vector with one entry per processor, recording the dependences for which the chunk is the source.]

Backend Pass Creates the DAG
- Finds the target chunk for each recorded dependence
- Creates bidirectional links between the source and destination chunks
- This algorithm is called MaxPar
[Figure: the timestamped chunks C00 and C01 (P0), C10 (P1), and C20 (P2) linked into a DAG.]
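The target-finding step of the backend can be sketched as below. The log layout (per-chunk timestamp range plus a successor vector of dependence timestamps) is an assumption made for illustration, and the example numbers are invented:

```python
# Sketch of the MaxPar backend pass: for each dependence in a chunk's
# successor vector, find the destination processor's chunk whose timestamp
# range covers (or follows) the dependence, and link source -> destination.

def build_dag(chunks):
    # chunks: {proc: [ {"id":..., "ts":(lo,hi), "succ":{dst_proc: timestamp}} ]}
    edges = []
    for proc, lst in chunks.items():
        for c in lst:
            for dst_proc, ts in c["succ"].items():
                for d in chunks[dst_proc]:      # chunks are in timestamp order
                    if d["ts"][1] >= ts:        # first chunk not before the dependence
                        edges.append((c["id"], d["id"]))
                        break
    return edges

chunks = {
    0: [{"id": "C00", "ts": (100, 150), "succ": {1: 250}},
        {"id": "C01", "ts": (200, 200), "succ": {}}],
    1: [{"id": "C10", "ts": (250, 250), "succ": {2: 300}}],
    2: [{"id": "C20", "ts": (300, 300), "succ": {}}],
}
print(build_dag(chunks))   # [('C00', 'C10'), ('C10', 'C20')]
```

During replay, a chunk can start as soon as all of its incoming edges are satisfied, which is what allows chunks with no mutual dependences to replay in parallel.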

Trading Replay Parallelism for Log Size
[Figure: the same chunk DAG emitted at three parallelism levels. MaxPar keeps all chunks with full predecessor/successor timestamp vectors (most parallelism, largest log). Stitching consecutive chunks of a processor, e.g. C00 + C01, gives less parallelism and a smaller log. Fully serializing the chunks gives no parallelism and the smallest log, where each entry needs only a thread ID and a chunk size.]
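The stitching transformation can be sketched as follows, under an assumed representation (ordered chunk IDs per processor plus the DAG's dependence edges); the real backend applies this to the log it is already streaming:

```python
# Sketch of chunk stitching: consecutive chunks of one processor that are
# not the target of any dependence can be merged into one replay unit,
# shrinking the log at the cost of replay parallelism.

def stitch(chunks, edges):
    # chunks: {proc: [chunk ids in program order]}; edges: (src, dst) dependences.
    incoming = {dst for _, dst in edges}
    merged = {}
    for proc, ids in chunks.items():
        groups, cur = [], [ids[0]]
        for cid in ids[1:]:
            if cid in incoming:          # a dependence targets this chunk:
                groups.append(cur)       # it must start a new replay unit
                cur = [cid]
            else:
                cur.append(cid)          # safe to stitch into the previous unit
        groups.append(cur)
        merged[proc] = groups
    return merged

chunks = {0: ["C00", "C01"], 1: ["C10"], 2: ["C20"]}
edges = [("C00", "C10"), ("C10", "C20")]
print(stitch(chunks, edges))
# {0: [['C00', 'C01']], 1: [['C10']], 2: [['C20']]}
```

Stitching more aggressively (up to full serialization) keeps discarding ordering detail the replayer would have exploited, which is why log size and replay parallelism trade off against each other.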

Evaluation
- Simics full-system simulation, including the OS
- Wrote a Linux kernel module that:
  - Records application inputs
  - Controls the RRUs
- Modeled 8 + 1 processors: 8 for the application, 1 for the backend
- 10 SPLASH-2 benchmarks

Replay Time Normalized to Recording
[Figure: log-scale bars comparing the replay times of rep-maxpar, rep-stitched, rep-stserial, and rep-serial, normalized to recording time.]
- Large difference between MaxPar and Serial replay
- On 8 processors, unoptimized MaxPar replay is only 50% slower than recording

Conclusions
- Cyrus: an RnR system that supports:
  - Application-level RnR
  - Unintrusive hardware
  - Flexible replay parallelism
- Key idea: an on-the-fly software backend pass
- On 8 processors:
  - Large difference between MaxPar and Serial replay
  - Unoptimized MaxPar replay is only 50% slower than recording
  - Negligible recording overhead
- An upcoming ISCA '13 paper describes our FPGA RnR prototype