Reducing NVM Writes with Optimized Shadow Paging


Reducing NVM Writes with Optimized Shadow Paging
Yuanjiang Ni, Jishen Zhao, Daniel Bittman, Ethan L. Miller
Center for Research in Storage Systems, University of California, Santa Cruz

Emerging Technology
Memory: byte-addressable, high speed, volatile, small capacity.
Storage: block-addressable, slow, durable, large capacity.
BNVM sits between the two, combining properties of both.

New Storage Architecture
DRAM: accessed with cache-line load/store.
BNVM: accessed with cache-line load/store, plus cache-line flush for durability.
HDD/SSD: accessed at page granularity through read()/write(), fsync(), etc.

Crash Consistency
Initially A = 1,000,000 and B = 1,000,000. A transaction transfers money:
XBEGIN; A.account -= 500,000; B.account += 500,000; XEND.
If a crash hits after the first store, we are left with A = 500,000 and B = 1,000,000: A and B lost money! Crash consistency is a must.
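The crash window above can be sketched as follows. This is a minimal illustrative simulation, not the authors' code; the names (`accounts`, `transfer`, `CrashError`) are invented for the example.

```python
class CrashError(Exception):
    """Simulates power loss between two persistent stores."""

accounts = {"A": 1_000_000, "B": 1_000_000}

def transfer(src, dst, amount, crash_midway=False):
    # Without a transaction (XBEGIN/XEND), each store persists on its own.
    accounts[src] -= amount          # first store reaches NVM
    if crash_midway:
        raise CrashError()           # power fails here
    accounts[dst] += amount          # second store never happens

try:
    transfer("A", "B", 500_000, crash_midway=True)
except CrashError:
    pass

# After the "crash": A = 500,000 but B is still 1,000,000 -- money vanished.
print(accounts)
```

A transactional mechanism must make the two stores appear atomic so that a crash leaves either both or neither visible.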


Opportunities
Leverage byte-addressability, e.g., fine-grained logging.
Leverage virtual memory: indirection is necessary for many techniques. Can we directly leverage virtual memory indirection?
Explore hardware support: Intel has proposed instructions such as clwb specifically for persistent memory. What other hardware support is possible?


Inefficiencies of Existing Approaches
Extra writes to NVM are bad for both performance and endurance.
Logging writes the actual data twice: once to the log and once in place.
Shadow paging avoids the double write, but must copy unmodified data from the old page into the shadow page.
Our approach, OSP, keeps two physical pages (P0, P1) per page under update and avoids both kinds of extra writes.

Cache-line Level Mapping
Can we track modifications at cache-line level? We can't simply reduce the page size!

Cache-line Level Mapping
Two bits per cache line:
Committed bit - where is the old state?
Updated bit - has this cache line been updated?
These bits are required only while pages are being actively updated!


TLB Extension
Wider TLB entry: committed bitmap, updated bitmap, and an additional PPN.
Minimal impact on run-time performance: the extra logic requires only a few gate delays, and it operates in parallel with cache access (e.g., in VIPT caches).
The PTE need not change: the additional information is required only while pages are actively being updated.

Example
A wider TLB entry for virtual page V mapped to physical pages P0 and P1:
VPN = V, PPNs = P0, P1, Committed = 1010, Updated = 0000.

Read the cache line 0
Reads come from P(committed_bit XOR updated_bit).
TLB entry: VPN = V, PPNs = P0, P1, Committed = 1010, Updated = 0000.

Update the cache line 0
Writes go to P(committed_bit XOR 1), and the line's updated bit is set.
TLB entry: VPN = V, PPNs = P0, P1, Committed = 1010, Updated = 1000.

Update the cache line 1
Writes go to P(committed_bit XOR 1), and the line's updated bit is set.
TLB entry: VPN = V, PPNs = P0, P1, Committed = 1010, Updated = 1100.

Commit
committed bitmap = committed bitmap XOR updated bitmap, then clear the updated bitmap.
Before: Committed = 1010, Updated = 1100. After: Committed = 0110, Updated = 0000.

Abort
Clear the updated bitmap.
Before: Committed = 1010, Updated = 1100. After: Committed = 1010, Updated = 0000.
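The read/write/commit/abort rules from the example slides can be captured in a small simulation. This is an illustrative sketch (the class and method names are invented, not the authors' design), using 4 cache lines per page and the bitmaps from the example:

```python
LINES = 4  # cache lines per page, kept small for illustration

class OspEntry:
    """Simulated wider TLB entry: two PPNs plus two bitmaps."""

    def __init__(self):
        self.committed = [1, 0, 1, 0]   # "1010" from the example slide
        self.updated = [0, 0, 0, 0]

    def read_from(self, line):
        # Reads come from P(committed_bit XOR updated_bit).
        return self.committed[line] ^ self.updated[line]

    def write_to(self, line):
        # Writes go to P(committed_bit XOR 1); mark the line updated.
        self.updated[line] = 1
        return self.committed[line] ^ 1

    def commit(self):
        # committed bitmap ^= updated bitmap; clear the updated bitmap.
        self.committed = [c ^ u for c, u in zip(self.committed, self.updated)]
        self.updated = [0] * LINES

    def abort(self):
        # Discard uncommitted writes by clearing the updated bitmap.
        self.updated = [0] * LINES

e = OspEntry()
assert e.read_from(0) == 1          # old data for line 0 lives in P1
assert e.write_to(0) == 0           # new data for line 0 goes to P0
assert e.write_to(1) == 1           # new data for line 1 goes to P1
e.commit()
assert e.committed == [0, 1, 1, 0]  # 1010 XOR 1100 = 0110
assert e.updated == [0, 0, 0, 0]
```

Note how commit and abort touch only the two small bitmaps; no data is copied or written back, which is exactly how OSP avoids the extra writes of logging and classic shadow paging.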

Page Consolidation
Keeping two physical pages per page can waste memory space. To reduce storage cost, consolidate virtual pages that are not being actively updated: copy the valid data into one page and free the other. A TLB eviction identifies inactive virtual pages, so page consolidation is not a per-transaction overhead.
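Consolidation can be sketched as follows. This is a hypothetical illustration (the function and variable names are invented): once a page is inactive, its updated bitmap is all zeros, so each line's valid copy is selected by the committed bit alone.

```python
def consolidate(p0, p1, committed):
    """Gather each line's committed copy into p0 so p1 can be freed.

    committed[i] == 0: line i's valid data is already in p0.
    committed[i] == 1: line i's valid data is in p1 and must be copied.
    """
    for i, bit in enumerate(committed):
        if bit == 1:
            p0[i] = p1[i]      # pull the committed copy into p0
    return p0                  # p1 is now free for reuse

p0 = ["a0", "b0", "c0", "d0"]
p1 = ["a1", "b1", "c1", "d1"]
merged = consolidate(p0, p1, [0, 1, 1, 0])
assert merged == ["a0", "b1", "c1", "d0"]
```

Only the lines whose committed bit points at the page being freed are copied, so the cost is proportional to how scattered the committed data is, and it is paid only on TLB eviction.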

Multi-page Atomicity
A transaction may span several virtual pages (e.g., V1 and V2) whose committed bitmaps live in a consistent state table, and we can't atomically update separate locations in place.

Lightweight Journaling
On commit, first append each page's new committed bitmap to a journal (V1/Bitmap1, V2/Bitmap2), terminated by a TX-END record, and only then update the consistent state table. A journal without TX-END is uncompleted and is ignored during recovery. The journaling is lightweight and not a per-update overhead!
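The commit-time journal can be sketched as below. This is an illustrative model under assumed structures (a dict for the consistent state table, a list for the journal; none of these names come from the talk):

```python
state_table = {"V1": "1010", "V2": "0101"}  # VPN -> committed bitmap

def commit_tx(journal, updates):
    # Persist all journal records first, then the terminator.
    for vpn, bitmap in updates.items():
        journal.append((vpn, bitmap))
    journal.append("TX-END")             # transaction is now durable

def recover(journal):
    # Replay only journals that reached TX-END; discard the rest.
    if journal and journal[-1] == "TX-END":
        for vpn, bitmap in journal[:-1]:
            state_table[vpn] = bitmap

journal = []
commit_tx(journal, {"V1": "0110", "V2": "0011"})
recover(journal)
assert state_table == {"V1": "0110", "V2": "0011"}

incomplete = [("V2", "1111")]            # crash before TX-END was written
recover(incomplete)
assert state_table["V2"] == "0011"       # uncompleted journal has no effect
```

Because only the small bitmaps are journaled, rather than the data itself, the journal write is a fixed, per-commit cost instead of a per-update one.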

Experiment Setup
Based on McSimA+, with a 64-entry L1 DTLB.
Transactional workloads: array swap (SPS), hashtable (HT), red-black tree (RBT), B-tree (BT).
*-uni: inserts/deletes drawn uniformly at random; *-zipf: inserts/deletes following a Zipf distribution.
1 GB-4 GB memory footprint. Metric: CPU flushes.

CPU Flushes
[Chart: CPU flushes of OSP, normalized to the undo-log baseline, for SPS, HT-uni, HT-zipf, RBT-uni, RBT-zipf, BT-uni, and BT-zipf.]
OSP reduces the number of CPU flushes by 1.6x on average.

Breakdown
[Chart: normalized CPU flushes broken down into in-place writes, journaling, and consolidation for each workload.]
OSP nearly eliminates the consistency cost for workloads with locality.

Discussion
Limitations: the size of a transaction is limited by the TLB capacity, so a fallback path is needed.
TLB coherence for multi-threaded processes raises overhead and correctness questions.
Making the scheme work with virtual caches remains open.

Conclusion
We use the virtual memory system to implement efficient transactional updates, avoiding the extra copies required by logging: keep two copies of each page being modified and track modifications at the cache-line level, which also avoids the inefficiencies of traditional shadow paging. Only small hardware changes are needed (a TLB extension). Preliminary simulation shows great promise.

Questions?
Collaborators: Yuanjiang Ni (yni6@ucsc.edu), Jishen Zhao (jzhao@eng.ucsd.edu), Daniel Bittman (dbittman@ucsc.edu), Ethan Miller (elm@ucsc.edu)