Scalable Store-Load Forwarding via Store Queue Index Prediction

Tingting Sha, Milo M.K. Martin, Amir Roth
University of Pennsylvania
{shatingt, milom, amir}@cis.upenn.edu

Four Aspects of OoO Load/Store Execution
The four aspects
  1. Commit stores in order
  2. Detect memory-ordering violations
  3. Reduce memory-ordering violations
  4. Forward from stores to loads
... and state-of-the-art implementations
  1. Age-ordered store queue (SQ)
  2. Age-ordered load queue (LQ) + associative search
     Proposed: in-order load re-execution [Cain+ 04]
  3. Dependence prediction [Kessler+ 94, Moshovos+ 97, Chrysos+ 98]
  4. Associative store queue search
We consider the first three aspects solved

The Problem
[Figure: the load address (addr) goes to the D$ and to the store queue: addr (CAM), age logic, data (RAM)]
Age-ordered associative search is not scalable
  Simple associative search: CAM is slow, difficult to pipeline
    But can be partitioned (to reduce wire delay)
  Age-ordered associative search: CAM and age logic are slow
    And age logic makes partitioning difficult
  Load queue: load re-execution [Cain+ 04]
  Store queue: ?
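
To make the cost concrete, here is a minimal software sketch of what an age-ordered associative search has to compute. The `StoreEntry` type, the `associative_search` function, and the use of sequence numbers as ages are assumptions for illustration; hardware performs the address comparisons in parallel in the CAM and resolves the youngest match with the age logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    ssn: int   # age of the store; a larger number means a younger store
    addr: int  # store address
    data: int  # store data


def associative_search(sq: List[StoreEntry], load_addr: int,
                       youngest_older_ssn: int) -> Optional[int]:
    """Return the data of the youngest matching store older than the load.

    youngest_older_ssn is the age of the youngest store that precedes the
    load in program order (an assumed way to encode the load's age).
    """
    best = None
    for st in sq:
        # CAM part: compare the load address against every store address.
        if st.addr != load_addr:
            continue
        # Age logic part: ignore stores younger than the load ...
        if st.ssn > youngest_older_ssn:
            continue
        # ... and keep only the youngest of the remaining matches.
        if best is None or st.ssn > best.ssn:
            best = st
    return None if best is None else best.data
```

Every entry takes part in both the address match and the age comparison, which is the part that scales poorly with store queue size.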

Possible Solutions
Engineer around associative search
  Put your best designer on the store queue
  Longer access latency than D$ -> nasty scheduler, replays
  Put your best designer on the scheduler
Proposed: reduce associative search
  Reduce bandwidth
    Bloom-filtered SQ [Sethumadhavan+ 03]
  Reduce number of stores searched
    Pipelined/chained SQ [Park+ 03]
    Hierarchical/filtered SQ [Srinivasan+ 04, Ghandi+ 05, Torres+ 05]
    Decomposed SQ [Roth 05; Baugh+ 04]
Why not just eliminate associative search?

Our Solution
[Figure: the associative design (D$ beside addr (CAM), age logic, data (RAM)), marked NOT SCALABLE, next to the proposed design (D$ beside SQ index predictor, addr (RAM), data (RAM), with an == address comparison)]
Replace associative search with indexed access
  Replace address CAM + age logic with address RAM
  Predict one SQ index per load
    Predictor (e.g., Store Sets) is not on load critical path
  Keep address match -> allow false positives -> boost accuracy

Verification
Speculative indexed access requires verification
  In-order load re-execution prior to commit [Cain+ 04]
  Store Vulnerability Window (SVW) re-execution filter [Roth 05]
    + On average 3% of loads re-execute -> almost free
    + Works unmodified for speculative indexed SQ (address check)
Re-execution + indexed store queue = CAM-free load/store unit

Outline
Introduction and background
Forwarding index prediction
  Mechanism
  Evaluation
Delay index prediction
  Mechanism
  Evaluation
Conclusion

SSNs: A Naming System for Dynamic Stores
SSNs (Store Sequence Numbers)
  Required for SVW and more convenient than store queue indices
    + Can name committed stores
    + Fewer wrap-around issues
  Monotonically increasing
  SSN_commit: youngest committed store
  SSN_dispatch: youngest dispatched store (SSN_commit + SQ.NUM)
From SSN to store queue index?
  If st.ssn > SSN_commit, st.index = (st.ssn % SQ.SIZE)
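
The SSN-to-index rule above can be sketched as a small function. The name `ssn_to_sq_index` and the default size of 64 entries (taken from the 64-entry SQ used later in the evaluation) are assumptions for illustration.

```python
from typing import Optional

SQ_SIZE = 64  # assumed here; the evaluation uses a 64-entry store queue


def ssn_to_sq_index(ssn: int, ssn_commit: int,
                    sq_size: int = SQ_SIZE) -> Optional[int]:
    """Map a store sequence number to a store queue index.

    Only stores younger than SSN_commit are still in the store queue.
    A committed store has no SQ entry, so None is returned for it.
    """
    if ssn > ssn_commit:
        return ssn % sq_size
    return None
```

For example, with SSN_commit = 65 the store with SSN 70 sits in entry 70 % 64 = 6, while the store with SSN 65 has already left the queue.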

Indexed Forwarding Pipeline Actions
Pipeline: Fetch | Decode | Rename | Issue | Execute | Execute | SVW | Re-exec | Commit
[Figure: the forwarding predictor supplies SSN_fwd; the load address accesses the D$ and the SQ; an address comparison (=) controls the MUX that selects the load data]
Decode/rename: predict forwarding store (SSN_fwd)
Issue: wait for load address, SSN_fwd to execute
  Unify: SSN_fwd used for both forwarding and scheduling
Execute: index SQ, forward if address matches
SVW/re-execute: verify forwarding
Commit: train predictor
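
The execute step ("index SQ, forward if address matches") can be sketched as follows. This is an illustrative model under stated assumptions: the store queue is a Python list of `(addr, data)` tuples or `None`, the data cache is a dict, and the function name `execute_load` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

SQEntry = Optional[Tuple[str, int]]  # (address, data), or None if empty


def execute_load(load_addr: str, ssn_fwd: Optional[int], ssn_commit: int,
                 sq: List[SQEntry], dcache: Dict[str, int]) -> Tuple[int, bool]:
    """Return (value, forwarded) for a load using indexed SQ access.

    One SQ entry is read, chosen by the predicted SSN. The single address
    comparison decides between the SQ data and the data cache value.
    """
    if ssn_fwd is not None and ssn_fwd > ssn_commit:
        entry = sq[ssn_fwd % len(sq)]            # indexed RAM read, no CAM
        if entry is not None and entry[0] == load_addr:
            return entry[1], True                # address match: forward
    return dcache[load_addr], False              # otherwise use the cache
```

Because only the predicted entry is checked, a load whose real producer sits in a different SQ entry reads a stale cache value. That case is what the SVW/re-execution stage has to catch.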

Forwarding Index Predictor
[Figure, Decode/Rename stages: load PC -> FSP -> PC_fwd1, PC_fwd2 -> SAT -> SSN_fwd1, SSN_fwd2 -> MAX -> SSN_fwd]
Forwarding Store Predictor (FSP)
  Maps load PC to small set of likely-to-forward store PCs
  (Load) PC-indexed, set-associative, entry = {tag, (2) store PCs}
Store Alias Table (SAT)
  Maps store PC to its most recent store instance (SSN)
  (Store) PC-indexed, direct-mapped, entry = {ssn}
SSN_fwd: largest SSN (youngest in-flight store)
Design inspired by Store Sets [Chrysos+ 98]
  Used for load scheduling and forwarding
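
The two-table lookup can be sketched in a few lines. This is a simplified model, not the hardware organization: plain dictionaries stand in for the set-associative FSP and the direct-mapped SAT, tags and replacement are omitted, and the class and method names are assumptions.

```python
from typing import Dict, List, Optional


class ForwardingIndexPredictor:
    def __init__(self) -> None:
        self.fsp: Dict[str, List[str]] = {}  # load PC -> up to 2 store PCs
        self.sat: Dict[str, int] = {}        # store PC -> most recent SSN

    def rename_store(self, store_pc: str, ssn: int) -> None:
        """Record the newest dynamic instance of a static store."""
        self.sat[store_pc] = ssn

    def predict(self, load_pc: str, ssn_commit: int) -> Optional[int]:
        """Return SSN_fwd, the youngest in-flight candidate store, or None."""
        candidates = [self.sat[pc] for pc in self.fsp.get(load_pc, [])
                      if pc in self.sat]
        # Assumed filter: a committed store cannot forward from the SQ.
        in_flight = [ssn for ssn in candidates if ssn > ssn_commit]
        return max(in_flight) if in_flight else None
```

Taking the maximum SSN mirrors the MAX block in the figure: among the store PCs the FSP names, the load is paired with the youngest instance still in flight.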

Working Example: Forwarding (Store Part)
[Figure: pipeline Fetch | Decode | Rename | Issue | Execute | W-back | SVW | Re-exec | Commit carrying "PC Z: store 8, A" followed by "PC W: ld A". Predictor: FSP and SAT, indexed by PCs W, X, Y, Z; the FSP holds store PC Z and the SAT holds SSNs 18 and 19. SQ (head, tail): entry 19 holds store Z, entry 18 holds store Y (address B, data 4). Data cache: A = 5, B = 6. SSN_commit = 17.]

Working Example: Forwarding (Store Part)
[Figure, next frame: store Z has written address A, data 8 into SQ entry 19. SSN_commit = 17.]

Working Example: Forwarding (Load Part)
[Figure: "PC W: ld A" accesses the predictor: PC_fwd = ? SSN_commit = 17.]

Working Example: Forwarding (Load Part)
[Figure, next frame: PC_fwd = Z, SSN_fwd = ? SSN_commit = 17.]

Working Example: Forwarding (Load Part)
[Figure, next frame: SSN_fwd = 19. Store Y has committed: data cache A = 5, B = 4. SSN_commit = 18.]

Working Example: Verification (Store Part)
[Figure: "PC Z: store 8, A" commits and the data cache value of A changes from 5 to 8; the load now reads "PC W: ld A, 8". SSN_commit = 19.]

Working Example: Verification (Load Part)
[Figure: "PC W: ld A, 8" passes the SVW and re-execution stages. Data cache: A = 8, B = 4. SSN_commit = 19.]
Successful forwarding

Forwarding Index Predictor Training
Forwarding Store Predictor (FSP)
  Train on every load at commit
  Address-indexed tables track PCs, SSNs of last committed stores
    if (SSN_fwd == committed_store_ssn[load.addr])
      correct -> reinforce
    else if (load.PC_fwd != committed_store_pc[load.addr])
      wrong -> learn new store-load pair
    else
      wrong -> unlearn existing store-load pair (later)
Store Alias Table (SAT)
  Not a predictor, analogous to RAT (register alias table)
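
The three-way training test can be transcribed literally into a small function. This sketch only classifies the outcome. How the FSP reinforces, learns, or unlearns an entry is not specified on the slide, so those actions are left as labels. The function name and the dictionary-based tables are assumptions.

```python
from typing import Dict, Optional


def classify_training(ssn_fwd: Optional[int], pc_fwd: Optional[str],
                      load_addr: str,
                      committed_store_ssn: Dict[str, int],
                      committed_store_pc: Dict[str, str]) -> str:
    """Decide the FSP training action for a load at commit.

    The two tables are indexed by address and hold the SSN and PC of the
    last committed store to that address.
    """
    if ssn_fwd == committed_store_ssn.get(load_addr):
        return "reinforce"   # prediction named the right store instance
    if pc_fwd != committed_store_pc.get(load_addr):
        return "learn"       # a different static store wrote this address
    return "unlearn"         # right static store, wrong dynamic instance
```

The last case is the one the not-most-recent forwarding discussion returns to: the predicted store PC is right, but the SAT supplied the wrong instance.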

So Far, How Well Are We Doing?
What we care about
  Forwarding prediction accuracy
  Performance vs. associative SQ
Basic setup
  Simulator: SimpleScalar Alpha
  8-way superscalar, 19-stage pipe, 512-entry ROB
  Benchmarks: SPEC2000, MediaBench (only show 9 benchmarks)
Important parameters
  64KB D$: 3 cycles
  64-entry SQ: associative 3 & 5 cycles, indexed 3 cycles
    CACTI 3.2: 90nm, 1.1V, 3GHz
  FSP: 4K-entry, 2-way
    Bigger than Store Sets: all dependences, not just violations
    Probably OK: PC-indexed, in-order front-end, only 8KB

Forwarding Index Prediction Accuracy
[Figure: bar chart, percent of loads (y-axis 98 to 100) for lucas, jpeg.e, perl.s, apsi, gsm.e, eon.c, vortex, wupwise, mesa.t]
99.5%+ accuracy (better than branch prediction), why?
  ~88% of loads don't forward
    Address check -> these ~88% are right by design
  ~12% of loads forward
    Forwarding patterns are stable [Moshovos+ 97]

First Things First: Baseline Performance
[Figure: bar chart, relative execution time (y-axis 1 to 1.15) for lucas, jpeg.e, perl.s, apsi, gsm.e, eon.c, vortex, wupwise, mesa.t]
Baseline: 3-cycle associative SQ and perfect scheduling
+ (slightly modified) Store Sets scheduling
  Store Sets is basically a perfect scheduler
+ 5-cycle associative SQ (forwarding triggers replays)
  Latency/replays cause extra 1-10% slowdown

Indexed Forwarding Performance
[Figure: bar chart, relative execution time (y-axis 1 to 1.15) for lucas, jpeg.e, perl.s, apsi, gsm.e, eon.c, vortex, wupwise, mesa.t]
3-cycle indexed SQ (with unified scheduling)
  + Forwarding accuracy high -> outperforms 5-cycle associative
  - Forwarding accuracy low -> slowdowns

Sources of Low Forwarding Prediction Accuracy
[Figure: bar chart, relative execution time (y-axis 1 to 1.15) for gsm.e, eon.c, vortex, wupwise, mesa.t]
FSP limitation
  Our FSP: 2 stores per load
  eon.c and vortex
Not-most-recent forwarding
  Example: for (;;) { X[J] = A * X[J-2]; }
  When J = 5, load X[5-2] depends on store X[3]
    but SAT tracks the most recent store instance (X[4])
  mesa.t
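
A short simulation of the loop shows why a SAT that tracks only the most recent instance mispredicts here. The model is an assumption for illustration: one store per iteration, SSNs starting at 1, J starting at 2 and incrementing by one, and the function name `simulate_not_most_recent` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def simulate_not_most_recent(
        iterations: int,
        start_j: int = 2) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Model X[J] = A * X[J-2] with one static store and one static load.

    Returns (J, predicted_ssn, actual_ssn) per iteration, where
    predicted_ssn is the SAT content when the load is renamed and
    actual_ssn is the store instance that really wrote X[J-2].
    """
    sat_ssn: Optional[int] = None   # SAT entry of the single static store
    writer: Dict[int, int] = {}     # array index -> SSN of the store to it
    ssn = 0
    trace = []
    for j in range(start_j, start_j + iterations):
        predicted = sat_ssn          # most recent instance, i.e. X[J-1]
        actual = writer.get(j - 2)   # true producer, i.e. X[J-2]
        trace.append((j, predicted, actual))
        ssn += 1                     # the store to X[J] is renamed next
        writer[j] = ssn
        sat_ssn = ssn
    return trace
```

Once the loop is warmed up, the predicted SSN is always exactly one store younger than the real producer, a constant distance that a distance-based predictor can learn.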

Delay Index Prediction for Difficult Loads
Uniform solution to low forwarding prediction accuracy
  Convert flush (really bad) to scheduling delay (less bad)
  Delay load until uncertain stores commit, get value from cache
Pipeline: Fetch | Decode | Rename | Issue | Execute | W-back | SVW | Re-exec | Commit
[Figure: the indexed forwarding datapath (forwarding predictor, SSN_fwd, D$, SQ, address comparison, MUX) extended with a delay predictor that produces SSN_delay]
Decode/rename: predict forwarding store (SSN_fwd) + delay store (SSN_delay)
Issue: wait for load address, SSN_fwd to execute + SSN_delay to commit

Delay Index Predictor
[Figure, Decode/Rename stages: load PC -> FSP -> SAT -> MAX -> SSN_fwd; load PC -> DDP -> Distance_delay; SSN_dispatch and Distance_delay -> SUB -> SSN_delay]
Delay Distance Predictor (DDP)
  Maps load PC to distance (in stores) to forwarding store
  PC-indexed, set-associative table, entry = {tag, distance}
SSN_delay = SSN_dispatch - Distance_delay
Example: not-most-recent forwarding (for J = 5)
  SSN_dispatch (X[4]) - Distance_delay (1) = SSN_delay (X[3])
Inspired by Exclusive Collision Predictor [Yoaz+ 99]
See our paper for training
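
The subtraction and the resulting issue condition can be sketched as below. The DDP is modeled as a dictionary from load PC to distance, and both function names are assumptions. Training is not covered, as the slide defers it to the paper.

```python
from typing import Dict, Optional


def predict_delay_ssn(load_pc: str, ssn_dispatch: int,
                      ddp: Dict[str, int]) -> Optional[int]:
    """SSN_delay = SSN_dispatch - Distance_delay, or None without an entry."""
    distance = ddp.get(load_pc)
    if distance is None:
        return None
    return ssn_dispatch - distance


def delay_satisfied(ssn_delay: Optional[int], ssn_commit: int) -> bool:
    """A delayed load may issue once the store named by SSN_delay commits."""
    return ssn_delay is None or ssn_commit >= ssn_delay
```

In the J = 5 example, if the store to X[4] has SSN 4 and the learned distance is 1, then SSN_delay = 3 names the store to X[3]. The load waits until SSN_commit reaches 3 and then reads the value from the cache.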

Indexed Forwarding (with Delay) Performance
[Figure: bar chart, relative execution time (y-axis 1 to 1.15) for lucas, jpeg.e, perl.s, apsi, gsm.e, eon.c, vortex, wupwise, mesa.t]
3-cycle indexed SQ + delay
  Forwarding prediction is good: not much impact (sometimes ±1%)
  + FSP 2-store limitation: a few short delays
  ± Not-most-recent forwarding: many long delays
    Better than flushing but not much

Performance Summary
[Figure: bar chart, relative execution time (y-axis 1 to 1.15; the labels 31 and 16 also appear on the chart) for lucas, jpeg.e, perl.s, apsi, gsm.e, eon.c, vortex, wupwise, mesa.t]
3-cycle indexed SQ + delay
  + On average 3% slower than 3-cycle associative SQ (black)
  + On average as fast as 5-cycle associative SQ (blue)

Conclusion
Problem
  Scalability of store-load forwarding
Approach
  Keep age-ordered store queue
  Replace associative search with indexed access
  Adapted Store Sets predictor predicts forwarding index
  Adapted Exclusive Collision predictor delays difficult loads
Effectiveness
  + As fast as realistic 5-cycle associative SQ
  + But simpler: unified scheduler, no forwarding replays
Indexed store queue + load re-execution = CAM-free in-flight load/store unit

The End
Thank you!