Scalable Store-Load Forwarding via Store Queue Index Prediction

Size: px
Start display at page:

Download "Scalable Store-Load Forwarding via Store Queue Index Prediction"

Transcription

1 Scalable Store-Load Forwarding via Store Queue Index Prediction Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, addr addr addr (CAM) predictor addr (RAM) data (RAM) data ==

2 Four Aspects of OoO Load/Store Execution The four aspects 1. Commit stores in order 2. Detect memory-ordering violations 3. Reduce memory-ordering violations 4. Forward from stores to loads and state-of-the-art implementations 1. Age ordered store queue (SQ) 2. Age ordered load queue (LQ) + associative search Proposed: in-order load re-execution [Cain+ 04] 3. Dependence prediction [Kessler+ 94, Moshovos+ 97, Chrysos+ 98 ] 4. Associative store queue search We consider the first three aspects solved [ 2 ]

3 The Problem addr D$ addr (CAM) age logic data (RAM) data Age-ordered associative search is not scalable Simple associative search: CAM is slow, difficult to pipeline But can be partitioned (to reduce wire delay) Age-ordered associative search: CAM and age logic are slow And age logic makes partitioning difficult Load queue load re-execution [Cain+ 04] Store queue? [ 3 ]

4 Possible Solutions Engineer around associative search Put your best designer on the store queue Longer access latency than D$ nasty scheduler, replays Put your best designer on the scheduler Proposed: reduce associative search Reduce bandwidth Bloom-filtered SQ [Sethumadhavan+ 03] Reduce number of stores searched Pipelined/chained SQ [Park+ 03] Hierarchical/filtered SQ [Srinivasan+ 04, Ghandi+ 05, Torres+ 05] Decomposed SQ [Roth 05; Baugh+ 04] Why not just eliminate associative search? [ 4 ]

5 addr Our Solution addr SQ index predictor D$ addr (CAM) age logic NOT SCALABLE data (RAM) D$ == addr (RAM) data (RAM) data data Replace associative search with indexed access Replace address CAM + age logic with address RAM Predict one SQ index per load Predictor (e.g., Store Sets) is not on load critical path Keep address match allow false positives boost accuracy [ 5 ]

6 Verification Speculative indexed access requires verification In-order load re-execution prior to commit [Cain+ 04] Store Vulnerability Window (SVW) re-execution filter [Roth 05] + On average 3% of loads re-execute almost free + Works unmodified for speculative indexed SQ (address check) Re-execution + Indexed store queue = CAM-free load/store unit [ 6 ]

7 Outline Introduction and background Forwarding index prediction Mechanism Evaluation Delay index prediction Mechanism Evaluation Conclusion [ 7 ]

8 SSNs: A Naming System for Dynamic Stores SSNs (Store Sequence Numbers) Required for SVW and more convenient than store queue indices + Can name committed stores + Fewer wrap-around issues Monotonically increasing SSN commit : youngest committed store SSN dispatch : youngest dispatched store (SSN commit + SQ.NUM) From SSN to store queue index? If st.ssn > SSN commit, st.index = (st.ssn % SQ.SIZE) [ 8 ]

9 Indexed Forwarding Pipeline Actions Fetch Decode Rename Issue Execute Execute SVW Re-exec Commit address D$ forwarding predictor SSN fwd SQ = MUX data & Decode/rename: predict forwarding store (SSN fwd ) Issue: wait for load address, SSN fwd to execute Unify: SSN fwd used for both forwarding and scheduling Execute: index SQ, forward if address matches SVW/re-execute: verify forwarding Commit: train predictor [ 9 ]

10 Forwarding Index Predictor Decode Rename Load PC FSP PC fwd1 PC fwd2 predictor SAT SSN fwd1 SSN fwd2 MAX SSN fwd Forwarding Store Predictor (FSP) Maps load PC to small set of likely-to-forward store PCs (Load) PC-indexed, set-associative, entry={tag, (2) store PCs} Store Alias Table (SAT) Maps store PC to its most recent store instance (SSN) (Store) PC-indexed, direct-mapped, entry={ssn} SSN fwd : largest SSN (youngest in-flight store) Design inspired by Store Sets [Chrysos+ 98] Used for load scheduling and forwarding [ 10 ]

11 Working Example: Forwarding (Store Part) PC Z: store 8, A PC Z: PC W: store 8, A ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor 19 Z 18 A B Y 5 6 B 4 Data cache SSN commit = 17 tail tail head SQ [ 11 ]

12 Working Example: Forwarding (Store Part) PC Z: store 8, A PC Z: PC W: store 8, A ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor A B Z Y 5 6 A B 8 4 SQ tail head Data cache SSN commit = 17 [ 12 ]

13 Working Example: Forwarding (Load Part) PC fwd =? PC W: ld A PC Z: PC W: store 8, A ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor A B Z Y 5 6 A B 8 4 SQ tail head Data cache SSN commit = 17 [ 13 ]

14 Working Example: Forwarding (Load Part) SSN fwd =? PC fwd = Z PC W: ld A PC Z: PC W: store 8, A ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor A B Z Y 5 6 A B 8 4 SQ tail head Data cache SSN commit = 17 [ 14 ]

15 Working Example: Forwarding (Load Part) SSN fwd = 19 PC W: ld A PC Z: PC W: store 8, A ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor A B Z Y 5 4 A B 8 4 tail head Data cache SSN commit = 18 SQ [ 15 ]

16 Working Example: Verification (Store Part) PC Z: store 8, A PC W: ld A, 8 PC Z: store 8, A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit FSP SAT W X Y Z Z Predictor 19 Z Y A B 8 4 SQ head tail A B 58 4 Data cache SSN commit = 19 [ 16 ]

17 Working Example: Verification (Load Part) PC Z: store 8, A PC W: ld A, 8 Successful Forwarding PC W: ld A Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit A B FSP SAT Z Y 8 4 W X Y Z A Z B Data cache 8 4 Predictor SQ head tail SSN commit = 19 [ 17 ]

18 Forwarding Index Predictor Training Forwarding Store Predictor (FSP) Train on every load at commit Address-indexed tables track PCs, SSNs of last committed stores if (SSN fwd == committed_store_ssn[load.addr]) correct -> re-inforce else if (load.pc fwd!= committed_store_pc[load.addr]) wrong -> learn new load-store pair else wrong -> unlearn existing store-load pair (later) Store Alias Table (SAT) Not a predictor, analogous to RAT (register alias table) [ 18 ]

19 So Far, How Well Are We Doing? What we care about Forwarding prediction accuracy Performance vs. associative SQ Basic setup Simulator: simplescalar Alpha 8-way superscalar, 19 stage pipe, 512-entry ROB Benchmarks: SPEC2000, Mediabench (only show 9 benchmarks) Important Parameters 64KB D$: 3 cycles 64-entry SQ: associative 3 & 5 cycles, indexed 3 cycles CACTI 3.2: 90nm, 1.1V, 3GHz FSP: 4K-entry, 2-way Bigger than Store Sets: all dependences, not just violations Probably OK: PC-indexed, in-order front-end, only 8KB [ 19 ]

20 Forwarding Index Prediction Accuracy 100 Percent of loads lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t 99.5%+ accuracy (better than branch prediction), why? ~88% loads don t forward Address check these ~88% are right by design ~12% loads forward Forwarding patterns are stable [Moshovos+, 97] [ 20 ]

21 First Things First: Baseline Performance Relative execution time lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t Baseline: 3 cycle associative SQ and perfect scheduling + (slightly modified) Store Sets scheduling Store sets is basically a perfect scheduler + 5 cycle associative SQ (forwarding triggers replays) Latency/replays cause extra 1-10% slowdown [ 21 ]

22 Indexed Forwarding Performance Relative execution time lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t 3 cycle indexed SQ (with unified scheduling) + Forwarding accuracy high outperforms 5-cycle associative Forwarding accuracy low slowdowns [ 22 ]

23 Sources of Low Forwarding Prediction Accuracy FSP limitation Our FSP: 2 stores per load eon.c and vortex gsm.e eon.c vortex wupwise mesa.t Not-most-recent forwarding Example:for (;;) { X[J] = A * X[J 2]; } WhenJ = 5, loadx[5 2] depends on storex[3] but SAT tracks the most recent store instance (X[4]) mesa.t [ 23 ]

24 Delay Index Prediction for Difficult Loads Uniform solution to low forwarding prediction accuracy Convert flush (really bad) to scheduling delay (less bad) Delay load until uncertain stores commit, get value from cache Fetch Decode Rename Issue Execute W-back SVW Re-exec Commit address D$ forwarding predictor SSN fwd SQ = MUX data delay predictor SSN delay & Decode/rename: predict + delay store (SSN delay ) Issue: wait for + SSN delay to commit [ 24 ]

25 Delay Index Predictor Decode Rename Load PC FSP SAT MAX DDP Distance Delay delay predictor Delay Distance Predictor (DDP) SSN dispatch SUB Maps load PC to distance (in stores) to forwarding store PC-indexed, set-associative table, entry = {tag, distance} SSN fwd SSN delay SSN delay = SSN dispatch Distance delay Example: not-most-recent forwarding (forj=5) SSN dispatch (X[4]) Distance delay (1) SSN delay (X[3]) Inspired by Exclusive Collision Predictor [Yoaz+ 99] See our paper for training [ 25 ]

26 Indexed Forwarding (with Delay) Performance Relative execution time lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t [ 26 ]

27 Indexed Forwarding (with Delay) Performance Relative execution time lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t 3 cycle indexed SQ + delay Forwarding prediction is good: not much impact (sometimes ±1%) + FSP 2-store limitation: a few short delays ± Not-most-recent forwarding: many long delays Better than flushing but not much [ 27 ]

28 Performance Summary Relative execution time lucas jpeg.e perl.s apsi gsm.e eon.c vortex wupwise mesa.t 3 cycle indexed SQ + delay + On average 3% slower than 3 cycle associative SQ (black) + On average as fast as 5-cycle associative SQ (blue) [ 28 ]

29 Problem Conclusion Scalability of store-load forwarding Approach Keep age-ordered store queue Replace associative search with indexed access Adapted Store-Sets predictor predicts forwarding index Adapted Exclusive Collision predictor delays difficult loads Effectiveness + As fast as realistic 5 cycle associative SQ + But simpler: unified scheduler, no forwarding replays, Indexed store queue + load re-execution = CAM free in-flight load/store unit [ 29 ]

30 The End Thank you! [ 30 ]

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide) Out-of-order Pipeline Buffer of instructions Issue = Select + Wakeup Select N oldest, read instructions N=, xor N=, xor and sub Note: ma have execution resource constraints: i.e., load/store/fp Fetch Decode

More information

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example This Unit: Scheduling (Static + Dnamic) CIS 50 Computer Architecture Unit 8: Static and Dnamic Scheduling Application OS Compiler Firmware CPU I/O Memor Digital Circuits Gates & Transistors! Previousl:!

More information

CSCI-564 Advanced Computer Architecture

CSCI-564 Advanced Computer Architecture CSCI-564 Advanced Computer Architecture Lecture 8: Handling Exceptions and Interrupts / Superscalar Bo Wu Colorado School of Mines Branch Delay Slots (expose control hazard to software) Change the ISA

More information

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I. Last (family) name: Solution First (given) name: Student I.D. #: Department of Electrical and Computer Engineering University of Wisconsin - Madison ECE/CS 752 Advanced Computer Architecture I Midterm

More information

CMP N 301 Computer Architecture. Appendix C

CMP N 301 Computer Architecture. Appendix C CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)

More information

Unit 6: Branch Prediction

Unit 6: Branch Prediction CIS 501: Computer Architecture Unit 6: Branch Prediction Slides developed by Joe Devie/, Milo Mar4n & Amir Roth at Upenn with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi,

More information

EE 660: Computer Architecture Out-of-Order Processors

EE 660: Computer Architecture Out-of-Order Processors EE 660: Computer Architecture Out-of-Order Processors Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Based on the slides of Prof. David entzlaff Agenda I4 Processors I2O2

More information

CPSC 3300 Spring 2017 Exam 2

CPSC 3300 Spring 2017 Exam 2 CPSC 3300 Spring 2017 Exam 2 Name: 1. Matching. Write the correct term from the list into each blank. (2 pts. each) structural hazard EPIC forwarding precise exception hardwired load-use data hazard VLIW

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim Add: 2 cycles FE_stage add r1, r2, r3 FE L ID L EX L MEM L WB L add add sub r4, r1, r3 sub sub add add mul r5, r2, r3 mul sub sub add add mul sub sub add add mul sub sub add

More information

Portland State University ECE 587/687. Branch Prediction

Portland State University ECE 587/687. Branch Prediction Portland State University ECE 587/687 Branch Prediction Copyright by Alaa Alameldeen and Haitham Akkary 2015 Branch Penalty Example: Comparing perfect branch prediction to 90%, 95%, 99% prediction accuracy,

More information

A Detailed Study on Phase Predictors

A Detailed Study on Phase Predictors A Detailed Study on Phase Predictors Frederik Vandeputte, Lieven Eeckhout, and Koen De Bosschere Ghent University, Electronics and Information Systems Department Sint-Pietersnieuwstraat 41, B-9000 Gent,

More information

CIS 371 Computer Organization and Design

CIS 371 Computer Organization and Design CIS 371 Computer Organization and Design Unit 7: Caches Based on slides by Prof. Amir Roth & Prof. Milo Martin CIS 371: Comp. Org. Prof. Milo Martin Caches 1 This Unit: Caches I$ Core L2 D$ Main Memory

More information

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference) ECE 3401 Lecture 23 Pipeline Design Control State Register Combinational Control Logic New/ Modified Control Word ISA: Instruction Specifications (for reference) P C P C + 1 I N F I R M [ P C ] E X 0 PC

More information

CHARACTERIZATION AND CLASSIFICATION OF MODERN MICRO-PROCESSOR BENCHMARKS KUNXIANG YAN, B.S. A thesis submitted to the Graduate School

CHARACTERIZATION AND CLASSIFICATION OF MODERN MICRO-PROCESSOR BENCHMARKS KUNXIANG YAN, B.S. A thesis submitted to the Graduate School CHARACTERIZATION AND CLASSIFICATION OF MODERN MICRO-PROCESSOR BENCHMARKS BY KUNXIANG YAN, B.S. A thesis submitted to the Graduate School in partial fulfillment of the requirements for the degree Master

More information

/ : Computer Architecture and Design

/ : Computer Architecture and Design 16.482 / 16.561: Computer Architecture and Design Summer 2015 Homework #5 Solution 1. Dynamic scheduling (30 points) Given the loop below: DADDI R3, R0, #4 outer: DADDI R2, R1, #32 inner: L.D F0, 0(R1)

More information

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT CSE 560 Practice Problem Set 4 Solution 1. In this question, you will examine several different schemes for branch prediction, using the following code sequence for a simple load store ISA with no branch

More information

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 2. (2 )What are the two main ways to define performance? 3. (2 )What

More information

Performance Prediction Models

Performance Prediction Models Performance Prediction Models Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke {shrupad,lukefahr,reetudas,mahlke}@umich.edu Gem5 workshop Micro 2012 December 2, 2012 Composite Cores Pushing

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] When we talk about the number of operands in an instruction (a 1-operand or a 2-operand instruction, for example), what do we mean? [2] What are the two main ways to define performance? [2] Predicting

More information

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018

ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018 ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for

More information

3. (2) What is the difference between fixed and hybrid instructions?

3. (2) What is the difference between fixed and hybrid instructions? 1. (2 pts) What is a "balanced" pipeline? 2. (2 pts) What are the two main ways to define performance? 3. (2) What is the difference between fixed and hybrid instructions? 4. (2 pts) Clock rates have grown

More information

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation

LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation Jingyang Zhu 1, Zhiliang Qian 2, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and

More information

Enrico Nardelli Logic Circuits and Computer Architecture

Enrico Nardelli Logic Circuits and Computer Architecture Enrico Nardelli Logic Circuits and Computer Architecture Appendix B The design of VS0: a very simple CPU Rev. 1.4 (2009-10) by Enrico Nardelli B - 1 Instruction set Just 4 instructions LOAD M - Copy into

More information

Microprocessor Power Analysis by Labeled Simulation

Microprocessor Power Analysis by Labeled Simulation Microprocessor Power Analysis by Labeled Simulation Cheng-Ta Hsieh, Kevin Chen and Massoud Pedram University of Southern California Dept. of EE-Systems Los Angeles CA 989 Outline! Introduction! Problem

More information

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Digital Logic Our goal for the next few weeks is to paint a a reasonably complete picture of how we can go from transistor

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19

More information

COVER SHEET: Problem#: Points

COVER SHEET: Problem#: Points EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. fetch decode & eg-fetch execute

More information

Vector Lane Threading

Vector Lane Threading Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program

More information

Department of Electrical and Computer Engineering The University of Texas at Austin

Department of Electrical and Computer Engineering The University of Texas at Austin Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:

More information

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using

More information

Lecture: Pipelining Basics

Lecture: Pipelining Basics Lecture: Pipelining Basics Topics: Performance equations wrap-up, Basic pipelining implementation Video 1: What is pipelining? Video 2: Clocks and latches Video 3: An example 5-stage pipeline Video 4:

More information

THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University

THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY Daniel Sanchez and Christos Kozyrakis Stanford University MICRO-43, December 6 th 21 Executive Summary 2 Mitigating the memory wall requires large, highly

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] What are the two main ways to define performance? [2] Predicting the direction of a branch is not enough. What else is necessary? [2] The power consumed by a chip has increased over time, but the clock

More information

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Simple Instruction-Pipelining. Pipelined Harvard Datapath 6.823, L8--1 Simple ruction-pipelining Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. I fetch decode & eg-fetch execute memory Clock period

More information

ICS 233 Computer Architecture & Assembly Language

ICS 233 Computer Architecture & Assembly Language ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by

More information

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2 Pipelining CS 365 Lecture 12 Prof. Yih Huang CS 365 1 Traditional Execution 1 2 3 4 1 2 3 4 5 1 2 3 add ld beq CS 365 2 1 Pipelined Execution 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

More information

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B. Javadi, H. Honda, K. Inoue, K. Murakami Faculty

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering TIMING ANALYSIS Overview Circuits do not respond instantaneously to input changes

More information

Profile-Based Adaptation for Cache Decay

Profile-Based Adaptation for Cache Decay Profile-Based Adaptation for Cache Decay KARTHIK SANKARANARAYANAN and KEVIN SKADRON University of Virginia Cache decay is a set of leakage-reduction mechanisms that put cache lines that have not been accessed

More information

CS 700: Quantitative Methods & Experimental Design in Computer Science

CS 700: Quantitative Methods & Experimental Design in Computer Science CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,

More information

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE CARNEGIE MELLON UNIVERSITY AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree

More information

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II CSE241 VLSI Digital Circuits Winter 2003 Lecture 07: Timing II CSE241 L3 ASICs.1 Delay Calculation Cell Fall Cap\Tr 0.05 0.2 0.5 0.01 0.02 0.16 0.30 0.5 2.0 0.04 0.32 0.178 0.08 0.64 0.60 1.20 0.1ns 0.147ns

More information

Memory, Latches, & Registers

Memory, Latches, & Registers Memory, Latches, & Registers 1) Structured Logic Arrays 2) Memory Arrays 3) Transparent Latches 4) How to save a few bucks at toll booths 5) Edge-triggered Registers L13 Memory 1 General Table Lookup Synthesis

More information

Computer Architecture

Computer Architecture Lecture 2: Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture CPU Evolution What is? 2 Outline Measurements and metrics : Performance, Cost, Dependability, Power Guidelines

More information

CIS 371 Computer Organization and Design

CIS 371 Computer Organization and Design CIS 371 Computer Organization and Design Unit 13: Power & Energy Slides developed by Milo Mar0n & Amir Roth at the University of Pennsylvania with sources that included University of Wisconsin slides by

More information

Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors

Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors Qiang Wu Philo Juang Margaret Martonosi Douglas W. Clark Depts. of Computer Science and Electrical Engineering

More information

4. (3) What do we mean when we say something is an N-operand machine?

4. (3) What do we mean when we say something is an N-operand machine? 1. (2) What are the two main ways to define performance? 2. (2) When dealing with control hazards, a prediction is not enough - what else is necessary in order to eliminate stalls? 3. (3) What is an "unbalanced"

More information

Origami: Folding Warps for Energy Efficient GPUs

Origami: Folding Warps for Energy Efficient GPUs Origami: Folding Warps for Energy Efficient GPUs Mohammad Abdel-Majeed*, Daniel Wong, Justin Huang and Murali Annavaram* * University of Southern alifornia University of alifornia, Riverside Stanford University

More information

CSE Computer Architecture I

CSE Computer Architecture I Execution Sequence Summary CSE 30321 Computer Architecture I Lecture 17 - Multi Cycle Control Michael Niemier Department of Computer Science and Engineering Step name Instruction fetch Instruction decode/register

More information

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Raj Parihar Advisor: Prof. Michael C. Huang March 22, 2013 Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

More information

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern

More information

Contents. Chapter 3 Combinational Circuits Page 1 of 36

Contents. Chapter 3 Combinational Circuits Page 1 of 36 Chapter 3 Combinational Circuits Page of 36 Contents Combinational Circuits...2 3. Analysis of Combinational Circuits...3 3.. Using a Truth Table...3 3..2 Using a Boolean Function...6 3.2 Synthesis of

More information

CS 52 Computer rchitecture and Engineering Lecture 4 - Pipelining Krste sanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs52!

More information

Branch Prediction using Advanced Neural Methods

Branch Prediction using Advanced Neural Methods Branch Prediction using Advanced Neural Methods Sunghoon Kim Department of Mechanical Engineering University of California, Berkeley shkim@newton.berkeley.edu Abstract Among the hardware techniques, two-level

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

BETTER BRANCH PREDICTION THROUGH PROPHET/CRITIC HYBRIDS

BETTER BRANCH PREDICTION THROUGH PROPHET/CRITIC HYBRIDS BETTER BRANCH PREDICTION THROUGH PROPHET/CRITIC HYBRIDS THE PROPHET/CRITIC HYBRID CONDITIONAL BRANCH PREDICTOR HAS TWO COMPONENT PREDICTORS. THE PROPHET USES A BRANCH S HISTORY TO PREDICT ITS DIRECTION.

More information

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College Why analysis? We want to predict how the algorithm will behave (e.g. running time) on arbitrary inputs, and how it will

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 21: Shifters, Decoders, Muxes [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN

More information

Automata Theory CS S-12 Turing Machine Modifications

Automata Theory CS S-12 Turing Machine Modifications Automata Theory CS411-2015S-12 Turing Machine Modifications David Galles Department of Computer Science University of San Francisco 12-0: Extending Turing Machines When we added a stack to NFA to get a

More information

Energy Delay Optimization

Energy Delay Optimization EE M216A.:. Fall 21 Lecture 8 Energy Delay Optimization Prof. Dejan Marković ee216a@gmail.com Some Common Questions Is sizing better than V DD for energy reduction? What are the optimal values of gate

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 9

ECE 571 Advanced Microprocessor-Based Design Lecture 9 ECE 571 Advanced Microprocessor-Based Design Lecture 9 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 20 February 2018 Announcements HW#4 was posted. About branch predictors Don

More information

ALU, Latches and Flip-Flops

ALU, Latches and Flip-Flops CSE14: Components and Design Techniques for Digital Systems ALU, Latches and Flip-Flops Tajana Simunic Rosing Where we are. Last time: ALUs Plan for today: ALU example, latches and flip flops Exam #1 grades

More information

Programmable Logic Devices II

Programmable Logic Devices II Lecture 04: Efficient Design of Sequential Circuits Prof. Arliones Hoeller arliones.hoeller@ifsc.edu.br Prof. Marcos Moecke moecke@ifsc.edu.br 1 / 94 Reference These slides are based on the material made

More information

Timing Analysis with Clock Skew

Timing Analysis with Clock Skew , Mark Horowitz 1, & Dean Liu 1 David_Harris@hmc.edu, {horowitz, dliu}@vlsi.stanford.edu March, 1999 Harvey Mudd College Claremont, CA 1 (with Stanford University, Stanford, CA) Outline Introduction Timing

More information

I Can t Believe It s Not Causal! Scalable Causal Consistency with No Slowdown Cascades

I Can t Believe It s Not Causal! Scalable Causal Consistency with No Slowdown Cascades I Can t Believe It s Not Causal! Scalable Causal Consistency with No Slowdown Cascades Syed Akbar Mehdi 1, Cody Littley 1, Natacha Crooks 1, Lorenzo Alvisi 1,4, Nathan Bronson 2, Wyatt Lloyd 3 1 UT Austin,

More information

Boolean Algebra and Digital Logic 2009, University of Colombo School of Computing

Boolean Algebra and Digital Logic 2009, University of Colombo School of Computing IT 204 Section 3.0 Boolean Algebra and Digital Logic Boolean Algebra 2 Logic Equations to Truth Tables X = A. B + A. B + AB A B X 0 0 0 0 3 Sum of Products The OR operation performed on the products of

More information

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks

Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Leveraging Transactional Memory for a Predictable Execution of Applications Composed of Hard Real-Time and Best-Effort Tasks Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science,

More information

Lecture 5 - Assembly Programming(II), Intro to Digital Filters

Lecture 5 - Assembly Programming(II), Intro to Digital Filters GoBack Lecture 5 - Assembly Programming(II), Intro to Digital Filters James Barnes (James.Barnes@colostate.edu) Spring 2009 Colorado State University Dept of Electrical and Computer Engineering ECE423

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis Advanced Regression Techniques CS 147: Computer Systems Performance Analysis Advanced Regression Techniques 1 / 31 Overview Overview Overview Common Transformations

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 09/10, Jan., 2018 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 43 Most Essential Assumptions for Real-Time Systems Upper

More information

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk

ECE290 Fall 2012 Lecture 22. Dr. Zbigniew Kalbarczyk ECE290 Fall 2012 Lecture 22 Dr. Zbigniew Kalbarczyk Today LC-3 Micro-sequencer (the control store) LC-3 Micro-programmed control memory LC-3 Micro-instruction format LC -3 Micro-sequencer (the circuitry)

More information

Control-Theoretic Techniques and Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Management

Control-Theoretic Techniques and Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Management Control-Theoretic Techniques and Thermal-RC Modeling for Accurate and Localized ynamic Thermal Management UNIV. O VIRGINIA EPT. O COMPUTER SCIENCE TECH. REPORT CS-2001-27 Kevin Skadron, Tarek Abdelzaher,

More information

Lecture 34: Portable Systems Technology Background Professor Randy H. Katz Computer Science 252 Fall 1995

Lecture 34: Portable Systems Technology Background Professor Randy H. Katz Computer Science 252 Fall 1995 Lecture 34: Portable Systems Technology Background Professor Randy H. Katz Computer Science 252 Fall 1995 RHK.F95 1 Technology Trends: Microprocessor Capacity 100000000 10000000 Pentium Transistors 1000000

More information

A Mathematical Solution to. by Utilizing Soft Edge Flip Flops

A Mathematical Solution to. by Utilizing Soft Edge Flip Flops A Mathematical Solution to Power Optimal Pipeline Design by Utilizing Soft Edge Flip Flops M. Ghasemazar, B. Amelifard, M. Pedram University of Southern California Department of Electrical Engineering

More information

Reducing NVM Writes with Optimized Shadow Paging

Reducing NVM Writes with Optimized Shadow Paging Reducing NVM Writes with Optimized Shadow Paging Yuanjiang Ni, Jishen Zhao, Daniel Bittman, Ethan L. Miller Center for Research in Storage Systems University of California, Santa Cruz Emerging Technology

More information

Numbers and Arithmetic

Numbers and Arithmetic Numbers and Arithmetic See: P&H Chapter 2.4 2.6, 3.2, C.5 C.6 Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Big Picture: Building a Processor memory inst register file alu

More information

ENEE350 Lecture Notes-Weeks 14 and 15

ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl s Law ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining is a method of processing in which a problem is divided into a number of sub problems and solved and the solu8ons of the sub problems

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

Scalability is Quantifiable

Scalability is Quantifiable Scalability is Quantifiable Universal Scalability Law Baron Schwartz - November 2017 Logistics & Stuff Slides will be posted :) Ask questions anytime! Founder of VividCortex Wrote High Performance MySQL

More information

One sided tests. An example of a two sided alternative is what we ve been using for our two sample tests:

One sided tests. An example of a two sided alternative is what we ve been using for our two sample tests: One sided tests So far all of our tests have been two sided. While this may be a bit easier to understand, this is often not the best way to do a hypothesis test. One simple thing that we can do to get

More information

C.K. Ken Yang UCLA Courtesy of MAH EE 215B

C.K. Ken Yang UCLA Courtesy of MAH EE 215B Decoders: Logical Effort Applied C.K. Ken Yang UCLA yang@ee.ucla.edu Courtesy of MAH 1 Overview Reading Rabaey 6.2.2 (Ratio-ed logic) W&H 6.2.2 Overview We have now gone through the basics of decoders,

More information

Fast Path-Based Neural Branch Prediction

Fast Path-Based Neural Branch Prediction Fast Path-Based Neral Branch Prediction Daniel A. Jiménez http://camino.rtgers.ed Department of Compter Science Rtgers, The State University of New Jersey Overview The context: microarchitectre Branch

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

Fundamentals of Computer Systems

Fundamentals of Computer Systems Fundamentals of Computer Systems Review for the Final Stephen A. Edwards Columbia University Summer 25 The Final 2 hours 8 problems Closed book Simple calculators are OK, but unnecessary One double-sided

More information

CHAPTER log 2 64 = 6 lines/mux or decoder 9-2.* C = C 8 V = C 8 C * 9-4.* (Errata: Delete 1 after problem number) 9-5.

CHAPTER log 2 64 = 6 lines/mux or decoder 9-2.* C = C 8 V = C 8 C * 9-4.* (Errata: Delete 1 after problem number) 9-5. CHPTER 9 2008 Pearson Education, Inc. 9-. log 2 64 = 6 lines/mux or decoder 9-2.* C = C 8 V = C 8 C 7 Z = F 7 + F 6 + F 5 + F 4 + F 3 + F 2 + F + F 0 N = F 7 9-3.* = S + S = S + S S S S0 C in C 0 dder

More information

Number of solutions of a system

Number of solutions of a system Roberto s Notes on Linear Algebra Chapter 3: Linear systems and matrices Section 7 Number of solutions of a system What you need to know already: How to solve a linear system by using Gauss- Jordan elimination.

More information

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining

CMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining CMU 18-447 Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining Instructor: Prof Onur Mutlu TAs: Rachata Ausavarungnirun, Kevin Chang, Albert Cho, Jeremie

More information

Computer Architecture ELEC2401 & ELEC3441

Computer Architecture ELEC2401 & ELEC3441 Last Time Pipeline Hazard Computer Architecture ELEC2401 & ELEC3441 Lecture 8 Pipelining (3) Dr. Hayden Kwok-Hay So Department of Electrical and Electronic Engineering Structural Hazard Hazard Control

More information

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions In the Proceedings of the International Conference on Dependable Systems and Networks(DSN 07), June 2007. Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions Xiaodong Li,

More information

Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors

Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors Formal Online Methods for Voltage/Frequency Control in Multiple Clock Domain Microprocessors Qiang Wu Philo Juang Margaret Martonosi Douglas W. Clark Dept. of Computer Science, Dept. of Electrical Engineering

More information

Distributed Systems Part II Solution to Exercise Sheet 10

Distributed Systems Part II Solution to Exercise Sheet 10 Distributed Computing HS 2012 Prof. R. Wattenhofer / C. Decker Distributed Systems Part II Solution to Exercise Sheet 10 1 Spin Locks A read-write lock is a lock that allows either multiple processes to

More information

A Novel Meta Predictor Design for Hybrid Branch Prediction

A Novel Meta Predictor Design for Hybrid Branch Prediction A Novel Meta Predictor Design for Hybrid Branch Prediction YOUNG JUNG AHN, DAE YON HWANG, YONG SUK LEE, JIN-YOUNG CHOI AND GYUNGHO LEE The Dept. of Computer Science & Engineering Korea University Anam-dong

More information

Synchronous Reactive Systems

Synchronous Reactive Systems Synchronous Reactive Systems Stephen Edwards sedwards@synopsys.com Synopsys, Inc. Outline Synchronous Reactive Systems Heterogeneity and Ptolemy Semantics of the SR Domain Scheduling the SR Domain 2 Reactive

More information

Skew-Tolerant Circuit Design

Skew-Tolerant Circuit Design Skew-Tolerant Circuit Design David Harris David_Harris@hmc.edu December, 2000 Harvey Mudd College Claremont, CA Outline Introduction Skew-Tolerant Circuits Traditional Domino Circuits Skew-Tolerant Domino

More information

Models of Computation, Recall Register Machines. A register machine (sometimes abbreviated to RM) is specified by:

Models of Computation, Recall Register Machines. A register machine (sometimes abbreviated to RM) is specified by: Models of Computation, 2010 1 Definition Recall Register Machines A register machine (sometimes abbreviated M) is specified by: Slide 1 finitely many registers R 0, R 1,..., R n, each capable of storing

More information

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &

More information