Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.


Last (family) name: Solution
First (given) name:
Student I.D. #:

Department of Electrical and Computer Engineering
University of Wisconsin - Madison
ECE/CS 752 Advanced Computer Architecture I
Midterm Exam 1
Wednesday, October 25, 2017

Instructions:
1. Open book/open notes.
2. You may use a calculator, but no phones or laptops.
3. You must show your complete work. Points will be awarded only based on what appears in your answers.
4. Failure to follow instructions may result in forfeiture of your exam and will be handled according to UWS 14 Academic misconduct procedures.

Problem  Type                              Points  Score
1-20     Buzzword Bingo, 2 pts each        40
21       Replacement Policies              20
22       TAGE vs Neural Branch Prediction  20
23       Register Renaming                 20
Total                                      100

ECE/CS 752 Fall 2017 Midterm 1 -- Page 1

Problems 1-20 (40 pts): Buzzword Bingo; match each definition to its buzzword

Definition
 1. _V_    Keeps prefetches from polluting the cache
 2. _I_    Very accurate for a branch that is always taken
 3. _S_    Tells the front end to fetch instructions from the call site
 4. _M_    These enable fast recovery from branch mispredictions
 5. _C_    Hardware technique that reads blocks from memory before they are needed
 6. _X_    Enables precise exceptions
 7. _L,Y_  Prevents the instruction cache from delivering the peak number of instructions per access
 8. _G_    Picks which of multiple ready instructions can issue next
 9. _N_    Replacement policy that evicts block with next reference furthest in the future
10. _A_    Keeps track of a pending cache miss
11. _E,R_  Used in the RS/6000 (RIOS-I) floating point unit
12. _J_    Useful for predicting branches that correlate with other branches
13. _P_    Keeps track of loads that have not yet committed
14. _W_    Improvements from integrating RF blocks and accelerators on the die
15. _R_    Tracks the current location of registers
16. _D_    Another name for rename registers
17. _B_    Decodes complex VAX instructions into equivalent sequences of micro-ops
18. _U,O_  Needed to match an aliased load with the correct store
19. _F_    Informs dependent instructions that their operands are ready
20. _Z_    Allows loads to issue before prior store addresses are known

Buzzword
A. MSHR
B. HPS
C. Stride Prefetching
D. Future File
E. Physical Register File
F. Wakeup logic
G. Select logic
H. Local branch history
I. Bimodal predictor
J. Global branch history
K. Bypass network
L. Fetch address alignment
M. Checkpoints
N. Belady's optimal
O. Store Queue
P. Load Queue
Q. Pseudo-LRU
R. RAM Map Table
S. Return address stack
T. Branch color
U. Store color
V. Stream buffer
W. More than Moore
X. Reorder buffer
Y. Flynn's bottleneck
Z. Speculative disambiguation

21. Replacement Policies [20 pts]

a) (10 pts) Assume an 8-way set-associative cache that implements the tree-based pseudo-LRU replacement policy described in lecture. Given the initial pseudo-LRU state illustrated below, simulate the trace of references given in the table, recording the PLRU block after each reference. Record the PLRU bit for each node in the tree (0 is up, 1 is down).

[Figure: pseudo-LRU tree with nodes T, M0, M1, B0, B1, B2, B3 over the eight ways]
Way 0: A
Way 1: B
Way 2: C
Way 3: D
Way 4: E
Way 5: F
Way 6: G
Way 7: H

Reference  T  M0  M1  B0  B1  B2  B3  PLRU Block
Initial    0  0   1   1   1   0   1   Way 1: B
G          0  0   0   1   1   0   1   Way 1: B
F          0  0   1   1   1   0   1   Way 1: B
C          1  0   1   1   1   0   1   Way 7: H
E          0  0   1   1   1   1   1   Way 1: B
H          0  0   0   1   1   1   0   Way 1: B
G          0  0   0   1   1   1   1   Way 1: B
D          1  0   0   1   0   1   1   Way 5: F
E          0  0   1   1   0   1   1   Way 1: B
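The trace in part (a) can be cross-checked with a short simulation. This is a minimal sketch, not the course's reference code: it assumes T chooses between ways 0-3 and ways 4-7, M0/M1 choose a pair within each half, and B0-B3 choose a way within each pair, which is the arrangement implied by the next-state table in part (b). The function names `touch`, `victim`, and `simulate` are made up for illustration.

```python
# Bit order in a state tuple: (T, M0, M1, B0, B1, B2, B3).
# A bit value of 0 points "up" (toward lower-numbered ways), 1 points "down".
BLOCKS = "ABCDEFGH"  # block resident in way 0..7, as in the problem


def touch(state, way):
    """Return the PLRU state after a hit to `way`: every bit on the path
    from the root to that way is set to point away from it."""
    s = list(state)
    half, quarter, leaf = way >> 2, (way >> 1) & 1, way & 1
    s[0] = 1 - half               # T
    s[1 + half] = 1 - quarter     # M0 (ways 0-3) or M1 (ways 4-7)
    s[3 + (way >> 1)] = 1 - leaf  # B0..B3
    return tuple(s)


def victim(state):
    """Follow the bits from the root to find the pseudo-LRU way."""
    half = state[0]
    quarter = state[1 + half]
    leaf = state[3 + 2 * half + quarter]
    return 4 * half + 2 * quarter + leaf


def simulate(initial, references):
    """Return [(reference, state, victim way), ...] for a trace of hits."""
    state, rows = tuple(initial), []
    for ref in references:
        state = touch(state, BLOCKS.index(ref))
        rows.append((ref, state, victim(state)))
    return rows


if __name__ == "__main__":
    for ref, state, way in simulate((0, 0, 1, 1, 1, 0, 1), "GFCEHGDE"):
        bits = " ".join(str(b) for b in state)
        print(f"{ref}: {bits}  Way {way}: {BLOCKS[way]}")
```

Only the bits on the path to the referenced way change, so a reference to the upper half never disturbs M1, B2, or B3, and vice versa.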

b) (10 pts) Given the logic diagram above, with inputs H0-H7 that indicate a hit in Way 0 through Way 7 of this set, fill out the truth table for the next-state logic for each bit in the pseudo-LRU tree. As shown, if H0-H7 are all zero, there is no reference to this set, and no update to the PLRU state.

Input                             Next State
H0  H1  H2  H3  H4  H5  H6  H7    T   M0  M1  B0  B1  B2  B3
0   0   0   0   0   0   0   0     T   M0  M1  B0  B1  B2  B3
1   0   0   0   0   0   0   0     1   1   M1  1   B1  B2  B3
0   1   0   0   0   0   0   0     1   1   M1  0   B1  B2  B3
0   0   1   0   0   0   0   0     1   0   M1  B0  1   B2  B3
0   0   0   1   0   0   0   0     1   0   M1  B0  0   B2  B3
0   0   0   0   1   0   0   0     0   M0  1   B0  B1  1   B3
0   0   0   0   0   1   0   0     0   M0  1   B0  B1  0   B3
0   0   0   0   0   0   1   0     0   M0  0   B0  B1  B2  1
0   0   0   0   0   0   0   1     0   M0  0   B0  B1  B2  0
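The next-state table follows one rule: a hit writes the three bits on the path to the hit way so that they point away from it, and every other bit holds. The sketch below regenerates the rows symbolically under that assumption; `next_state_row` is a hypothetical helper name, and held bits are printed as their own names, as in the table.

```python
BITS = ["T", "M0", "M1", "B0", "B1", "B2", "B3"]


def next_state_row(hits):
    """Map a hit vector [H0, ..., H7] to the next value of each tree bit.

    Written bits are returned as "0" or "1"; bits that keep their old
    value are returned as their own name (for example "M1").
    """
    if sum(hits) == 0:
        return list(BITS)  # no reference to this set: hold all state
    if sum(hits) != 1:
        raise ValueError("at most one way can hit")
    way = hits.index(1)
    half, quarter, leaf = way >> 2, (way >> 1) & 1, way & 1
    row = list(BITS)
    row[0] = str(1 - half)               # T
    row[1 + half] = str(1 - quarter)     # M0 or M1
    row[3 + (way >> 1)] = str(1 - leaf)  # B0..B3
    return row


if __name__ == "__main__":
    for way in [None] + list(range(8)):
        hits = [int(i == way) for i in range(8)]
        print(" ".join(map(str, hits)), "|", " ".join(next_state_row(hits)))
```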

22. TAGE vs. Neural Branch Prediction [20 pts] The two competing state-of-the-art branch predictors today are the TAGE and Neural approaches described in lecture. In this problem, you will be comparing the two approaches for prediction in terms of implementation cost and delay. Here, both predictors use multiple identical predictor elements, including a base bimodal (Smith) predictor indexed by branch address, and multiple history tables (T0-Tn) that generate predictions based on varying global branch history length. The tables T0-Tn are indexed with a hash of the global BHR and the branch address. The key difference between these predictors is how the final prediction is determined. As shown below, the TAGE predictor relies on partial tag matches in each table and uses a priority mux to select the prediction from the longest predictor with a matching history. In contrast, the Neural predictor stores a trained multibit response in each predictor entry, and uses a series of adders to generate a final sum, which is compared to a learned threshold. The final prediction (taken or not-taken) shows up on the output at the bottom of each diagram below. 
[Figure: block diagrams of the TAGE Predictor and the Neural Predictor]

To characterize these two predictors, consider the following assumptions and parameters:
- E_b: The number of entries E_b in the bimodal table is fixed for both predictors
- E_t: The number of entries E_t in each of the tables (T0-Tn) is fixed for both predictors
- n: There are n tables T0-Tn in both designs
- The TAGE bimodal table entries are two-bit saturating counters, and are not tagged
- The TAGE T0-Tn table entries also contain two-bit saturating counters, plus a 10-bit partial tag and a 3-bit confidence counter
- The Neural bimodal and T0-Tn table entries contain 5-bit learned weights, and no tags
- The size/area of each table is determined by the total number of bits it stores (capacity)
- The access delay for each table is twice the square root of its capacity (total #bits)
- D_hash: the delay for each hash function used to access the table
- D_tag: the delay of the tag matching logic in each TAGE T0-Tn table
- D_mux: the delay of each mux in the TAGE design
- D_add: the delay of each adder in the Neural design. The threshold comparator also has the same delay, since it uses an arithmetic adder/subtractor to compare the final sum to the threshold value.

(a) Area Analysis [4 pts]: In terms of the parameters described above, write an equation for each predictor that describes the total size of its tables.

TAGE: 2 x E_b + (n+1) x E_t x (2+10+3)
Neural: 5 x E_b + (n+1) x E_t x 5

(b) Delay analysis [8 pts]: In terms of the parameters described above, derive an equation for the critical delay path for each predictor, given that the inputs (the branch address and global BHR bits) are available at t=0. You can use a max(a,b) operator if your equation needs one. Hint: first identify the critical path through each predictor.

D_bimodal_TAGE = 2 x sqrt(2 x E_b)
D_bimodal_Neural = 2 x sqrt(5 x E_b)
D_TAGE = 2 x sqrt(15 x E_t) + D_hash + D_tag
D_Neural = 2 x sqrt(5 x E_t) + D_hash

TAGE: max(D_bimodal_TAGE, D_TAGE) + (n+1) x D_mux
Neural: max(D_bimodal_Neural, D_Neural) + (n+2) x D_add
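To get a feel for the expressions in parts (a) and (b), they can be evaluated numerically. The sketch below is illustrative only: it takes each tagged TAGE entry as 2+10+3 = 15 bits, as listed in the problem's parameters, and the values plugged in at the bottom (E_b = 4096, E_t = 1024, n = 7, and the unit delays) are hypothetical, not part of the exam.

```python
from math import sqrt


def table_delay(bits):
    """Access delay of a table: twice the square root of its capacity."""
    return 2 * sqrt(bits)


def tage_bits(E_b, E_t, n):
    # 2-bit bimodal counters; tagged entries hold counter + tag + confidence.
    return 2 * E_b + (n + 1) * E_t * (2 + 10 + 3)


def neural_bits(E_b, E_t, n):
    # Every entry is a 5-bit weight.
    return 5 * E_b + (n + 1) * E_t * 5


def tage_delay(E_b, E_t, n, D_hash, D_tag, D_mux):
    bimodal = table_delay(2 * E_b)
    tagged = D_hash + table_delay(15 * E_t) + D_tag
    return max(bimodal, tagged) + (n + 1) * D_mux


def neural_delay(E_b, E_t, n, D_hash, D_add):
    bimodal = table_delay(5 * E_b)
    weights = D_hash + table_delay(5 * E_t)
    return max(bimodal, weights) + (n + 2) * D_add


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formulas.
    E_b, E_t, n = 4096, 1024, 7
    print("TAGE bits:   ", tage_bits(E_b, E_t, n))
    print("Neural bits: ", neural_bits(E_b, E_t, n))
    print("TAGE delay:  ", tage_delay(E_b, E_t, n, D_hash=2, D_tag=3, D_mux=1))
    print("Neural delay:", neural_delay(E_b, E_t, n, D_hash=2, D_add=4))
```

With these particular numbers the Neural bimodal table, not a history table, sets the Neural critical path, which is the large-E_b case that the max() term exists to capture.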

(c) Delay discussion [4 pts]: Assuming that D_add is approximately equal to (D_tag + D_mux), which predictor will have a lower access time? Justify your answer.

It is not obvious from what is given. We know that D_Neural < D_TAGE, since the Neural table is smaller and has no tag match (D_tag), but Neural has (n+2) x D_add (one extra level for the threshold comparison).
If ((D_TAGE - D_Neural) > D_add) then Neural is faster, else TAGE is faster.
However, if (D_bimodal > D_Neural) then TAGE will also be faster (unlikely, unless E_b is very large).

(d) Delay optimization [4 pts]: Is there any way you can restructure the logic shown in the diagram for either predictor to reduce the delay without changing its function or accuracy? If so, describe the change you would make and its impact on the total delay. Would this change your conclusion for part (c)?

You can use a tree adder for the Neural reduction. Now instead of (n+2) x D_add, we have log2(n+1) levels of adders plus the threshold check. E.g. for n=7 we have (3+1) x D_add for Neural, vs. 8 x (D_tag + D_mux) for TAGE, so Neural would be faster here.
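The effect of the tree restructuring in part (d) is easiest to see by counting serial adder levels. This small sketch follows the answer's own counts, (n+2) levels for the chain and log2(n+1) levels plus the threshold check for the tree; the helper names are hypothetical, and rounding up for values of n+1 that are not powers of two is an added assumption.

```python
from math import ceil, log2


def chain_levels(n):
    """Serial adder delays in the original Neural design: (n+2),
    including the final threshold comparison."""
    return n + 2


def tree_levels(n):
    """Serial adder delays with a balanced adder tree: log2(n+1) levels
    (rounded up) plus one level for the threshold comparison."""
    return ceil(log2(n + 1)) + 1


if __name__ == "__main__":
    for n in (3, 7, 15):
        print(f"n={n}: chain {chain_levels(n)} x D_add, tree {tree_levels(n)} x D_add")
```

The chain grows linearly in n while the tree grows logarithmically, so the advantage of the restructured Neural design widens as more history tables are added.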

23. Register Renaming [20 pts]

The following set of instructions is to be renamed onto a set of physical registers. The initial register map and free list are given. The first instruction has been done for you. Assume that registers are allocated from the left of the free list, and that freed registers are returned to the right of the free list.

a) Show the rename mappings after each instruction has been dispatched by filling out the table below, updating the rename mappings and free list in the table appropriately. [16 pts]

Cycle  Cycle                         Rename mappings after dispatch
disp.  comm.  Original instruction   R1   R2   R3   R4   R5   R6   R7   R8   Free list at end of dispatch cycle
-      -      Initial mappings       P1   P2   P3   P4   P5   P6   P7   P8   P9, P10, P11, P12
1      3      R5 <= R1 + R5                              P9                  P10, P11, P12
1      3      R2 <= mem(R3 + R1)          P10                                P11, P12
2      4      R1 <= R3 + 4           P11                                     P12
2      5      R7 <= R1 + R2                                        P12       <empty>
4      6      R4 <= R4 - 1                          P5                       P2, P1
4      6      R4 <= R4 + R8                         P2                       P1
6      9      R7 <= R4 - R7                                        P1        P7, P4, P5
6      9      R4 <= mem(R1 + R6)                    P7                       P4, P5
8      10     R3 <= R5 - R4                    P4                            P5

b) Does the allocator ever have to stall to wait for registers to be added to the free list? Explain why or why not. (2 pts)

No: P5 & P2 are freed in cycle 3, and P1 is freed in cycle 4, so allocation in cycle 4 is OK. Similarly, P7 is freed in cycle 5, and P4 & P5 are freed in cycle 6, so allocations in cycles 6 & 8 are OK.

c) If the design uses RAM map table checkpoints like the MIPS R10000, for up to four branches in flight, how many bits are needed in total for the current map + 4 checkpoints? (2 pts)

(1 current table + 4 ckpt tables) x (8 rows/table) x (4 bits/row) = 160 bits
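The renaming table and the bit count in part (c) can be reproduced with a small model. This is a sketch under stated assumptions rather than a description of any real machine: an instruction frees the previous mapping of its destination when it commits, registers freed in a cycle are available to instructions dispatched in that same cycle, and 4 bits per map-table row is taken to be the width needed to name one of the 12 physical registers. The names `rename`, `map_table_bits`, `PROGRAM`, `INITIAL`, and `FREE` are made up for illustration.

```python
from collections import deque
from math import ceil, log2


def rename(instructions, initial_map, free_list):
    """instructions: (dispatch cycle, commit cycle, destination register)
    in program order. Returns one (dest, new physical register, free list)
    row per instruction, with the free list as it stands right after that
    instruction has been renamed."""
    rmap = dict(initial_map)
    free = deque(free_list)
    pending = []  # (commit cycle, register to free), in program order
    rows = []
    last_cycle = max(commit for _, commit, _ in instructions)
    for cycle in range(1, last_cycle + 1):
        # Assumption: commits are processed before dispatch in each cycle.
        for commit, old in pending:
            if commit == cycle:
                free.append(old)  # freed registers return on the right
        pending = [(c, old) for c, old in pending if c != cycle]
        for dispatch, commit, dest in instructions:
            if dispatch != cycle:
                continue
            if not free:
                raise RuntimeError(f"allocator would stall in cycle {cycle}")
            new = free.popleft()  # allocate from the left
            pending.append((commit, rmap[dest]))  # old mapping freed at commit
            rmap[dest] = new
            rows.append((dest, new, list(free)))
    return rows


def map_table_bits(arch_regs, phys_regs, checkpoints):
    """Bits for the current map table plus its checkpoint copies."""
    return (1 + checkpoints) * arch_regs * ceil(log2(phys_regs))


PROGRAM = [(1, 3, "R5"), (1, 3, "R2"), (2, 4, "R1"), (2, 5, "R7"),
           (4, 6, "R4"), (4, 6, "R4"), (6, 9, "R7"), (6, 9, "R4"),
           (8, 10, "R3")]
INITIAL = {f"R{i}": f"P{i}" for i in range(1, 9)}
FREE = ["P9", "P10", "P11", "P12"]

if __name__ == "__main__":
    for dest, new, free in rename(PROGRAM, INITIAL, FREE):
        print(f"{dest} -> {new:4} free: {', '.join(free) or '<empty>'}")
    print(map_table_bits(arch_regs=8, phys_regs=12, checkpoints=4), "bits")
```

Because `rename` raises an error whenever the free list is empty at dispatch, a run that completes without an exception matches the "No" answer to part (b) for this schedule.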