CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

Similar documents
Unit 6: Branch Prediction

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction

Portland State University ECE 587/687. Branch Prediction

Fall 2011 Prof. Hyesoon Kim

ICS 233 Computer Architecture & Assembly Language

[2] Predicting the direction of a branch is not enough. What else is necessary?

4. (3) What do we mean when we say something is an N-operand machine?

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

[2] Predicting the direction of a branch is not enough. What else is necessary?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

CMP N 301 Computer Architecture. Appendix C

/ : Computer Architecture and Design

3. (2) What is the difference between fixed and hybrid instructions?

Counters. We ll look at different kinds of counters and discuss how to build them

Project Two RISC Processor Implementation ECE 485

Logic and Computer Design Fundamentals. Chapter 8 Sequencing and Control

CPSC 3300 Spring 2017 Exam 2

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

Logic Design II (17.342) Spring Lecture Outline

Branch Prediction using Advanced Neural Methods

Lecture 13: Sequential Circuits, FSM

ENEE350 Lecture Notes-Weeks 14 and 15

CPU DESIGN The Single-Cycle Implementation

Latches. October 13, 2003 Latches 1

Department of Electrical and Computer Engineering University of Wisconsin - Madison. ECE/CS 752 Advanced Computer Architecture I.

Computer Architecture

Lecture 13: Sequential Circuits, FSM

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

CSCI-564 Advanced Computer Architecture

Lecture: Pipelining Basics

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

A Novel Meta Predictor Design for Hybrid Branch Prediction

ECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)

COVER SHEET: Problem#: Points

Ch 7. Finite State Machines. VII - Finite State Machines Contemporary Logic Design 1

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Design of Sequential Circuits

Implementing the Controller. Harvard-Style Datapath for DLX

This Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example

CMP 338: Third Class

Simple Instruction-Pipelining. Pipelined Harvard Datapath

Part I: Definitions and Properties

Lecture 8: Sequential Networks and Finite State Machines

PAST EXAM PAPER & MEMO N3 ABOUT THE QUESTION PAPERS:

Computer Architecture. ECE 361 Lecture 5: The Design Process & ALU Design. 361 design.1

Models of Computation, Recall Register Machines. A register machine (sometimes abbreviated to RM) is specified by:

CS 700: Quantitative Methods & Experimental Design in Computer Science

CSE370 HW6 Solutions (Winter 2010)

Table of Content. Chapter 11 Dedicated Microprocessors Page 1 of 25

ww.padasalai.net

COE 202: Digital Logic Design Sequential Circuits Part 4. Dr. Ahmad Almulhem ahmadsm AT kfupm Phone: Office:

Automata Theory CS S-12 Turing Machine Modifications

ELCT201: DIGITAL LOGIC DESIGN

Sorting. CMPS 2200 Fall Carola Wenk Slides courtesy of Charles Leiserson with changes and additions by Carola Wenk

Sequential Logic Worksheet

Performance Prediction Models

EE40 Lec 15. Logic Synthesis and Sequential Logic Circuits

Department of Electrical and Computer Engineering The University of Texas at Austin

DE58/DC58 LOGIC DESIGN DEC 2014

Models for representing sequential circuits

Synchronous Sequential Circuit

EECS150 - Digital Design Lecture 25 Shifters and Counters. Recap

Exam for Physics 4051, October 31, 2008

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Computers also need devices capable of Storing data and information Performing mathematical operations on such data

10/12/2016. An FSM with No Inputs Moves from State to State. ECE 120: Introduction to Computing. Eventually, the States Form a Loop

Lecture 10: Synchronous Sequential Circuits Design

Review: Designing with FSM. EECS Components and Design Techniques for Digital Systems. Lec09 Counters Outline.

EECS150 - Digital Design Lecture 21 - Design Blocks

Read this before starting!

Sorting. CS 3343 Fall Carola Wenk Slides courtesy of Charles Leiserson with small changes by Carola Wenk

6.111 Lecture # 12. Binary arithmetic: most operations are familiar Each place in a binary number has value 2 n

Microprocessor Power Analysis by Labeled Simulation

Design at the Register Transfer Level

3. Complete the following table of equivalent values. Use binary numbers with a sign bit and 7 bits for the value

Pattern History Table. Global History Register. Pattern History Table. Branch History Pattern Pattern History Bits

Physics 9 WS M5 (rev. 1.0) Page 1

Linear Feedback Shift Registers (LFSRs) 4-bit LFSR

Verilog HDL:Digital Design and Modeling. Chapter 11. Additional Design Examples. Additional Figures

Computer Architecture ELEC2401 & ELEC3441

CSE 4502/5717 Big Data Analytics Spring 2018; Homework 1 Solutions

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Synchronous Sequential Circuit Design. Digital Computer Design

Unit 8: Sequ. ential Circuits

LABORATORY MANUAL MICROPROCESSOR AND MICROCONTROLLER

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

Different encodings generate different circuits

Lecture 5 - Assembly Programming(II), Intro to Digital Filters

CSE 140 Spring 2017: Final Solutions (Total 50 Points)

Memory Elements I. CS31 Pascal Van Hentenryck. CS031 Lecture 6 Page 1

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Catie Baker Spring 2015

A Second Datapath Example YH16

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

DSP Configurations. responded with: thus the system function for this filter would be

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

hexadecimal-to-decimal conversion

CPE100: Digital Logic Design I

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

CS1800: Hex & Logic. Professor Kevin Gold

Transcription:

CSE 560 Practice Problem Set 4 Solution 1. In this question, you will examine several different schemes for branch prediction, using the following code sequence for a simple load store ISA with no branch delay slot (i.e., the branch instruction effects the immediately following instruction): PC1 PC2 PC3 top: skip1: skip2: r2 r0, #45 r3 r1, #6 r4 #10000 ; initialize r2 to 101101, binary ; initialize r6 to 6,, decimal ; initialize r4 to a big number andi r1 r2, #1 ; extract the low order bit of r2 bnez r1, skip1 if the bit is set ;branch xor r0 r0, r0 ; dummy instruction srli r2 r1, #1 ; shift the pattern in r2 subi r3 r3, 1 ; decrement r3 bnez r3, skip2 r2 r0, #45 ; reinitialize pattern r3 r0, #6 subi r4 r4, 1 ; decrement loop counter bnez r4, top This sequence contains 3 branches, labeled PC1, PC2,, and PC3. The following are the steady state taken/not taken patterns for each of these branches (T indicates taken, N indicates not taken): PC1: PC2: PC3: TNTTNT TNTTNT TTTTTN TTTTTN TTTTTT TTTTTT Here is one branch predictor structure:

Table A is indexed by the program counter. Table B contains two bit saturating up down counters as described in class (state values N, n, t, T); they predict taken when the counter value is t or T. You can assume that PC1, PC2, and PC3 map to distinct slots in Table A. (a) What is the prediction success rate (that is, the ratio of correctly predicted branches to total branches) for each of the three branches (PC1, PC2, and PC3) using this predictor, in steadystate, after the loop has been running for many iterations? If you want, you may assume that all counters and histories are initialized to zero, but this should not impact your answer. The correct answer is: PC1: 100% PC2: 67% PC3: 100% To determine the above, we first determine the steady state prediction of the relevant counters in table B. The following is the history of the branches: PC1: T N T T N T T N T T N T T N T T N T PC2: T T T T T N T T T T T N T T T T T N PC3: T T T T T T T T T T T T T T T T T T Imagine a sliding window, illustrated by the box above. At some point during steady state operation, the branch history for PC1 will be TNT, for PC2 will be TTT, and for PC3 TTT. This is illustrated by the solid box, above. At this point, the next set of branch patterns is given by the next column of the chart, or T for PC1, T for PC2, and T for PC3. So we build a table that holds the branch behavior for each entry in Table B. For example, PC1 s history is TNT, and the branch is taken, so we put a T in entry 101 (decimal 5) of the table, etc.: 000 001 010 011 100 101 T 110 111 TT We now slide the window one branch to the right, indicated by the dotted box, and update the table again. This continues until we hit a repeating pattern (after we enter 18 branches into the table). The result is shown below, with the repeat pattern in italics:

000 001 010 011 NNT NNT 100 101 TTT TTT 110 TTT TTT 111 TTTTNTTTT TTTTNTTTT We now imagine each of these patterns being fed into a 2 bit saturating counter, and we observe that counter 011 will always predict not taken, counter 101 will predict taken, counter 110 will predict taken, and counter 111 will predict taken in steady state. Going back to the original branch patterns, we see that PC1 goes through the patterns TNT, NTT, TTN, with next branch behavior T, N, T, respectively. These correspond to counters 101, 011, and 110, which predict T, N, and T, respectively. These predictions match the branch behavior exactly, so PC1 is predicted with 100% accuracy. PC2 goes through the states TTT, TTT, TTT, TTN, TNT, and NTT, with next branch behavior T, T, N, T, T, T. The states correspond to entries 111, 111, 111, 110, 101, and 011, and thus predictions T, T, T, T, T, and N respectively. The third and sixth predictions don t match the actual branch behavior, so 2 out of 6 branches are mispredicted, so PC2 has a prediction accuracy of 4/6 or 67%. Finally, PC3 only has one pattern, TTT, and the next branch behavior is always T. It is thus predicted correctly 100% of the time, since entry 111 predicts T. (b) What is the overall prediction rate for the given code sequence on this predictor? Above, we saw that the branch pattern repeats every eighteen branches. Of those eighteen, two are mispredicted (the two mispredicts of PC2). Thus, the overall prediction rate is 16/18, or 89%. (c) Now let s consider another approach to branch prediction. Below is the structure of a predictor that uses 3 global history bits and a table of 2 bit saturating up down counters.

What is the overall prediction rate for the given code sequence on this predictor? We proceed as in part (a), but now we look at a global window. Here is the global branch pattern: T T T N T T T T T T T T N T T T N T This time, our sliding history windoww (the solid box) only covers three branches. In this case, with a global pattern of TTT, the next branch encountered is not taken. So we can build another table as in part (a); the dotted box shows how the window advances from the first position. After the pattern repeats (18 branches) ), the table looks like: 000 001 010 011 100 101 110 111 TTT TTT TTT TTT TTT TTT NTTTTTNNT NTTTTTNNT Counters 011, 101, and 110 will saturate to T. Counter 111 is trickier. Let s assume that the counter starts off with a value of 0 ( i.e., strong predict not taken, or N). Other starting values will produce the same steady state prediction, an exercise for the reader. Then, the counter evolves as follows: Counter value: N Prediction: N Branch behavior: N N n t T T T t n N N T T T T T N T T T T T N N T t n t T T N T T N T T T T T T t n T T T T N T T N N T t n t T N T N T T

So, what s the overall prediction rate? Of the 18 branches in the above table, those in entries 011, 101, and 110 will always predict correctly. This accounts for 9 of the branches. The other 9 branches correspond to those in the box above (for entry 111). We see that only four of those 9 entries are predicted correctly (the third, fourth, fifth, and sixth). So, out of the eighteen branches, 9 + 4 = 13 are predicted correctly, giving us a 72% overall prediction rate.

2. Use the following RISC code fragment: loop: load r1, 0(r2) (1) r1 r1, #1 (2) store 0(r2), r1 (3) r2 r2, #4 (4) sub r4 r3, r2 (5) bnez r4, loop (6) next instruction (7) You may assume that the initial value of r3 is r2 + 396. (a) Show the timing of this instruction sequence (i.e., draw a pipeline diagram) for our 5 stage pipeline without any forwarding or bypassing hardware but assuming a register read and a write in the same clock cycle forwards through the register file. Assume that the branch is handled by flushing the pipeline (use the notation f in the pipeline diagram). If all memory references hit in the cache, how many cycles does this loop take to execute? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 (1) F D X M W (2) F D d* d* X M W (3) F p* p* D d* d* X M W (4) F p* p* D X M W (5) F D d* d* X M W (6) F p* p* D d* d* X M W (7) F p* p* D f* (1) F D 17 1 cycles = 16 cycles to execute (b) Show the timing of this instruction sequence for our 5 stage pipeline with normal forwarding and bypassing hardware. Assume that the branch is handled by predicting it as not taken. If all memory references hit in the cache, how many cycles does this loop take to execute? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 (1) F D X M W (2) F D d* X M W (3) F p* D X M W (4) F D X M W (5) F D X M W (6) F D X M W (7) F D f* (1) F D 10 1 cycles = 9 cycles to execute