Portland State University ECE 587/687. Branch Prediction

Portland State University ECE 587/687 Branch Prediction Copyright by Alaa Alameldeen and Haitham Akkary 2015

Branch Penalty Example
Comparing perfect branch prediction to 90%, 95%, and 99% prediction accuracy, and to no branch prediction.
- Processor has a 15-stage, 6-wide pipeline; an incorrectly predicted branch leads to a pipeline flush
- Program can retire an average of 4 instructions per cycle, and has 100,000 conditional branches out of 1 million instructions
- Perfect BP: IPC = 1,000,000/250,000 = 4.00
- 90% BP accuracy: 1/10 branches incorrectly predicted
  IPC = 1,000,000/(250,000 + 0.1 x 100,000 x 15) = 2.50 (60% slower)
- 95% BP accuracy: 1/20 branches incorrectly predicted
  IPC = 1,000,000/(250,000 + 0.05 x 100,000 x 15) = 3.08 (30% slower)
- 99% BP accuracy: 1/100 branches incorrectly predicted
  IPC = 1,000,000/(250,000 + 0.01 x 100,000 x 15) = 3.77 (6% slower)
- No BP: fetch stalled until the branch is resolved (4 pipeline stages)
  IPC = 1,000,000/(250,000 + 100,000 x 4) = 1.54 (160% slower)
"Slower" here is the increase in total cycles relative to perfect BP (for example, 400,000 cycles versus 250,000 is 60% more).
Portland State University ECE 587/687 Spring 2015
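The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `ipc` and its parameter names are hypothetical, and the model is the one used on the slide (base cycles plus a fixed stall per penalized branch).

```python
def ipc(instructions, base_ipc, branches, penalized_fraction, penalty_cycles):
    """IPC when a fraction of branches each cost a fixed number of stall cycles."""
    base_cycles = instructions / base_ipc
    stall_cycles = branches * penalized_fraction * penalty_cycles
    return instructions / (base_cycles + stall_cycles)

INSTRUCTIONS = 1_000_000
BRANCHES = 100_000

# Misprediction rate -> IPC, with a 15-cycle flush per misprediction.
predicted = {
    rate: ipc(INSTRUCTIONS, 4, BRANCHES, rate, 15)
    for rate in (0.0, 0.10, 0.05, 0.01)
}

# No prediction: every branch stalls fetch for 4 cycles.
no_prediction = ipc(INSTRUCTIONS, 4, BRANCHES, 1.0, 4)

for rate, value in predicted.items():
    print(f"misprediction rate {rate:.2f}: IPC = {value:.2f}")
print(f"no prediction: IPC = {no_prediction:.2f}")
```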

Reducing Branch Costs with Dynamic Hardware Prediction
Branch prediction basics:
- We need to predict the conditional branch outcome to select the address for the next instruction fetch
  - PC + 4
  - Or the branch target address
- We also need to quickly determine the branch target address
  - Direct branches
  - Register-indirect branches
  - Returns

Predicting Conditional Branch Outcomes
- The simplest dynamic branch prediction scheme uses a branch-prediction buffer, or branch history table
  - Small memory indexed by the lower portion of the branch address
  - Stores previous branch outcomes to predict the next outcome
  - Table is not tagged: the prediction may have been put in the entry by a different branch (aliasing)
- Figure: PC[11:2] indexes a 1K-entry prediction buffer
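The indexing and the aliasing problem can be illustrated with a short sketch. It assumes 4-byte-aligned instructions, so PC bits [1:0] are dropped and bits [11:2] select one of 1024 entries; the name `bht_index` and the example addresses are hypothetical.

```python
BHT_ENTRIES = 1024  # 1K entries, so 10 index bits

def bht_index(pc):
    """Select PC[11:2]: drop the 2 byte-offset bits, keep the next 10 bits."""
    return (pc >> 2) & (BHT_ENTRIES - 1)

# Two different branches whose addresses differ only above bit 11
# land in the same untagged entry, so they share one prediction.
branch_a = 0x1000
branch_b = 0x2000
print(bht_index(branch_a), bht_index(branch_b))

# A neighboring instruction uses the next entry.
print(bht_index(0x1004))
```

Because the table holds no tags, nothing in the entry reveals which of the two branches wrote the prediction last.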

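Predicting Conditional Branch Outcomes
- A 1-bit prediction buffer stores the last executed branch outcome and uses it to predict the next outcome
  - If bit = 1, the branch is predicted taken
  - If bit = 0, the branch is predicted not-taken
- A simple 1-bit scheme may not perform well
- Example: below is a series of branch outcomes and the corresponding predictions. Each prediction is the previous outcome, so the first outcome has no prediction listed. A 1 in the last row marks a misprediction.
  outcomes        1111011110111101
  predictions      111101111011110
  mispredictions   000110001100011
  Every not-taken outcome causes two mispredictions (6 of 15 in total): one on the not-taken branch itself and one on the taken branch that follows.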
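A small simulation makes the 1-bit behavior concrete. This is a sketch under one assumption: the first outcome only trains the bit and is not itself predicted, which matches the 15 predictions listed for 16 outcomes. The function name is hypothetical.

```python
def simulate_one_bit(outcomes):
    """Predict each outcome (after the first) as a repeat of the previous one."""
    bit = int(outcomes[0])          # first outcome trains the entry
    predictions, misses = [], []
    for ch in outcomes[1:]:
        actual = int(ch)
        predictions.append(str(bit))
        misses.append("1" if bit != actual else "0")
        bit = actual                # remember only the last outcome
    return "".join(predictions), "".join(misses)

one_bit_predictions, one_bit_misses = simulate_one_bit("1111011110111101")
print(one_bit_predictions)
print(one_bit_misses, one_bit_misses.count("1"))
```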

Predicting Conditional Branch Outcomes
- A 2-bit saturating counter is often used
  - Branch taken ==> increment state
    - Max state 11 stays at 11 when incremented
  - Branch not-taken ==> decrement state
    - Min state 00 stays at 00 when decremented
  - 11 and 10 are predict-taken states
  - 00 and 01 are predict-not-taken states

2-bit Saturating Counter State Machine
  State  Prediction         Next state on T  Next state on NT
  11     Predict taken      11               10
  10     Predict taken      11               01
  01     Predict not taken  10               00
  00     Predict not taken  01               00

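Predicting Conditional Branch Outcomes
- Assuming the initial state is 11, i.e., 3, branch outcomes and the corresponding predictions now look as follows. Each state is the counter value used for the prediction below it, and a 1 in the last row marks a misprediction.
  outcomes        1111011110111101
  states           333323333233332
  predictions      111111111111111
  mispredictions   000100001000010
  Each not-taken outcome now costs a single misprediction (3 of 15 in total), half as many as the 1-bit scheme on the same sequence.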
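The same sequence can be run through a 2-bit saturating counter. This sketch assumes, as in the listed states, that the first outcome only updates the counter and that predictions start with the second outcome; function names are hypothetical.

```python
def update_counter(state, taken):
    """Saturating update: increment on taken, decrement on not-taken, clamp to 0..3."""
    return min(state + 1, 3) if taken else max(state - 1, 0)

def simulate_two_bit(outcomes, initial_state=3):
    state = update_counter(initial_state, outcomes[0] == "1")
    states, predictions, misses = [], [], []
    for ch in outcomes[1:]:
        taken = ch == "1"
        predict_taken = state >= 2      # states 10 and 11 predict taken
        states.append(str(state))
        predictions.append("1" if predict_taken else "0")
        misses.append("1" if predict_taken != taken else "0")
        state = update_counter(state, taken)
    return "".join(states), "".join(predictions), "".join(misses)

two_bit_states, two_bit_predictions, two_bit_misses = simulate_two_bit("1111011110111101")
print(two_bit_states)
print(two_bit_predictions)
print(two_bit_misses, two_bit_misses.count("1"))
```

The extra bit acts as hysteresis: a single not-taken outcome moves the counter from 3 to 2, which still predicts taken.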

Correlating Branch Predictors
- 2-bit prediction schemes use the recent behavior of a single branch to predict the future behavior of that branch
- The behavior of a longer sequence of branch execution history often provides a more accurate prediction
- The behavior of other branches, rather than just the branch we are trying to predict, is sometimes important
  - Because outcomes of different branches often correlate
  - Global branch history
- For some branches, the prior execution history of the branch itself is important
  - Because of loops
  - Local branch history

Correlating Branch Predictors: Code Example
Source code:
    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb) {
Assembly (aa in R1, bb in R2):
        DSUBUI R3,R1,#2
        BNEZ   R3,L1
        DADD   R1,R0,R0
    L1: DSUBUI R3,R2,#2
        BNEZ   R3,L2
        DADD   R2,R0,R0
    L2: DSUBU  R3,R1,R2
        BEQZ   R3,L3
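The correlation in this example can be checked by brute force. The sketch below models the three branches in Python; the labels b1, b2, and b3 for the BNEZ, BNEZ, and BEQZ instructions are hypothetical names introduced here. It shows that whenever the first two branches are both not taken, the third is always taken, which a predictor looking only at the third branch's own history cannot exploit.

```python
def branch_outcomes(aa, bb):
    """Return (b1_taken, b2_taken, b3_taken) for the code fragment."""
    b1_taken = aa != 2          # BNEZ R3,L1 skips "aa = 0"
    if not b1_taken:
        aa = 0
    b2_taken = bb != 2          # BNEZ R3,L2 skips "bb = 0"
    if not b2_taken:
        bb = 0
    b3_taken = aa == bb         # BEQZ R3,L3 is taken when aa == bb
    return b1_taken, b2_taken, b3_taken

# Enumerate a small range of input values.
all_cases = [branch_outcomes(aa, bb) for aa in range(5) for bb in range(5)]

# Keep only the cases where both earlier branches fell through.
both_not_taken = [case for case in all_cases if not case[0] and not case[1]]
print(both_not_taken)
```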

[Figure: Correlating Branch Predictor with 2-bit Global History Register]

Two-Level Adaptive Branch Prediction
- Two main structures
  - Branch History Register (BHR) or Branch History Table (BHT)
  - Pattern History Table (PHT)
- Basic structure of the branch predictor: Yeh & Patt Figure 1
- Updating predictions using automatons: Yeh & Patt Figure 2
- Three different flavors (Yeh & Patt Figure 3)
  - Global History Register and Global Pattern History Table (GAg)
  - Per-address Branch History Table and Global Pattern History Table (PAg)
  - Per-address Branch History Table and Per-address Pattern History Tables (PAp)
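The GAg flavor can be sketched in a few lines: one global history register selects an entry in one global pattern history table. This is an illustrative sketch, not the configuration from the paper; the class name, the 4-bit history length, the initial counter value of 2, and the use of 2-bit saturating counters as the pattern automaton are all assumptions.

```python
class GAgPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.bhr = 0                              # global branch history register
        self.pht = [2] * (1 << history_bits)      # 2-bit counters, start weakly taken

    def predict(self):
        return self.pht[self.bhr] >= 2

    def update(self, taken):
        counter = self.pht[self.bhr]
        self.pht[self.bhr] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Shift the newest outcome into the history, keeping history_bits bits.
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask

def run(predictor, outcomes):
    """Return a list of booleans, True where the prediction was wrong."""
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict() != taken)
        predictor.update(taken)
    return misses

# A loop branch: taken three times, then not taken, repeated.
loop_pattern = [True, True, True, False] * 40
gag_misses = run(GAgPredictor(history_bits=4), loop_pattern)
print(sum(gag_misses), sum(gag_misses[-100:]))
```

Once the history is long enough to identify the position within the loop, the loop-exit branch is predicted correctly, which a single 2-bit counter cannot do.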

Two-Level Adaptive Branch Prediction: Discussion
- Cost-effectiveness of the three flavors
  - GAg has too much branch interference, needs long history
  - PAp needs lots of space for per-address PHTs
  - PAg is the most cost-effective
- Context switch
  - GAg almost unaffected
  - PAg, PAp degraded
  - Pros and cons of saving branch history on a context switch?

Other Branch Prediction Strategies
- McFarling's paper:
  - Bimodal Predictor: Figure 1
  - PAg and GAg: Figures 4 and 6
  - Global Predictor with Index Selection: Figure 8
  - Global History with Index Sharing (GShare): Figure 10
- Using perceptrons instead of 2-bit saturating counters
  - Jimenez & Lin's paper (skim, not in exam)
  - Provides higher prediction accuracy
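Index sharing can be sketched as follows: the global history is XORed with low-order branch address bits to form the table index. This is a minimal sketch, assuming a 10-bit index, 4-byte-aligned instructions, and counters initialized to 2; the class name, sizes, and example address are hypothetical, not values from McFarling's paper.

```python
class GsharePredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global history register
        self.counters = [2] * (1 << index_bits)   # 2-bit saturating counters

    def _index(self, pc):
        # XOR the branch address bits with the global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# One branch that alternates taken / not taken.
gshare = GsharePredictor()
branch_pc = 0x4000
gshare_misses = []
for step in range(200):
    taken = step % 2 == 0
    gshare_misses.append(gshare.predict(branch_pc) != taken)
    gshare.update(branch_pc, taken)
print(sum(gshare_misses[-100:]))
```

Mixing address bits into the index lets different branches with the same recent global history use different counters, which reduces interference compared with indexing by history alone.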

Adaptively Combining Branch Predictors
- Some branches are predicted more accurately with global predictors
- Other branches are predicted better with local predictors
- It is possible to combine both types of predictors and dynamically select the right predictor for the right branch
- The selector is yet another predictor, with a 2-bit state machine per entry
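The selector can be sketched as a table of 2-bit counters that moves toward whichever component predictor was right when the two disagree. This is an illustrative sketch: the class name, the table size, the PC-based indexing, and the convention that counter values 2 and 3 mean "use the global predictor" are assumptions.

```python
class Selector:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries     # >= 2: prefer global, < 2: prefer local

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def choose(self, pc, global_prediction, local_prediction):
        prefer_global = self.counters[self._index(pc)] >= 2
        return global_prediction if prefer_global else local_prediction

    def update(self, pc, global_prediction, local_prediction, taken):
        if global_prediction == local_prediction:
            return                        # no information when both agree
        i = self._index(pc)
        if global_prediction == taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)

selector = Selector()
pc = 0x8000
# The global predictor says not-taken, the local one says taken; the branch is taken.
before = selector.choose(pc, global_prediction=False, local_prediction=True)
selector.update(pc, global_prediction=False, local_prediction=True, taken=True)
after = selector.choose(pc, global_prediction=False, local_prediction=True)
print(before, after)
```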

State of the Art Branch Prediction: TAGE
- TAgged GEometric history length branch prediction (Seznec & Michaud, JILP 2006; skim)
- Variations of TAGE won the last three branch prediction championships (CBP)
- History is tagged
  - Avoids aliasing (a branch predicted using another branch's history)
- History length is geometric
  - Hard-to-predict branches use longer history than more predictable branches
- Paper Figure 1
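"Geometric" means the history lengths used by successive tagged tables grow by a constant ratio. The sketch below only illustrates that series; the function name and the parameter values (shortest length 4, ratio 2, five tables) are hypothetical and are not taken from the paper.

```python
def geometric_history_lengths(shortest, ratio, num_tables):
    """History length for each table: shortest * ratio**i, rounded to an integer."""
    return [int(shortest * ratio ** i + 0.5) for i in range(num_tables)]

history_lengths = geometric_history_lengths(shortest=4, ratio=2.0, num_tables=5)
print(history_lengths)
```

A few tables therefore cover both very short and very long histories, so easy branches can be served by short-history tables while hard ones fall back on longer history.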

Branch Target Buffer (BTB)
- A cache that stores branch targets
- Accessed by the address of the instruction currently being fetched
- Allows the branch target to be read in the IF stage
- When a branch is predicted taken, the fetch of the instruction at the branch target address can proceed immediately in the next cycle
- Stall cycles that would have been needed to wait for the decoding of the branch and the computation of the target are saved
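A BTB lookup during fetch can be sketched as a small tagged, direct-mapped table. This is a minimal sketch: the class and function names, the 512-entry size, and the use of the full PC as the tag are assumptions made for illustration.

```python
class BranchTargetBuffer:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = [None] * entries     # each entry: (tag, target) or None

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the stored target if this PC hits in the BTB, else None."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)

def next_fetch_pc(btb, pc, predicted_taken):
    """Pick the next fetch address in the IF stage."""
    target = btb.lookup(pc)
    if target is not None and predicted_taken:
        return target                     # redirect fetch without waiting for decode
    return pc + 4                         # fall through to the sequential address

btb = BranchTargetBuffer()
btb.update(0x1000, 0x2000)                # a taken branch at 0x1000 was resolved earlier
print(hex(next_fetch_pc(btb, 0x1000, predicted_taken=True)))
print(hex(next_fetch_pc(btb, 0x1000, predicted_taken=False)))
print(hex(next_fetch_pc(btb, 0x1004, predicted_taken=True)))
```

The tag check matters here: unlike the untagged prediction buffer, a BTB hit on the wrong instruction would redirect fetch to an unrelated address.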

[Figure: Branch Target Buffer]

Predicting Return Address Using Return Address Stack (RAS)
- Indirect branches have multiple potential targets, since the address comes from a register, which can have many possible values
- Branch target buffers could be used for indirect branch target prediction
- However, many mispredictions can happen because the BTB can store only one target per branch
- Most indirect branches come from return instructions

Return Address Stack
- A small address buffer organized as a stack
- When a Call is encountered, the return address (which is Call address + 4) is pushed onto the RAS
- When a Return instruction is encountered, the address at the top of the RAS is popped and used as the target
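The push and pop behavior can be sketched as a bounded stack. This is an illustrative sketch: the class name, the depth of 16, and the policy of discarding the oldest entry on overflow are assumptions, not details from the slides.

```python
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        """Push the return address, i.e. the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # assumed policy: drop the oldest entry
        self.stack.append(call_pc + 4)

    def on_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)                       # outer call
ras.on_call(0x2000)                       # nested call
inner_return = ras.on_return()
outer_return = ras.on_return()
print(hex(inner_return), hex(outer_return))
```

Because calls and returns nest, the stack gives the right target even when the same function is called from many sites, which is the case a single-target BTB entry handles poorly.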

Reading Assignment
- Wednesday
  - SimpleScalar technical report (Read)
  - Tutorial (Skim)
  - Daniel Jimenez and Calvin Lin, "Dynamic Branch Prediction with Perceptrons," HPCA 2001 (Skim)
  - Andre Seznec and Pierre Michaud, "A Case for (Partially) TAgged GEometric History Length Branch Prediction," JILP 2006 (Skim)
  - No reviews due
- Monday
  - Eric Rotenberg et al., "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," MICRO 1996 (Review)