Fast Path-Based Neural Branch Prediction

Fast Path-Based Neural Branch Prediction
Daniel A. Jiménez
http://camino.rutgers.edu
Department of Computer Science, Rutgers, The State University of New Jersey

Overview
- The context: microarchitecture
- Branch prediction
- Neural branch prediction
- The problem: it's too slow!
- The solution: a path-based neural branch predictor
- Results and analysis
- Conclusions

The Context
- I'll be discussing the implementation of microprocessors: microarchitecture
- I study deeply pipelined, high clock frequency CPUs
- The goal is to improve performance: make the program go faster
- How can we exploit program behavior to make it go faster?
  - Remove control dependences
  - Increase instruction-level parallelism

How an Instruction is Processed
Processing can be divided into several stages:
- Instruction fetch
- Instruction decode
- Execute
- Memory access
- Write back

Instruction-Level Parallelism
To speed up the process, pipelining overlaps the execution of multiple instructions, exploiting parallelism between instructions.
[Figure: pipeline diagram with the stages instruction fetch, instruction decode, execute, memory access, write back]

Control Hazards: Branches
Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later.
[Figure: pipeline diagram highlighting a branch instruction]

Pipelining and Branches
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.
[Figure: pipeline diagram with an unresolved branch instruction leaving stages idle]

Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
[Figure: pipeline diagram showing speculative execution past a branch]
Branch predictors must be highly accurate to avoid mispredictions!

Branch Predictors Must Improve
- The cost of a misprediction is proportional to pipeline depth
- As pipelines deepen, we need more accurate branch predictors
  - The Pentium 4 pipeline has 20 stages
  - Future pipelines will have more than 32 stages
- Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage
- Decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline (simulations with SimpleScalar/Alpha)
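To get a feel for why a few percentage points matter this much, here is a back-of-envelope CPI model (my own illustration, not from the talk; the branch frequency, misprediction penalty, and base CPI are assumed values):

```python
# Rough model: CPI = CPI_base + f_branch * mispredict_rate * penalty.
# Assumed values (hypothetical): 20% branches, ~20-cycle resolution penalty,
# base CPI of 0.5 for a wide-issue machine.
f_branch, penalty, cpi_base = 0.20, 20, 0.5

def cpi(mispredict_rate):
    return cpi_base + f_branch * mispredict_rate * penalty

speedup = cpi(0.09) / cpi(0.04)    # execution-time ratio at 9% vs. 4%
print(f"speedup: {speedup:.2f}x")  # ~1.30x, in the ballpark of the quoted 31%
```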

Branch Prediction Background
- The basic mechanism: 2-level adaptive prediction [Yeh & Patt '91]
  - Uses correlations between branch history and outcome
- Examples:
  - gshare [McFarling '93]
  - agree [Sprangle et al. '97]
  - hybrid predictors [Evers et al. '96]
- This scheme is highly accurate in practice
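For reference, here is a minimal sketch of the gshare scheme mentioned above; the table size and index hashing are my own simplification:

```python
# Minimal gshare sketch: 2-bit saturating counters indexed by
# (branch PC XOR global history). Sizes are illustrative.
TABLE_BITS = 12
MASK = (1 << TABLE_BITS) - 1
table = [1] * (1 << TABLE_BITS)  # counters start weakly not-taken
ghr = 0                          # global history register

def predict(pc):
    return table[((pc >> 2) ^ ghr) & MASK] >= 2  # {2,3} means predict taken

def update(pc, taken):
    global ghr
    idx = ((pc >> 2) ^ ghr) & MASK
    table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
    ghr = ((ghr << 1) | int(taken)) & MASK       # shift outcome into history
```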

Neural Branch Prediction
- Observation: branch prediction is a machine learning problem
- The perceptron predictor [Jiménez & Lin 2001 (HPCA), 2003 (TOCS)]
  - A novel branch predictor based on neural learning
  - Able to exploit longer histories than most 2-level schemes
  - High accuracy, but high delay
- Overriding was proposed as a solution to the delay problem, but it does not scale [Jiménez 2003 (HPCA)]

Branch-Predicting Perceptron
- Inputs (the x's) come from the branch history register
- Weights (the w's) are small integers learned by on-line training
- The output y gives the prediction: the dot product of the x's and w's
- Training finds correlations between history and outcome
- w_0 is the bias weight, learning only the bias of the branch

Prediction Algorithm
- h is the history length
- W[0..n-1, 0..h] is a table of perceptrons (weight vectors); weights are 8-bit integers
- W[i, 0..h] is the i'th perceptron
- G[1..h] is a global history shift register
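A software sketch of this prediction step (the talk presents it as hardware; treating a taken history bit as +1 and a not-taken bit as -1 follows the perceptron predictor papers):

```python
# Perceptron prediction sketch: y = bias + sum of W[i][j], signed by whether
# history bit j was taken; predict taken when y >= 0.
def perceptron_predict(W, G, pc, n, h):
    i = pc % n                  # select a perceptron by branch address
    y = W[i][0]                 # start from the bias weight
    for j in range(1, h + 1):   # G[1..h] holds the history bits (G[0] unused)
        y += W[i][j] if G[j] else -W[i][j]
    return y >= 0, y            # prediction, plus raw output for training
```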

Update Algorithm
- Strengthens correlations between branch history and outcome
- Highly parallel
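A matching sketch of the update rule (the training-threshold test, theta, comes from the published perceptron predictor; saturation of the weights to the 8-bit range is omitted for brevity):

```python
# Perceptron training sketch: on a misprediction, or when |y| is below the
# threshold theta, nudge each weight toward the observed correlation.
def perceptron_train(W, G, i, y, outcome, h, theta):
    if (y >= 0) != outcome or abs(y) <= theta:
        W[i][0] += 1 if outcome else -1   # bias weight tracks the branch
        for j in range(1, h + 1):
            # reward agreement between history bit j and the outcome
            W[i][j] += 1 if G[j] == outcome else -1
```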

Perceptron Predictor Accuracy
Very accurate.

The Problem: It's Too Slow!
- Example output computation: 12 weights, a Wallace tree of depth 6 followed by a 14-bit carry-lookahead adder
- Carry-save adders have O(1) depth; the carry-lookahead adder has O(log n) depth
- Delay can be 7 cycles for long histories

Impact of Delay
- Even using a latency-mitigation technique, performance suffers
- Configuration: 32-stage pipeline, 8 FO4 clock period, overriding 2K-entry bimodal predictor

Solution: A Path-Based Neural Predictor
- Instead of computing the prediction all at once...
- ...stagger the computation in time, using weights found along the path to the branch being predicted [Jiménez 2003 (MICRO)]

Intuitive Description
The neuron for branch b_t has weights x_0 through x_7 in both predictors.
[Figure: two diagrams, the original perceptron predictor and the path-based neural predictor]

Algorithm Overview
Components of the algorithm: prediction, speculative update, and non-speculative update (training).
Terms:
- h, n, W as before
- SR is a shift vector of integers that accumulates partial sums; SR[j] holds the sum for the (h-j)'th branch in the future
- SG is the speculative global history
- R and G are non-speculative versions of SR and SG

Prediction Algorithm
- i = branch PC mod n
- y = SR[h] + W[i,0], i.e., the final computation in the dot product
- If y >= 0, predict taken; otherwise predict not taken
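In software, the prediction itself reduces to a single add (a sketch using the slides' names):

```python
# Path-based prediction sketch: SR[h] already holds the partial sum
# accumulated along the path, so only the bias weight remains to be added.
def path_predict(SR, W, pc, n, h):
    i = pc % n
    y = SR[h] + W[i][0]       # final step of the staggered dot product
    return y >= 0, i, y
```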

Speculative Update Algorithm
Update each partial sum in SR in parallel:
    for j in 1..h in parallel do
        SR'[h-j+1] := SR[h-j] + (prediction ? W[i,j] : -W[i,j])
    end for
    SR := SR'
Speculatively update SG with the prediction.
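The same step written sequentially (in hardware all h additions happen in parallel; a backup slide near the end notes that zeros enter the pipeline of partial sums):

```python
# Speculative update sketch: shift the pipeline of partial sums, folding this
# branch's weights into the sums of the next h predictions.
def speculative_update(SR, W, i, prediction, h):
    new_SR = [0] * (h + 1)    # a fresh, zero partial sum enters at SR[0]
    for j in range(1, h + 1):
        new_SR[h - j + 1] = SR[h - j] + (W[i][j] if prediction else -W[i][j])
    return new_SR
```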

Non-Speculative Update / Training
- Non-speculatively update R using the outcome
- Let H be the history used to predict this branch
- Train the bias weight for this branch: if outcome = taken then W[i,0]++ else W[i,0]--
- Train the rest of the weights based on their correlation with history:
    for j in 1..h in parallel do
        let k_j be the value of i from j branches ago
        if H[j] = outcome then W[k_j, j]++ else W[k_j, j]--
    end for
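The training rule in the same sketch form; `path` is my name for the recorded k_j values, and saturation of the 8-bit weights is again omitted:

```python
# Path-based training sketch: weights are updated in the rows of the branches
# along the path, not only in this branch's own row.
def path_train(W, H, path, i, outcome, h):
    W[i][0] += 1 if outcome else -1      # bias weight
    for j in range(1, h + 1):
        k = path[j]                      # k_j: table index from j branches ago
        W[k][j] += 1 if H[j] == outcome else -1
```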

Experimental Evaluation
- Used a modified version of SimpleScalar/Alpha 3.0: 8-wide processor, 32-stage pipeline
- Used HSPICE + CACTI 3.0 for delay estimates
- Compared against overriding versions of:
  - 2Bc-gskew [Seznec et al. '02]
  - Fixed-length path predictor [Stark et al. '98]
  - Global/local perceptron predictor [Jiménez & Lin '03]
- Also used a single-cycle pipelined gshare [McFarling '93], [Jiménez '03]
- Measured misprediction rates and instructions per cycle (IPC)
- Benchmarks: all 12 SPECint2000, plus 5 from SPEC '95 not duplicated in 2000

Results: Accuracy
The path-based neural predictor is the most accurate (of course).

Results: IPC
The path-based neural predictor yields the best performance.

IPC Per Benchmark: 8KB Budget
Best on 15 of 17 benchmarks.

Analysis: Linear Separability
- Perceptrons can learn only linearly separable functions well
- But half of all branches are linearly inseparable! For example, a branch whose outcome is the XOR of two history bits cannot be matched by any single choice of perceptron weights.
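A quick brute-force check of that XOR claim (my illustration, not from the talk): no weight vector lets a single perceptron compute the XOR of two ±1 history bits, while a simple "follow one bit" function is easy.

```python
# Search small integer weights for sign(w0 + w1*x1 + w2*x2) == f(x1, x2)
# over all four combinations of bipolar history bits.
import itertools

def separable(f):
    for w0, w1, w2 in itertools.product(range(-3, 4), repeat=3):
        if all((w0 + w1*x1 + w2*x2 >= 0) == f(x1, x2)
               for x1 in (-1, 1) for x2 in (-1, 1)):
            return True
    return False

print(separable(lambda x1, x2: x1 != x2))  # XOR of the two bits -> False
print(separable(lambda x1, x2: x1 == 1))   # "follow bit 1" -> True
```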

Analysis, continued
- Almost all mispredictions come from inseparable branches!
- The path-based neural predictor can predict these branches well
[Figure legend: n = path-based neural, p = perceptron, g = gshare; history length = 10]

Conclusions & Future Work
- Neural branch predictors are a viable technology
- Neural predictors offer performance beyond traditional counter-based schemes
- The design space for path-based neural predictors can be further explored, e.g.:
  - Global/local
  - Static/dynamic
  - Overriding with partial sums
- Applications beyond branch prediction...

The End

Prediction Algorithm
- SR[0..h] is a vector of h+1 integers that holds the speculative partial sums
- Think of SR[] as a pipeline of partial sums: zeros go in, and the dot products of the weights and history bits come out (minus the bias weights)

Update Algorithm
- Maintains non-speculative partial sums for misprediction recovery
- Increments or decrements weights corresponding to neurons with positive or negative correlation

History Length Tuning
- Tuned the predictors for optimal history lengths
- Path-based histories were shorter than the perceptron predictor's

Delay Estimates
- Used HSPICE and CACTI 3.0 to estimate latencies for the predictors
- Based on 90 nm technology and an aggressive clock period of 8 fan-out-of-4 (FO4) inverter delays

Intuitive Description, continued
Weights are chosen ahead of time, based on the path leading to branch b_t (x_0 is the bias weight).

Neural Branch Predictors Are Feasible
- Most branch predictors are based on tables of two-bit counters
- Neural predictors use a superior prediction technology
  - Accuracy is better than table-based approaches
  - However, the latency of the computation makes them impractical
- We propose a new algorithm for computing the neural prediction
  - Almost all work is done ahead of time, so latency is greatly improved
  - It incorporates path information, so accuracy is improved
  - It yields a speedup of 16% over the perceptron predictor and 4% over 2Bc-gskew

Branch Prediction is a Machine Learning Problem
- So why not apply a machine learning algorithm?
- Replace 2-bit counters with a more accurate predictor
- Tight constraints on the prediction mechanism: it must be fast and small enough to work as a component of a microprocessor
- Artificial neural networks:
  - A simple model of the neural networks in brain cells
  - They learn to recognize and classify patterns
  - Most neural nets are slow and complex relative to tables
- For branch prediction, we need a small and fast neural method

Accuracy Per Benchmark: 8KB Budget
Best on 14 of the 17 benchmarks, not counting the perceptron predictor.