Fast Path-Based Neural Branch Prediction Daniel A. Jiménez http://camino.rutgers.edu Department of Computer Science Rutgers, The State University of New Jersey
Overview The context: microarchitecture Branch prediction Neural branch prediction The problem: It's too slow! The solution: A path-based neural branch predictor Results and analysis Conclusions 2
The Context I'll be discussing the implementation of microprocessors Microarchitecture I study deeply pipelined, high clock frequency CPUs The goal is to improve performance Make the program go faster How can we exploit program behavior to make it go faster? Remove control dependences Increase instruction-level parallelism 3
How an Instruction is Processed Processing can be divided into several stages: Instruction fetch Instruction decode Execute Memory access Write back 4
Instruction-Level Parallelism To speed up the process, pipelining overlaps execution of multiple instructions, exploiting parallelism between instructions Instruction fetch Instruction decode Execute Memory access Write back 5
Control Hazards: Branches Conditional branches create a problem for pipelining: the next instruction can't be fetched until the branch has executed, several stages later. Branch instruction 6
Pipelining and Branches Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle. Instruction fetch Instruction decode Execute Memory access Write back Unresolved branch instruction 7
Branch Prediction A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path. Instruction fetch Instruction decode Execute Memory access Write back Speculative execution Branch predictors must be highly accurate to avoid mispredictions! 8
Branch Predictors Must Improve The cost of a misprediction is proportional to pipeline depth As pipelines deepen, we need more accurate branch predictors Pentium 4 pipeline has 20 stages Future pipelines will have > 32 stages Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage Decreasing misprediction rate from 9% to 4% results in 31% speedup for a 32-stage pipeline Simulations with SimpleScalar/Alpha 9
Branch Prediction Background The basic mechanism: 2-level adaptive prediction [Yeh & Patt '91] Uses correlations between branch history and outcome Examples: gshare [McFarling '93] agree [Sprangle et al. '97] hybrid predictors [Evers et al. '96] This scheme is highly accurate in practice 10
Neural Branch Prediction Observed that branch prediction is a machine learning problem The perceptron predictor [Jiménez & Lin 2001 (HPCA), 2003 (TOCS)] A novel branch predictor based on neural learning Able to exploit longer histories than most 2-level schemes High accuracy, but high delay Overriding was proposed as a solution to the delay problem, but it does not scale [Jiménez, 2003 (HPCA)] 11
Branch-Predicting Perceptron Inputs (x's) are from the branch history register Weights (w's) are small integers learned by on-line training Output (y) gives the prediction; the dot product of the x's and w's Training finds correlations between history and outcome w0 is the bias weight, learning only the bias of the branch 12
Prediction Algorithm h is the history length W[0..n-1, 0..h] is a table of perceptrons (weight vectors) Weights are 8-bit integers W[i, 0..h] is the i'th perceptron G[1..h] is a global history shift register 13
Update Algorithm Strengthens correlations between branch history and outcome Highly parallel 14
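The prediction and update algorithms on the last two slides can be sketched in software. This is a minimal Python model for illustration only, not the hardware: the table size and history length are arbitrary, and the training threshold THETA follows the tuning published with the perceptron predictor (roughly 1.93h + 14).

```python
H = 8                                   # history length h (illustrative)
N = 64                                  # number of perceptrons n (illustrative)
W = [[0] * (H + 1) for _ in range(N)]   # W[i][0] is the bias weight
G = [0] * H                             # global history shift register, 1 = taken
THETA = int(1.93 * H + 14)              # training threshold from the published tuning

def clamp8(w):
    return max(-128, min(127, w))       # weights are 8-bit signed integers

def output(i):
    # dot product of the history (as +/-1 inputs) and the weights, plus bias
    return W[i][0] + sum(W[i][j + 1] if G[j] else -W[i][j + 1] for j in range(H))

def predict(pc):
    return output(pc % N) >= 0          # True = predict taken

def train(pc, outcome):
    """outcome: True if the branch was actually taken."""
    i = pc % N
    y = output(i)
    # train on a misprediction, or when the output is not yet confident
    if (y >= 0) != outcome or abs(y) <= THETA:
        W[i][0] = clamp8(W[i][0] + (1 if outcome else -1))
        for j in range(H):
            # strengthen the weight when history bit j agrees with the outcome
            if (G[j] == 1) == outcome:
                W[i][j + 1] = clamp8(W[i][j + 1] + 1)
            else:
                W[i][j + 1] = clamp8(W[i][j + 1] - 1)
    G.insert(0, 1 if outcome else 0)    # shift the outcome into the history
    G.pop()
```

Repeatedly training on an always-taken branch drives its bias weight positive, which is the "learning only the bias of the branch" behavior described for w0.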
Perceptron Predictor Accuracy Very accurate 15
The Problem: It's Too Slow! Example output computation: 12 weights, Wallace tree of depth 6 followed by a 14-bit carry-lookahead adder Carry-save adders have O(1) depth, carry-lookahead adder has O(log n) depth Delay can be 7 cycles for long histories 16
Impact of Delay Even using latency-mitigation techniques, performance suffers 32-stage pipeline, 8 FO4 clock period, overriding 2K-entry bimodal predictor 17
Solution: A Path-Based Neural Predictor Instead of computing the prediction all at once... Stagger the computation in time, using weights found along the path to the branch being predicted [Jiménez 2003 (MICRO)] 18
Intuitive Description Neuron for branch bt has weights x0 through x7 for both predictors original perceptron predictor path-based neural predictor 19
Algorithm Overview Components of algorithm Terms Prediction Speculative update Non-speculative update / training h, n, W as before SR is a shift vector of ints that accumulates partial sums SR[j] holds the sum for the (h-j)th branch in the future SG is the speculative global history R and G are non-speculative versions of SR and SG 20
Prediction Algorithm i = branch PC mod n y = SR[h] + W[i,0], i.e., the final computation in the dot product if y >= 0 predict taken, otherwise predict not taken 21
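The point of the staggered design is that this final step is tiny. A sketch in Python (SR, W, h, n passed in purely for illustration):

```python
# Sketch of the path-based final prediction step: by prediction time the
# partial sum SR[h] is already waiting, so only the bias weight is added.
def predict(pc, SR, W, h, n):
    i = pc % n                # select the perceptron row for this branch
    y = SR[h] + W[i][0]       # last step of the dot product: add the bias
    return i, y >= 0          # True = predict taken
```

One table read, one addition, one sign check; this is why the prediction itself no longer needs the deep adder tree of the original perceptron predictor.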
Speculative Update Algorithm Update each partial sum in SR in parallel: for j in 1..h in parallel do SR'[h-j+1] = SR[h-j] + (prediction ? W[i,j] : -W[i,j]) end for SR := SR' Speculatively update SG with the prediction 22
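A direct transcription of this parallel update, written as a sequential Python sketch so the index mapping SR'[h-j+1] = SR[h-j] +/- W[i,j] is explicit (in hardware all h additions happen at once):

```python
H = 4   # illustrative history length

def speculative_update(SR, weights_row, prediction):
    """SR: h+1 partial sums; weights_row[j] = W[i,j] for the predicting branch i."""
    new_SR = [0] * (H + 1)              # a fresh zero enters at SR'[0]
    for j in range(1, H + 1):
        delta = weights_row[j] if prediction else -weights_row[j]
        new_SR[H - j + 1] = SR[H - j] + delta
    return new_SR
```

Each in-flight sum advances one slot per branch and absorbs one weight from the current branch's row, so a sum reaching slot h has collected one weight from each of the last h branches on the path.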
Non-Speculative Update / Training Non-speculatively update R using the outcome Let H be the history used to predict this branch Train the bias weight for this branch: If outcome = taken, W[i,0]++ else W[i,0]-- Train the rest of the weights based on correlation with history: for j in 1..h in parallel do let kj be the value of i, j branches ago if H[j] = outcome then W[kj,j]++ else W[kj,j]-- end for 23
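A hedged Python sketch of this training step. Here k_path[j-1] plays the role of kj (the table index used j branches ago) and hist[j-1] plays the role of H[j]; table sizes and names are illustrative, and misprediction recovery of R and SG is omitted.

```python
H = 4                                  # illustrative history length
N = 16                                 # illustrative table size
W = [[0] * (H + 1) for _ in range(N)]  # W[k][j]: weight j of perceptron k

def clamp8(w):
    return max(-128, min(127, w))      # weights are 8-bit integers

def train(i, k_path, hist, outcome):
    """i: this branch's index; outcome: True if the branch was taken."""
    # the bias weight learns the branch's overall bias
    W[i][0] = clamp8(W[i][0] + (1 if outcome else -1))
    for j in range(1, H + 1):
        k = k_path[j - 1]              # index of the branch j branches ago
        # strengthen the weight when the past outcome correlates with this one
        if hist[j - 1] == (1 if outcome else 0):
            W[k][j] = clamp8(W[k][j] + 1)
        else:
            W[k][j] = clamp8(W[k][j] - 1)
```

Note that, unlike the original perceptron predictor, the updated weights are scattered across the rows of the branches on the path, which is what lets prediction reuse them as partial sums.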
Experimental Evaluation Used a modified version of SimpleScalar/Alpha 3.0 8-wide processor, 32-stage pipeline Used HSPICE + CACTI 3.0 for delay estimates Compared against overriding versions of: 2Bc-gskew [Seznec et al. '02] Fixed-length path predictor [Stark et al. '98] Global/local perceptron predictor [Jiménez & Lin '03] Also used single-cycle pipelined gshare [McFarling '93], [Jiménez '03] Measured misprediction rates and instructions per cycle (IPC) All 12 SPECint2000 benchmarks + 5 from '95 not duplicated in 2000 24
Results: Accuracy Path-based neural predictor is the most accurate (of course) 25
Results: IPC Path-based neural predictor yields the best performance 26
IPC Per Benchmark: 8KB Budget Best on 15 of 17 benchmarks 27
Analysis: Linear Separability Perceptrons can learn well only linearly separable functions But half of all branches are linearly inseparable! 28
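A quick way to see the claim on this slide: a single perceptron trained on AND (linearly separable) converges, but on XOR (linearly inseparable) no weight vector can ever classify all four cases. This toy demo is illustrative only; it is not the predictor's actual training rule.

```python
# Train a single perceptron (bias + two inputs) with the classic update rule.
def perceptron_learns(samples, epochs=100):
    w = [0, 0, 0]                          # [bias, w1, w2]
    for _ in range(epochs):
        for x1, x2, target in samples:     # inputs and targets are +/-1
            y = w[0] + w[1] * x1 + w[2] * x2
            pred = 1 if y >= 0 else -1
            if pred != target:             # update only on a mistake
                w[0] += target
                w[1] += target * x1
                w[2] += target * x2
    # did the final weights classify every sample correctly?
    return all((w[0] + w[1] * x1 + w[2] * x2 >= 0) == (t == 1)
               for x1, x2, t in samples)

AND = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]   # separable
XOR = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]    # inseparable

# perceptron_learns(AND) -> True; perceptron_learns(XOR) -> False
```

Branches whose outcome is an XOR-like function of their history behave like the second case, which is why they dominate the perceptron predictor's mispredictions.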
Analysis cont. Almost all mispredictions come from inseparable branches! Path-based neural predictor can predict these branches well n = path-based neural p = perceptron g = gshare history length = 10 29
Conclusions & Future Work Neural branch predictors are a viable technology Neural predictors offer performance beyond traditional counter-based schemes Design space for path-based neural predictors can be further explored, e.g.: Global/local Static/dynamic Overriding with partial sums Applications beyond branch prediction... 30
The End 31
Prediction Algorithm SR[0..h] is a vector of h+1 integers that holds speculative partial sums Think of SR[] as a pipeline of partial sums; zeros go in and the dot product of the weights and history bits comes out (minus the bias weight, which is added at prediction time) 32
Update Algorithm Maintains non-speculative partial sums for misprediction recovery Increments or decrements weights corresponding to neurons for positive or negative correlation 33
History Length Tuning Tuned predictors for optimal history length Path-based histories were shorter than the perceptron predictor's 34
Delay Estimates Used HSPICE and CACTI 3.0 to estimate latencies for the predictors Based on 90 nm technology, an aggressive clock period of 8 fan-out-of-4 (FO4) inverter delays 35
Intuitive Description cont. Weights are chosen ahead of time, based on the path leading to branch bt (x0 is the bias weight) 36
Neural Branch Predictors Are Feasible Most branch predictors are based on tables of two-bit counters Neural predictors use a superior prediction technology Accuracy is better than table-based approaches However, latency of computation makes them impractical We propose a new algorithm for computing the neural prediction Almost all work is done ahead of time, so latency is greatly improved Incorporates path information, so accuracy is improved Yields speedup of 16% over perceptron predictor, 4% over 2Bc-gskew 37
Branch Prediction is a Machine Learning Problem So why not apply a machine learning algorithm? Replace 2-bit counters with a more accurate predictor Tight constraints on the prediction mechanism Must be fast and small enough to work as a component of a microprocessor Artificial neural networks Simple model of the neural networks in brain cells Learn to recognize and classify patterns Most neural nets are slow and complex relative to tables For branch prediction, we need a small and fast neural method 38
Accuracy Per Benchmark: 8KB Budget Best on 14 of the 17 benchmarks, not counting the perceptron predictor 39