DYNAMIC NEURAL BRANCH PREDICTION FUNDAMENTALS

Prof. Eng. Lucian N. VINȚAN, PhD
Lucian Blaga University of Sibiu, Faculty of Engineering; The Academy of Technical Sciences of Romania

ABSTRACT. This paper presents the theoretical and practical foundations of dynamic neural branch prediction techniques, as implemented in modern superscalar microarchitectures. Using refined machine learning methods in processor design can significantly improve CPU performance through intelligent, context-sensitive behaviour. Adapting such powerful theoretical methods to hardware restrictions is a true art, involving not only refined theoretical and practical methods, but also the researcher's intuition and talent. Only by fertilizing the Computer Architecture domain with powerful theoretical methods can it someday become a mature science, rather than the empirical one it is today.

Keywords: Perceptron, Computer Architecture, Microprocessors, (Dynamic Neural) Branch Prediction, Hybrid Branch Predictors, Meta-Prediction.

1. THE NEURAL BRANCH PREDICTION GENESIS

As the average instruction issue rate and the depth of the pipeline structures implemented in multiple-instruction-issue microprocessors increase, accurate dynamic branch prediction becomes ever more essential. Very high prediction accuracy is required because an increasing number of instructions are lost before a branch misprediction can be corrected during the so-called recovery process. As a result, even a misprediction rate of a few percent involves a substantial performance loss [1, 7]. During the real-time prediction process, branches must be detected within the dynamic instruction stream, and both the branch's direction (taken or not taken) and its target address must be predicted during the instruction fetch stage. All of this must be completed in time to fetch instructions from the branch target address without interrupting the flow of new dynamic instructions into the processor's pipeline. The requirement for higher branch prediction accuracy in multiple-instruction-issue systems led to a dramatic breakthrough in the early 1990s, with high branch prediction success rates. These high prediction accuracies were obtained using a set of prediction techniques collectively known as Two-Level Adaptive Branch Prediction, developed mainly by Professor Yale Patt's group [10].
Such simplified Markovian predictors are currently implemented in commercial superscalar microprocessors. Figure 1 presents a generic Two-Level Adaptive Branch Predictor scheme, where PCL represents the Program Counter's low bits, GHR = the Global History Register (global correlation; it contains the outcomes of the previous k branches), LHR = the Local History Register (per-branch local correlation; it contains the outcomes of the previous j instances of the current branch), FSM Pred. = the branch predictor implemented as a simple Finite State Machine, PHT = Pattern History Table, PT = Prediction Table, t/nt = taken/not-taken transitions, and T/NT = taken/not-taken states. After the branch is resolved, the branch's outcome correspondingly updates the GHR, LHR and FSM Pred. structures. The complexity of such a scheme is exponential, involving tables with $2^n$ and $2^m$ entries respectively (n and m being the history lengths used to index them). In order to reduce this huge complexity, hashing between the GHR and the LHR is frequently used.

Fig. 1. Generic Two-Level Adaptive Branch Predictor.
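To make the scheme in Figure 1 concrete, the following minimal Python sketch models such a predictor in software. It is an illustration only, not a hardware description; the class name, table size and the gshare-style GHR/PC hashing used here are assumptions of the sketch, chosen as one common instance of the hashing mentioned above.

```python
class TwoLevelPredictor:
    """Minimal software model of a global two-level adaptive predictor:
    a global history register (GHR), hashed with the PC low bits,
    indexes a table of 2-bit saturating-counter FSM predictors."""

    def __init__(self, history_bits=12):
        self.history_bits = history_bits
        self.ghr = 0                                  # global history register
        self.pht = [1] * (1 << history_bits)          # 2**n counters: the exponential cost
        # counter states: 0, 1 = Not Taken (NT); 2, 3 = Taken (T)

    def _index(self, pc):
        return (pc ^ self.ghr) & ((1 << self.history_bits) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2         # predict Taken in states 2 and 3

    def update(self, pc, taken):
        i = self._index(pc)                           # saturating t/nt transitions
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

Even in this tiny model, the pattern table already needs $2^n$ counters for an n-bit history, which is exactly the exponential growth in capacity discussed above.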

Details about the branch prediction process implemented in modern superscalar microprocessors are given in [1, 7, 9].

Dynamic branch prediction with neural methods, as an alternative to the Two-Level Adaptive Branch Prediction model, was first introduced by L. Vinţan [5]. That work showed that a Learning Vector Quantization-based branch predictor and a Multi-Layer Perceptron branch predictor might be effective. One of the main research objectives was to use neural networks in order to identify new correlations that can be exploited by branch predictors. An important advantage is that, in contrast with the previous branch prediction paradigm, neural branch prediction can exploit deeper correlations at linear complexity, rather than the exponential complexity of the Two-Level Adaptive Branch Prediction schemes. Additionally, the learning process of these supervised neural predictors converges mathematically, as we will show further in this work. In contrast, the learning algorithm of the Two-Level Adaptive Branch Prediction schemes was developed in a purely empirical manner. In other words, the FSM Pred. is designed based on a statistical, empirical intuition of branch behaviour and not on a rigorous mathematical model, as the supervised neural algorithms are. For example, regarding the FSM predictor (see Figure 1), there is no mathematical justification for why, if the predictor's current state is the weak Taken one (the second T state) and the Taken prediction was wrong (the branch was in fact Not Taken), the automaton transitions (nt) into the weak Not Taken state (the first NT state). One could ask: why does it not go into the strong Not Taken state? A justification can be given only from profiling information (statistical branch behaviour), after benchmarking. A mathematical proof of the FSM learning algorithm's convergence is not possible. As we will show further (Section 4), in the case of neural predictors the learning convergence is mathematically proved. The initial neural branch prediction idea was developed much further, especially by Dr. Daniel Jiménez [2, 3] and, subsequently, by many other computer architecture researchers, from both academia and industry.

The simplest perceptron is a single-cell one, as presented in Figure 2.

Fig. 2. The Simplest Perceptron's Structure (inputs $x_0 = 1, x_1, \ldots, x_n$, weights $w_0, w_1, \ldots, w_n$, output Y).

$X = \{x_0, x_1, \ldots, x_n\}$ represents the input vector and $W = \{w_0, w_1, \ldots, w_n\}$ represents the weight vector. The perceptron's output is

$O(X) = W \cdot X = w_0 + \sum_{k=1}^{n} w_k x_k$   (1)

and the decision hyperplane is given by the equation $W \cdot X = 0$. The output's sign is $Y = +1$ if $O(X) > 0$ and $Y = -1$ if $O(X) \le 0$. Frequently, the Y value is obtained through a so-called activation function, such as the sigmoid $Y = 1/(1 + e^{-O(X)}) \in [0, 1]$, or variations like $Y = (1 - e^{-k O(X)})/(1 + e^{-k O(X)}) \in [-1, 1]$. The perceptron may be used as a binary classifier or predictor (a branch predictor in our case, predicting branch Taken = +1 or branch Not Taken = -1). Of course, this simple perceptron can correctly classify only linearly separable sets of examples (X). For example, the logical XOR function cannot be represented by a simple perceptron.
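As a concrete reading of equation (1) and the sign activation, here is a minimal Python sketch of the perceptron's feed-forward phase; the function name and the NumPy formulation are illustrative choices, not part of the original text.

```python
import numpy as np

def perceptron_output(weights, inputs):
    """Feed-forward phase of the simplest perceptron (equation 1).

    weights: [w0, w1, ..., wn]   (w0 is the bias weight)
    inputs:  [x1, ..., xn], each encoded as +1 (taken) or -1 (not taken)
    Returns the linear output O(X) and its sign Y (+1 = Taken, -1 = Not Taken).
    """
    x = np.concatenate(([1.0], inputs))      # x0 = 1 is the constant bias input
    o = float(np.dot(weights, x))            # O(X) = W . X
    y = 1 if o > 0 else -1                   # sign activation
    return o, y

print(perceptron_output([0.5, 1.0, -2.0], [1, 1]))   # (-0.5, -1): predicted Not Taken
```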
Below we give a simple proof that the XOR function cannot be represented, considering a perceptron with three inputs ($x_0 = 1$, the bias) and the output

$O(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 = W \cdot X$,   (2)

where $x_1, x_2$ are the inputs and $w_0, w_1, w_2$ are the corresponding weights (parameters), belonging to the set of real numbers ($\mathbb{R}$).

Taking into account the XOR function, $Y = x_1 \ XOR \ x_2$, the following table is obtained.

Table 1. The XOR function cannot be learned

  x1   x2   Y = x1 XOR x2
  0    0    0 = w0 + w1*0 + w2*0, thus w0 = 0
  0    1    1 = 0 + w1*0 + w2*1, thus w2 = 1
  1    0    1 = 0 + w1*1 + 1*0, thus w1 = 1
  1    1    0 = 0 + 1*1 + 1*1, i.e. 0 = 2 (contradiction!)

Given the contradiction obtained above, a perceptron cannot represent or learn the logical XOR function (we cannot find adequate weights fulfilling the XOR function). We call such a function linearly inseparable, because there is no (hyper)plane $\overline{W} \cdot \overline{X} = 0$ that can separate the set of positive examples from the set of negative examples. In our particular example, in the three-dimensional (Euclidean) space there is no plane $O(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 = 0$ that can separate the positive XOR outputs (1) from the negative ones (0).

The main problem is how to develop a learning rule for a simple perceptron, so that it correctly learns a set D of linearly separable training input vectors (examples). If we consider a supervised learning rule, for each example (training vector) $d \in D$ it is necessary to know the corresponding correct output, denoted $t_d$. If

$O_d = w_0 + \sum_{k=1}^{n} w_k x_{dk}$   (3)

is the real output, one common and convenient measure of the global error E is:

$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - O_d)^2$   (4)

The rationale for the 1/2 factor will be seen further on; it is not essential. Given equation (4) for $E(w)$, its graphical representation is always a quadric hypersurface (parabolic in this case) with a single (global) minimum. Of course, the particular weight vector w that gives this minimum classifies the training examples $X_d$ (with components $x_{dk}$, $k = 0, 1, \ldots, n$) in the best possible manner.

2. THE ERROR FUNCTION: AN INTUITIVE EXAMPLE

We now consider a simple particular case, where the example set D contains only two examples: a positive one ($x_1$) and a negative one ($x_2$). We denote the corresponding targets $t_1$ and $t_2$, respectively. In this case we can write:

$O(x_1) = w_0 + w_1 x_1$ and $O(x_2) = w_0 + w_1 x_2$   (5)

The corresponding error function is $E: \mathbb{R}^2 \to \mathbb{R}$,

$E(w_0, w_1) = 0.5[(t_1 - w_0 - w_1 x_1)^2 + (t_2 - w_0 - w_1 x_2)^2]$   (6)

It can be shown that the minimum global error's value is

$E_{min} = 0.25[t_1 - t_2 + w_1(x_2 - x_1)]^2$   (7)

obtained when the equation $2w_0 + w_1(x_1 + x_2) = t_1 + t_2$ is fulfilled. Concretely, if we consider the positive example $(x_1, t_1) = (1, 1)$ and the negative example $(x_2, t_2) = (-1, -1)$, the error function becomes:

$E(w_0, w_1) = 0.5[(1 - w_0 - w_1)^2 + (-1 - w_0 + w_1)^2] = w_0^2 + (w_1 - 1)^2$   (8)

In this case the (unique) global minimum is obtained for $w_0 = 0$ and $w_1 = 1$, and it is $E_{min} = 0$. In fact, the above equation represents an elliptic paraboloid of equation $z = x^2 + y^2$, where $z = E(w_0, w_1)$, $x = w_0$ and $y = w_1 - 1$. (A more general equation of a paraboloid is $x^2/a^2 + y^2/b^2 = z/c$; substituting in the previous expression x with $mx + n$, y with $py + q$ and z with $rz + s$, a more general form would be obtained.) Using the Wolfram Alpha software tool we obtained the parabolic surface of this error function (with notations $w_0 = x$ and $w_1 = y$), presented in Figure 3.

Fig. 3. The 3D Error Function Representation.
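The error surface of this example can be checked numerically. The short Python sketch below evaluates equation (6) for the two chosen examples and confirms that it reduces to $w_0^2 + (w_1 - 1)^2$ with its unique minimum at $(w_0, w_1) = (0, 1)$; it is an illustrative verification only.

```python
import numpy as np

# The two training examples of Section 2: (x1, t1) = (1, 1), (x2, t2) = (-1, -1).
x = np.array([1.0, -1.0])
t = np.array([1.0, -1.0])

def E(w0, w1):
    """Global error E(w0, w1) = 0.5 * sum_d (t_d - (w0 + w1 * x_d))^2  (equation 6)."""
    return 0.5 * np.sum((t - (w0 + w1 * x)) ** 2)

# Equation (8): E reduces to the elliptic paraboloid w0^2 + (w1 - 1)^2.
for w0, w1 in [(0.0, 1.0), (1.0, 1.0), (0.5, 2.0)]:
    assert np.isclose(E(w0, w1), w0 ** 2 + (w1 - 1.0) ** 2)

print(E(0.0, 1.0))   # 0.0 -> the unique global minimum at w0 = 0, w1 = 1
```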

3. THE PERCEPTRON LEARNING PROCESS

The gradient of $E(w)$ is written:

$\nabla E(w) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right] = \sum_{k=0}^{n} \frac{\partial E}{\partial w_k} i_k$   (9)

where $i_k$ are the unit vectors of the orthogonal axes (meaning that the dot product between any two distinct unit vectors is always 0). It is well known that $-\nabla E(w)$ gives the direction of the steepest local decrease in E (generally speaking, towards the nearest local minimum). In this case a rational learning rule might be the following: $W \leftarrow W + \Delta W$, where $\Delta W = -\alpha \nabla E(W)$ and $\alpha$ is the learning rate (a small positive real number). This is equivalent to:

$w_k \leftarrow w_k - \alpha \frac{\partial E}{\partial w_k}, \ \forall k = 0, 1, \ldots, n$   (10)

In other words, each weight's value $w_k$ is modified along its own axis in order to decrease the error function's next value (until reaching its minimum). But it follows that:

$\frac{\partial E}{\partial w_k} = \frac{\partial}{\partial w_k} \left( \frac{1}{2} \sum_{d \in D} (t_d - O_d)^2 \right) = \sum_{d \in D} (t_d - O_d) \frac{\partial (t_d - W \cdot X_d)}{\partial w_k} = -\sum_{d \in D} x_{dk} (t_d - O_d)$   (11)

Finally, the supervised learning rule is:

$w_k \leftarrow w_k + \alpha \sum_{d \in D} (t_d - O_d) x_{dk}, \ \forall k = 0, 1, \ldots, n$   (12)

It is called gradient descent or the delta rule. Instead of considering $\alpha$ a fixed small positive number, a descent function $\alpha(t)$ might be used, too. Based on these simple computations we now present [4] the:

Gradient Descent Perceptron Learning Algorithm
  Initialize each w_k to (pseudo)random values in [-2/n, 2/n] (recommended)
  Until E(w) < T (threshold), DO:
    Initialize each Delta_w_k = 0
    For each learning pair (x_d, t_d) from the training examples, DO:
      Compute the corresponding output O_d
      For each w_k, DO: Delta_w_k <- Delta_w_k + alpha * (t_d - O_d) * x_dk
    For each w_k, DO: w_k = w_k + Delta_w_k

Further, instead of considering the sum of all the local errors, we can consider a distinct (local) error

$E_d(w) = \frac{1}{2} (t_d - O_d)^2$   (13)

for each example belonging to the training set D. Presenting the examples $X_d$ in random order provides a reasonable approximation to descending the gradient with respect to the original global error $E(w)$. The stochastic gradient descent rule is the following [4]:

Stochastic Gradient Descent Perceptron Learning Algorithm
  Initialize each w_k randomly in [-2/n, +2/n] (recommended)
  Until the termination condition is met (E(w) < T or O_d > T, etc.), DO:
    For each learning pair (x_d, t_d), DO:
      Compute the output O_d
      For each w_k, DO: w_k <- w_k + alpha * (t_d - O_d) * x_dk

Summing over multiple examples, the standard gradient descent rule is more time consuming, but it is often used with a larger step size per weight update than stochastic (incremental) gradient descent. (If, for more complex perceptron structures, $E(W)$ has multiple local minima, stochastic gradient descent can sometimes avoid falling into them, because it uses the various $E_d(W)$ to guide its search.) If we consider the perceptron's output $O(X) = sgn(W \cdot X)$ instead of $O(X) = W \cdot X$, this last simple rule is named the perceptron training rule, involving the following update:

$w_k \leftarrow w_k + \alpha (t - o) x_k, \ \forall k = 0, 1, \ldots, n$   (14)

If the training example is correctly classified (t = o), no weights are updated, meaning that the perceptron does not learn from successes (in contrast with the empirical FSM predictor presented in Figure 1!). Suppose now o = -1 and t = +1. In this case all $w_k$ having $x_k$ positive (like t) are incremented and the other $w_k$ are decremented. Similarly, if o = +1 and t = -1, all $w_k$ having $x_k$ negative (like t) are incremented and the others are decremented. Thus, the perceptron's learning rule is quite simple and intuitive.
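For illustration, the stochastic (incremental) gradient descent rule above can be written in a few lines of Python; the dataset, learning rate and epoch count below are arbitrary choices for the sketch, not values from the text.

```python
import numpy as np

def train_delta_rule(X, t, lr=0.05, epochs=100, seed=0):
    """Stochastic gradient descent sketch (the incremental form of equation 12).

    X: (m, n) array with the n inputs of each of the m examples (the bias
    input x0 = 1 is prepended here); t: targets encoded as -1 / +1.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])              # prepend the bias input x0 = 1
    w = rng.uniform(-2.0 / n, 2.0 / n, size=n + 1)    # recommended initialization range
    for _ in range(epochs):
        for d in rng.permutation(m):                  # present the examples randomly
            o = float(np.dot(w, Xb[d]))               # linear output O_d = W . X_d
            w += lr * (t[d] - o) * Xb[d]              # w_k += alpha * (t_d - O_d) * x_dk
    return w

# A linearly separable example (bipolar AND): the trained perceptron classifies it correctly.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1], dtype=float)
w = train_delta_rule(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))   # expected: [-1. -1. -1.  1.]
```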

4. THE SIMPLE PERCEPTRON DYNAMIC BRANCH PREDICTOR

The perceptron branch predictor was introduced in [2]. Its prediction rule is very simple: if the output value is positive (+1), the branch is predicted as taken; otherwise (-1), as not taken. This is done during the perceptron's feed-forward phase (see Figure 2). Regarding the perceptron branch predictor's learning process (backward phase), as a general intuitive rule we observed that: if sgn t = sgn x_k then w_k is incremented; otherwise w_k is decremented. Based on this qualitative learning rule, D. Jimenez [2] proposed a pragmatic algorithm, feasible for hardware implementation, for training perceptron branch predictors:

Simplified Branch Predictor Learning Rule
  if sign(o) != t or |O| < T then
    for k = 0 to n do
      w_k = w_k + t * x_k   (incrementing if sgn t = sgn x_k; otherwise decrementing)
    end for
  end if

In order to facilitate the hardware implementation of such a predictor, Jimenez further proposes an even more simplified learning algorithm:

The Learning Algorithm Implemented in Hardware
  for each bit k, do in parallel (hardware parallelism is natural!):
    if t = x_k then w_k = w_k + 1 else w_k = w_k - 1
  end if

In the case of the perceptron branch predictor, the input vector X represents the usual input vector, as in a classical Two-Level Adaptive Branch Predictor (local/global histories, branch address PCL, path information, etc.) [7, 10]. It is considered that if t = -1 the branch was not taken and if t = +1 the branch was taken. Since t and x_k are always either -1 or +1, this training algorithm increments the k-th weight when the branch outcome agrees with the sign of x_k (positive correlation), and decrements the weight when it disagrees with the corresponding input value (negative correlation). Figure 4 presents the whole perceptron branch predictor. Its processing steps are the following:
1. The current instruction's PC (Program Counter value) is hashed to produce an index into the predictors table, which contains the perceptrons' weights.
2. The i-th perceptron is fetched from the table into a vector register containing its weights (the P vector).
3. The output O is computed as the dot product of P and the history register (global, local, etc.).
4. The branch prediction is made (Taken if O > 0, Not Taken if O <= 0).
5. After the real branch outcome is known, the training algorithm uses this outcome and the predicted value to correspondingly update the weights in P.
6. P is written back into the predictors table (PT).

Fig. 4. The Whole Prediction Process.
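The six processing steps above can be modelled in a few lines of Python. The sketch below is a software illustration, not the hardware design of [2]; the table size, history length and the training-threshold formula used here (T ~ 1.93h + 14, a value often associated with this predictor) should all be treated as assumptions of the sketch.

```python
import numpy as np

class PerceptronBranchPredictor:
    """Software model of the perceptron branch predictor's steps 1-6 (illustrative)."""

    def __init__(self, n_perceptrons=1024, history_len=32):
        self.h = history_len
        self.T = int(1.93 * history_len + 14)             # training threshold (assumed formula)
        self.W = np.zeros((n_perceptrons, history_len + 1), dtype=np.int32)  # predictors table (PT)
        self.ghr = np.ones(history_len, dtype=np.int32)   # global history, encoded as +1 / -1

    def _index(self, pc):
        return pc % self.W.shape[0]                       # step 1: hash the PC into the table

    def predict(self, pc):
        w = self.W[self._index(pc)]                       # step 2: fetch the i-th perceptron (P)
        o = int(w[0]) + int(np.dot(w[1:], self.ghr))      # step 3: bias + dot product with history
        return o, o > 0                                   # step 4: Taken iff O > 0

    def update(self, pc, o, taken):
        i, t = self._index(pc), (1 if taken else -1)
        if (o > 0) != taken or abs(o) < self.T:           # step 5: train on mispredictions / weak outputs
            self.W[i, 0] += t
            self.W[i, 1:] += t * self.ghr                 # w_k += t * x_k (agreement -> increment)
        # step 6 is implicit here: W is updated in place; finally shift the history register
        self.ghr = np.roll(self.ghr, 1)
        self.ghr[0] = t
```

A hardware implementation would additionally saturate the weights to their 8-bit range, as discussed below.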

In [2], some very clever simplifications are proposed in order to adapt the dynamic perceptron branch prediction algorithm to hardware implementation restrictions and to reduce the prediction's timing. Some of these simplifications are the following. The weights of the perceptron predictor are implemented as signed integers represented in two's complement (8-bit weights, i.e. bytes). Since the history information (the input of the perceptron predictor) is codified using only -1 and +1, no multiplication is needed to compute the dot product. Instead of multiplications, there are only additions (when the input bit x_k is +1) and additions of the two's complement (when the input bit is -1). In practice, adding the one's complement approximates the two's complement very well, while being simpler to implement in hardware. Additionally, this trick avoids the delay of a small carry-propagate adder that would finally add +1 for computing the weight's two's complement. More concretely, the weights are bitwise exclusive-ORed with the corresponding bits of the history register content. If the k-th history bit is -1, the resulting weight is one's complemented (XORed with the byte 11111111 in binary); otherwise the resulting weight is unchanged. In our opinion, a simple logical negation of the weight byte could be an alternative to the XORing. Thus the predictor can be implemented as a binary adder, whose inputs are the weight bytes, either in their normal binary representation or in one's complement representation. As a very important consequence, the predictor's hardware complexity is only linear, instead of exponential as in the classical prediction schemes. In other words, if the history (context) length increases by one bit, in principle just one new digital adder, operating on two byte inputs, is additionally needed. In contrast, in the Two-Level Adaptive Branch Predictors' case, linearly increasing the history length involves an exponential growth of the prediction tables' capacities. After processing the weights, the sum is obtained using a Wallace tree of 3-to-2 carry-save adders, which reduces adding n bytes to the problem of adding just two bytes [2]. The final two numbers are added with a carry-look-ahead adder. The computation is relatively quick ($O(\log_2 n)$) because only the sign bit of the result is needed to make a prediction. Additional performance gains can be found for branch history lengths of up to 66 bits. As already shown, such long histories are impossible to exploit with Two-Level Adaptive Branch Predictors, due to the huge complexity involved. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor (he also considered gshare/perceptron overriding hybrid predictors). The main disadvantage of the perceptron predictor is its high latency. Even when using high-speed arithmetic tricks, such as those already mentioned, the computation latency is relatively high compared with the clock period of a deeply pipelined microarchitecture. It was shown that the perceptron branch predictor is the best one for linearly separable branches (compared with all classical predictors).
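The multiplication-free dot product described above can be illustrated in software as follows (a sketch of the idea only, not a circuit description). Each term whose history bit is -1 is replaced by the one's complement of the weight, so the total is off by at most one unit per negated term, which very rarely changes the sign used for the prediction.

```python
def ones_complement(w):
    """Bitwise NOT of a signed weight, i.e. -w - 1 (the one's complement)."""
    return ~w

def perceptron_sum_no_multiply(weights, history_bits):
    """Add w_k when x_k = +1 and add the one's complement of w_k when x_k = -1.

    This approximates sum(w_k * x_k): each complemented term is off by 1.
    """
    return sum(w if x == +1 else ones_complement(w) for w, x in zip(weights, history_bits))

ws = [5, -3, 7, 2]
xs = [+1, -1, -1, +1]
print(perceptron_sum_no_multiply(ws, xs),            # 1: approximate sum
      sum(w * x for w, x in zip(ws, xs)))            # 3: exact dot product (two terms, each off by 1)
```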
5. IMPROVING THE HARDWARE PERCEPTRON PREDICTOR

In order to further reduce the prediction latency, [3] presents a perceptron predictor that chooses the weights used to generate a prediction according to the current branch's path, rather than according to the branch's PC and a simple binary global history register. Considering a path consisting of the last h branches, the predictor keeps a matrix W of weight vectors, having N lines and (h+1) weights per line: w[i,0], w[i,1], ..., w[i,h]. Each time a branch is fetched, its 0-th weight w[i,0] is added to a running total that has been kept for the last h branches, with each summand added during the processing of the previous branches. The key idea is that each branch is predicted based on its correlation with the previous h branches' predictions, rather than with the previous h branches' outcomes. Therefore, in order to improve the prediction latency, even the prediction information is speculative in this case. This so-called path-based neural prediction method has two main advantages:
- The prediction latency is almost completely hidden, because the computation of the output O can begin in advance of the effective prediction, with each step proceeding as soon as a new element of the path is executed. The most timing-critical operation is the sum of the bias weight and the current partial sum.
- Prediction accuracy is improved because the predictor incorporates path information, too. As shown in [6, 8], path information can slightly improve the branch prediction accuracy.

In [3] it is shown that this path-based perceptron predictor outperforms classical predictors at the same hardware budget. For example, it is reported that at a 64 KB hardware budget it delivers an IPC 16% higher than that of the simple perceptron predictor, due to its lower latency. Details about this very effective path-based neural prediction are given in [3]. A presentation developing a synthetic model in a rigorous, comprehensive and intuitive manner is also given in [9].
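The running-total idea can be sketched in software as follows. This is a simplified model of the path-based computation only; weight training, checkpointing and the pipelined hardware organization of [3] are omitted, and the structure sizes are arbitrary assumptions.

```python
import numpy as np

class PathBasedPerceptronSketch:
    """Simplified model of the path-based running-sum computation (illustrative)."""

    def __init__(self, n_rows=512, h=16):
        self.h = h
        self.W = np.zeros((n_rows, h + 1), dtype=np.int32)   # N rows of h+1 weights each
        self.partial = np.zeros(h + 1, dtype=np.int64)       # running totals for the next h+1 branches

    def predict(self, pc):
        i = pc % self.W.shape[0]
        # The oldest running total, built up while the previous h branches executed,
        # plus the current branch's bias weight w[i,0], gives the output.
        o = int(self.partial[0]) + int(self.W[i, 0])
        return i, o > 0

    def advance(self, i, direction_plus_minus_one):
        # Fold this branch's (possibly speculative) direction into the partial sums:
        # weight w[i, j] contributes to the branch that will be predicted j steps from now.
        self.partial = np.roll(self.partial, -1)
        self.partial[-1] = 0
        self.partial[: self.h] += direction_plus_minus_one * self.W[i, 1:]
```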

6. HYBRID PREDICTORS AND META-PREDICTORS

There is no single best branch predictor. For example, some branch predictors obtain better prediction accuracy, while others obtain better prediction latency. Perceptron predictors are usually better than the classical Two-Level Adaptive Branch Predictors (TLABP) from the accuracy point of view, but they are not as quick. Perceptron predictors are very effective in predicting linearly separable branches, but they cannot successfully predict branches that are not separable by a hyperplane. Fortunately, TLABPs can be effective in predicting such difficult branches, because each context pattern (the most recent n bits of the branch's history) indexes its own predictor. Another example: in order to accurately predict the target addresses of indirect jumps/calls (generated, for example, by polymorphism), special dedicated value predictors are necessary [9]. Therefore, the idea of hybrid branch prediction is natural, meaning that several different (heterogeneous) branch predictors work together. Most commercial superscalar microprocessors implement two or even three distinct branch predictors working together. In this case, a new challenge is to design an effective meta-predictor that dynamically (and adaptively) selects the best branch predictor at a given moment. The meta-predictor can be designed as non-adaptive or even adaptive (like a perceptron, for example!). Its decision can be based on the branch predictors' confidences. A predictor's confidence could be represented as a binary number of k bits showing the predictor's behaviour during its last k instances (1 = correct prediction, 0 = incorrect prediction). Figure 5 shows the generic scheme of a TLABP and a Neural Branch Predictor (NBP) working together through an adaptive Meta-Predictor (MP). Confid1 and Confid2 represent the confidences attached to each branch predictor. Based on this information and the local predictions (Pred1 and Pred2), the adaptive MP generates the final prediction (Taken/Not Taken). After the branch is resolved, the branch's outcome updates these structures in an adaptive way. The most effective hybrid branch predictors presented within the Branch Prediction Championships were combinations of TLABPs and perceptron predictors. The main problem in such hybrid schemes remains the prediction latency.

Fig. 5. A Hybrid (TLABP, NBP) Predictor Using a Meta-Predictor (MP).
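A minimal sketch of the confidence-based selection idea is given below. The arbitration policy used here (choose the component predictor with more correct outcomes among its last k instances) is an illustrative assumption, not the exact adaptive scheme of Figure 5.

```python
from collections import deque

class ConfidenceMetaPredictor:
    """Selects between two component predictors using k-bit confidence registers."""

    def __init__(self, k=8):
        # k-bit confidence shift registers: 1 = correct, 0 = incorrect prediction
        self.conf1 = deque([1] * k, maxlen=k)
        self.conf2 = deque([1] * k, maxlen=k)

    def predict(self, pred1, pred2):
        # Final prediction: trust the recently more reliable component predictor.
        return pred1 if sum(self.conf1) >= sum(self.conf2) else pred2

    def update(self, pred1, pred2, outcome):
        # After the branch resolves, record each component predictor's success/failure.
        self.conf1.append(1 if pred1 == outcome else 0)
        self.conf2.append(1 if pred2 == outcome else 0)
```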
7. CONCLUSION

The Machine Learning domain can fertilize the Computer Architecture field, too. Dynamic neural branch prediction was one of the earliest successful examples in this sense, being by now an interesting success story. Some prestigious companies, like Intel, were interested in implementing such neural branch predictors in modern commercial microprocessors. Thus the idea is fertile. Further applications of neural prediction implemented in hardware might be found in dynamic value prediction, too [9]. Also, the idea of neural prediction might be useful in implementing the meta-prediction level of some hybrid predictors working together. More generally, exploiting the synergy of different domains can generate effective interdisciplinary approaches.

This work presented the theoretical and practical foundations of the dynamic neural branch prediction approach, as it might be implemented in modern superscalar microarchitectures. Using refined machine learning and artificial intelligence methods in processor design could significantly improve CPU performance through an intelligent, context-sensitive behaviour. Adapting such powerful theoretical methods to hardware restrictions is a true art, involving not only refined theoretical and practical methods, but also the researcher's intuition and talent. Only by fertilizing the Computer Architecture domain with powerful theoretical methods can it someday become a mature science, and not merely an empirical one, as it is today.

REFERENCES

[1] Hennessy J., Patterson D., Computer Architecture: A Quantitative Approach, 4th Edition, Morgan Kaufmann (Elsevier), 2007.
[2] Jimenez D., Lin C., Neural Methods for Dynamic Branch Prediction, ACM Transactions on Computer Systems, Vol. 20, No. 4, November 2002.
[3] Jimenez D., Fast Path-Based Neural Branch Prediction, The 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), San Diego, USA, December 2003.
[4] Mitchell T., Machine Learning, McGraw Hill, 1997.
[5] Vinţan L., Towards a High Performance Neural Branch Predictor, International Joint Conference on Neural Networks, Washington DC, USA, 10-16 July 1999.
[6] Vinţan L., Egan C., Extending Correlation in Branch Prediction Schemes, Proceedings of the 25th Euromicro International Conference, Milano, Italy, 8-10 September 1999.
[7] Vinţan L., Instruction Level Parallel Architectures (in Romanian), Romanian Academy Publishing House, Bucharest, 2000.
[8] Vinţan L. et al., Understanding Prediction Limits through Unbiased Branches, Eleventh Asia-Pacific Computer Systems Architecture Conference, Shanghai, 6-8 September 2006.

[9] Vinţan L., Prediction Techniques in Advanced Computing Architectures, Matrix Rom Publishing House, Bucharest, 2007.
[10] Yeh T., Patt Y. N., A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History, ISCA-20, San Diego, May 1993.

About the author

Prof. Eng. Lucian N. VINȚAN, PhD, Lucian Blaga University of Sibiu

Currently Professor and PhD Supervisor in Computer Engineering at Lucian Blaga University of Sibiu, Romania. He is an expert in instruction/thread-level parallelism, multi-core and many-core systems, automatic design space exploration (multi-objective optimization and meta-optimization), prediction techniques in ubiquitous computing systems, and text mining (classification/clustering). Professor VINȚAN has published over 140 scientific papers in prestigious journals and international conferences. His publications have acquired over 550 citations in over 350 works published in international conferences and scientific journals. He received the Romanian Academy "Tudor Tanasescu" Prize in 2005. In 2012 he was elected a full member of The Academy of Technical Sciences of Romania. Details about his professional activity can be found at: http://webspace.ulbsibiu.ro/lucian.vintan/html/ and http://www.astr.ro/membri/vintan-lucian_181