A framework for the timing analysis of dynamic branch predictors


Claire Maïza, INP Grenoble, Verimag, Grenoble, France, claire.maiza@imag.fr
Christine Rochange, IRIT - CNRS, Université de Toulouse, France, rochange@irit.fr

Abstract

Real-time systems require predictable processor behavior such that tight upper bounds on the worst-case execution times (WCETs) of their critical tasks can be derived. Methods for estimating these upper bounds have received much attention in the last fifteen years. However, the use of high-performance processors to meet ever-growing performance requirements demands the development of more and more complex models. This is also the case for dynamic branch prediction schemes that are implemented in processors used in embedded systems. In this paper we focus on the static timing analysis of dynamic branch prediction based on the implicit path enumeration technique (IPET). The goal of this paper is to introduce a framework that allows generating the IPET model of dynamic branch prediction mechanisms. With this framework we generated the models of some branch predictors and compared them.

1. Introduction

Real-time systems differ from other computer systems by the fact that respecting timing constraints is as important as providing correct results. In other words, the system not only has to deliver correct results, but must also deliver them in time. This must be proved by timing analysis. The objective is to assess that the longest execution time of each task makes it possible to build a task schedule that meets the constraints. When time-critical applications are considered, the timing analysis is expected to produce safe upper bounds of the execution times, also called upper bounds of the WCETs (Worst-Case Execution Times). Techniques for WCET analysis have been investigated for more than twenty years but they still need improvements and extensions to support increasingly complex hardware (needed to fulfill ever-growing performance requirements).
For example, recent embedded processors like the ARM11, the Cortex-A8 or the PowerPC 750 implement dynamic branch prediction schemes. Their behavior must be taken into account when computing WCETs. Several solutions have been proposed in the literature and are described below. Our contribution in this paper is a general framework to derive models for the dynamic branch predictors most frequently available in embedded cores.

The static analysis of dynamic branch prediction can be local, i.e., each branch is considered in isolation and its behavior is determined by the algorithmic structure it implements, or global, i.e., all branches are analyzed jointly so that their possible interactions can be accounted for.

Local approaches [5, 3, 2] aim at computing the number of mispredictions for each branch by only considering its previous executions. They are therefore limited to simple dynamic branch predictors such as bimodal predictors (see Section 2) that do not rely on a global history of branches. Furthermore, except for [5], which is based on the source code, the local approach is only used under the very optimistic assumption that the entries of the prediction table are not shared: the effect of sharing needs to be analysed separately, in a way similar to cache analysis, and in the case of conflicts, the branches are considered as mispredicted.

Global approaches describe the behavior of the branch predictor as a set of constraints [7, 4] that extend the ILP formulation used to estimate the WCET of the program with the IPET method [8]. The solution of the ILP problem includes the execution count of each branch and its number of mispredictions along both directions. Previous work using the global approach introduces a system of constraints to model one specific branch predictor: one based on local histories and 1-bit counters [7] and one based on a global history and 2-bit counters [4]. In this paper, we consider the global approach.
The main contribution of the paper is a framework to model branch predictors in a generic way, which extends previous solutions [7, 4]. This framework allows modeling any branch predictor that uses a branch history table (BHT), 1-bit or 2-bit counters to compute the prediction, and various ways to index the BHT (address of the branch, local or global history, or any combination). It is structured in three parts, each one being related to a view of the BHT behavior: computation of the BHT index, computation of the interdependencies between branches sharing the same BHT entry, and computation of the prediction by prediction counters. The framework allows modeling a large number of BHT configurations. It is generic on:
- the size of the BHT,
- the branch history (local or global),
- the size of the branch history,
- the mapping function used to index the BHT (for instance: exclusive-or),
- the size of prediction counters (1 bit or 2 bits).
Varying these parameters, we generate some models of branch predictors.

The second contribution of the paper is an evaluation of the global approach based on the generated models. We use the modeling complexity notion [4] to compare the size of the models. From this comparison we draw conclusions about the global approach and the branch predictors that are better suited to being modeled by this approach.

Note that the timing analyses of branch prediction using these two approaches are limited to architectures that do not exhibit timing anomalies [?], i.e., a misprediction is the worst case and can be accounted for by a fixed constant penalty.

The paper is organized as follows. Section 2 gives a short overview of dynamic branch prediction and lists the schemes that are implemented in some embedded processors. Section 3 recalls the IPET method and how branch prediction is integrated into it. In Section 4 we introduce our framework and show how it can be used to generate the models of some dynamic branch predictors. The approach is evaluated in Section 5 and we give some concluding remarks in Section 6.

2. Dynamic branch prediction

Branch prediction enhances pipeline performance by allowing the speculative fetching of instructions along the predicted path after a conditional branch has been encountered and until it is resolved. If the branch was mispredicted, the pipeline is flushed and the correct path is fetched and executed. The BTB (Branch Target Buffer) is used (i) to detect the presence of a branch instruction in the flow of fetched instructions before it has been decoded and (ii) to provide the branch target address in case the branch is predicted as taken.
The BTB is indexed by the lower bits of the program counter (PC) and each entry is tagged with the branch address (PC) for verification.

The possible directions of a branch instruction are taken (usually coded by 1) or not taken (coded by 0). The dynamic prediction of a branch direction is generally based on a counter that reflects the branch history. A 1-bit counter stores the last outcome of the branch. More often, a 2-bit saturating counter is used. Its four possible states are shown in Figure 1: strongly taken (T), weakly taken (t), weakly not taken (nt) and strongly not taken (NT). The counter is incremented when the branch is taken and decremented otherwise. The next instance of the branch is predicted as taken whenever the upper bit is set [11] and not taken when it is cleared.

Figure 1. Behavior of a 2-bit branch prediction counter

Several prediction counters (to be used by different branches) are maintained in a BHT (Branch History Table). Dynamic branch predictors differ by the way they index the BHT: the bimodal branch predictor uses the branch instruction address (PC) [11], while global branch predictors use an n-bit register that stores the outcomes of the n last executed branches [12]. For these schemes, the BHT can be directly indexed by the global history register (see Figure 2) or by a function (generally an exclusive-OR) of the history and the branch PC. This last scheme is known under the name gshare. A local history, private to each branch, can also be used. Note that more complex schemes exist but we have not found them implemented in any embedded processor. Therefore, we do not consider them in this paper.

Unlike the Branch Target Buffer, the Branch History Table is usually not tagged, to reduce costs. Two branches reaching the same entry of the BHT use the same prediction counter. Shared entries are more frequent when the BHT is indexed by a global history, even when this phenomenon, known as aliasing, is reduced by XOR-ing the history with the branch PC. Sharing can be advantageous (i.e., a branch takes advantage of the fact that its 2-bit counter has been updated by another branch) but it is more often destructive.

Table 1 shows the branch prediction schemes implemented in four embedded processors. The BHT is indexed by the branch PC in most of the cases. However, one processor implements a global-history branch predictor.

Core            BHT size   BHT index            Prediction
ARM Cortex-A8   4096       history and PC (a)   2-bit
ARM11           128        PC                   2-bit
PowerPC 750     512        PC                   2-bit
MIPS 34K        512        PC                   2-bit
(a) 10 bits of history and 4 bits of PC used to compute the index
(b) unified BTB and BHT

Table 1. Branch predictors implemented in four embedded processors
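As a rough illustration of the mechanisms described in this section, the following Python sketch models a 2-bit saturating counter and three ways of indexing the BHT (bimodal, global history, gshare). The function names, the encoding of the four states as the integers 0 to 3, and the use of the raw low PC bits as index are assumptions made for the example; they are not taken from any of the processors listed in Table 1.

```python
# Illustrative sketch only: names and encodings are assumptions.
# Counter states: 0 = NT (strongly not taken), 1 = nt, 2 = t, 3 = T (strongly taken).

def predict(counter: int) -> int:
    """Predicted direction: 1 (taken) when the upper bit of the 2-bit counter is set."""
    return (counter >> 1) & 1

def update(counter: int, taken: int) -> int:
    """Saturating update: increment when taken, decrement otherwise."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def bimodal_index(pc: int, index_bits: int) -> int:
    """Bimodal predictor: the BHT is indexed by (low bits of) the branch address."""
    return pc & ((1 << index_bits) - 1)

def global_index(history: int, index_bits: int) -> int:
    """Global predictor: the BHT is indexed directly by the global history."""
    return history & ((1 << index_bits) - 1)

def gshare_index(pc: int, history: int, index_bits: int) -> int:
    """gshare: the index is the exclusive-OR of the history and the branch PC."""
    return (pc ^ history) & ((1 << index_bits) - 1)

def count_mispredictions(outcomes, counter: int = 0) -> int:
    """Run one counter over a sequence of outcomes (1 = taken, 0 = not taken)."""
    misses = 0
    for taken in outcomes:
        if predict(counter) != taken:
            misses += 1
        counter = update(counter, taken)
    return misses
```

For instance, starting from the strongly-not-taken state, the outcome sequence 1, 1, 1, 0, 1, 1 is mispredicted on its first two taken outcomes and on the single not-taken outcome.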

Figure 2. A global-history dynamic branch predictor

In this paper, we focus on these types of dynamic predictors. The prediction is based on 2-bit counters, and the BHT is indexed by a function of the branch instruction address and the global history.

In the remainder of this paper, we focus on the analysis of the BHT behavior. Modeling the BTB (which contains the target addresses for branches that are predicted taken) is out of the scope of this work. For this reason, we assume that each branch always hits in the BTB. As future work, we plan to combine our analysis with a BTB analysis such as [5, 6].

3. Modeling dynamic branch prediction by IPET

Throughout this paper, we use the following notation. A control flow graph is given by:
\[ CFG = (X, E, entry, exit) \]
where $X = \{b_0, \ldots, b_n\}$ denotes the set of basic blocks $b_i$ of the program and $E \subseteq X \times X$ the corresponding set of edges connecting them. The entry node is denoted by $entry$ and the exit node by $exit$. Furthermore, we denote the set of basic blocks containing a conditional branch by $B \subseteq X$:
\[ B = \{ b \in X \mid \exists x_1, x_2 \in X,\ (b, x_1), (b, x_2) \in E \text{ and } x_1 \neq x_2 \} \]
We also use the set of successors and the set of predecessors of a block:
\[ \forall b_i \in X,\ S_{b_i} = \{ s \in X \mid (b_i, s) \in E \} \]
\[ \forall b_i \in X,\ P_{b_i} = \{ p \in X \mid (p, b_i) \in E \} \]
Finally, we use a partial function $label$ on the edges to mark the direction of the conditional branches:
\[ \forall b \in B,\ label((b, s_1)) \neq label((b, s_2)) \]
where $s_1$ and $s_2$ are the successors of $b$ ($S_b = \{s_1, s_2\}$) and the direction of a conditional branch is taken (1) or not-taken (0):
\[ D = \{0, 1\} \quad (\forall s \in S_b,\ label((b, s)) \in D) \]
Figure 3 shows a CFG that is used as an example in this paper. For this CFG, we have $X = \{b_0, \ldots, b_8\}$, $entry = b_0$, $exit = b_2$, $B = \{b_1, b_3, b_5\}$, $S_{b_3} = \{b_4, b_5\}$, $label((b_3, b_4)) = 1$, $label((b_3, b_5)) = 0$.

Figure 3. Example CFG

3.1. WCET analysis with the IPET

The IPET method [8] aims at determining the longest execution path and computing an upper bound on the WCET. It considers the control-flow graph of the task under analysis. A set of integer variables stands for the execution counts of the nodes (basic blocks) and edges. The execution time of the task can be expressed by
\[ T = \sum_{b_i \in X} x_{b_i} \cdot t_{b_i} \]
where $x_{b_i}$ and $t_{b_i}$ are the execution count and the worst-case execution time of basic block $b_i$, respectively. Estimating an upper bound on the WCET consists of maximizing this expression under a set of linear constraints.

Flow constraints express the control flow structure of the task, i.e., the relations that must exist between the execution counts of nodes ($x_{b_i}$ for basic block $b_i$) and edges ($x_{b_i \xrightarrow{d} b_j}$ for the edge from $b_i$ to $b_j$ labeled with direction $d$; in some equations, we will use the equivalent notation $x_{b_i \xrightarrow{d}}$). The flow constraints, where $l$ stands for the label of each edge, are:
\[ \forall b_i \in X,\ x_{b_i} = \sum_{p \in P_{b_i}} x_{p \xrightarrow{l} b_i} + init_{b_i} \]
\[ \forall b_i \in X,\ x_{b_i} = \sum_{s \in S_{b_i}} x_{b_i \xrightarrow{l} s} + end_{b_i} \]
\[ \sum_{b_i \in X} end_{b_i} = 1 \qquad \sum_{b_i \in X} init_{b_i} = 1 \]
Note that $end_{exit} = 1$ and $\forall b_i \in X,\ b_i \neq exit \Rightarrow end_{b_i} = 0$. Similarly, $init_{entry} = 1$ and $\forall b_i \in X,\ b_i \neq entry \Rightarrow init_{b_i} = 0$. From the CFG illustrated in Figure 3, the flow constraints for block $b_3$ are:
\[ x_{b_3} = x_{b_1 \xrightarrow{0} b_3} + x_{b_8 \xrightarrow{u} b_3} + init_{b_3} \]
\[ x_{b_3} = x_{b_3 \xrightarrow{0} b_5} + x_{b_3 \xrightarrow{1} b_4} + end_{b_3} \]

To bound the execution counts, the system needs some additional constraints which we call semantic constraints. They express loop bounds, infeasible paths, etc. [?]. For our example CFG, the system should at least include the following semantic constraints:
\[ x_{b_1 \xrightarrow{0} b_3} \le MAX_1 \cdot x_{b_0 \xrightarrow{u} b_1} \]
\[ x_{b_3 \xrightarrow{0} b_5} \le MAX_3 \cdot x_{b_1 \xrightarrow{0} b_3} \]
where $MAX_n$ is a loop bound, i.e., the maximum number of iterations of a loop, which depends on the semantics of the program.

An upper bound on the WCET can be obtained using Integer Linear Program solvers like lp_solve (see footnote 1). The solution is returned as the set of values of the execution counts that maximize the task execution time and satisfy all constraints.

3.2. Integrated analysis of dynamic branch prediction

The IPET method computes the overall execution time of a program by adding the execution times (costs) of the basic blocks weighted by their respective execution counts. To take the impact of branch prediction into account, the expression of the execution time is extended with an additional cost for each block that is the target of a conditional branch whenever it is mispredicted. This cost includes all the effects of branch prediction on the pipeline state. In this paper, we ignore the impact of branch prediction on cache memories (considered as perfect; see footnote 2).

Note that a conditional branch is always the last instruction of a basic block. In the rest of the paper, for the sake of simplicity, $b \in B$ is used to denote both a basic block and the conditional branch it ends with (if any). Variables $m_b^0$ and $m_b^1$ denote the number of mispredictions for branch $b \in B$ when it follows direction $d \in \{0, 1\}$ (0 is for not-taken, 1 for taken) while $v_b^d$ is the related cost. The overall execution time can be expressed by:
\[ T = \sum_{b_i \in X} x_{b_i} \cdot t_{b_i} + \sum_{b_j \in B} \sum_{d \in \{0,1\}} m_{b_j}^{d} \cdot v_{b_j}^{d} \]
Besides the addition of branch misprediction penalties to the objective function, additional linear constraints should express the behavior of the branch predictor.
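To make the extended objective concrete, the sketch below evaluates $T$ on a hypothetical toy CFG with a single loop (it is not the CFG of Figure 3). The block times, the misprediction penalties and the loop bound are invented values, and the maximization is done by plain enumeration instead of an ILP solver, which is only workable for such a tiny example. The flow constraints are satisfied by construction of the execution counts. Since no predictor model is included, the only safe assumption is that every execution of the branch may be mispredicted.

```python
from itertools import product

# Hypothetical toy CFG (assumption, not Figure 3):
#   b0 -> b1; b1 -(taken)-> b2 -> b1 (loop body); b1 -(not taken)-> b3 (exit).
BLOCK_TIME = {"b0": 5, "b1": 2, "b2": 10, "b3": 3}  # assumed t_b, in cycles
PENALTY = {0: 4, 1: 4}                              # assumed v_b1^d, in cycles
LOOP_BOUND = 3                                      # assumed semantic constraint MAX

def execution_time(counts, mispredictions):
    """T = sum of x_b * t_b plus sum over d of m_b1^d * v_b1^d."""
    base = sum(counts[b] * BLOCK_TIME[b] for b in BLOCK_TIME)
    extra = sum(mispredictions[d] * PENALTY[d] for d in (0, 1))
    return base + extra

def wcet_bound():
    """Maximize T by enumerating every count assignment allowed by the constraints."""
    best = 0
    for taken in range(LOOP_BOUND + 1):   # x of edge b1 -(1)-> b2, bounded by MAX
        not_taken = 1                     # x of edge b1 -(0)-> b3: the loop exits once
        counts = {"b0": 1, "b1": taken + not_taken, "b2": taken, "b3": 1}
        # Without predictor constraints, m^d is only bounded by the edge count.
        for m1, m0 in product(range(taken + 1), range(not_taken + 1)):
            best = max(best, execution_time(counts, {1: m1, 0: m0}))
    return best
```

Under these assumed values the bound is reached when the loop iterates three times and all four executions of the branch are mispredicted. Constraints that model the predictor would tighten the admissible values of the misprediction counts.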
The constraints that describe the branch predictor can be generated in a methodical way through the generic framework we introduce in the next section. They bound the worst-case number of mispredictions for both directions of each branch $b \in B$.

4. A framework to model dynamic branch predictors using IPET

In this section, we describe our framework. We show how the behaviour of any branch predictor is modeled by the IPET using the global approach.

Footnote 1: http://lpsolve.sourceforge.net/5.5/
Footnote 2: Perfect cache means that each access to the cache is a hit.

When the behaviour of the branch predictor at a given point in the program is to be determined, three questions should be answered.

The first one is: Which entry of the BHT is indexed to predict the branch? Formally, the set of all indexes is denoted by $\Lambda = [0, 2^n - 1]$ where $n$ is the size of the BHT index in bits (the BHT has $2^n$ entries). For each branch $b \in B$, $\Lambda_b \subseteq \Lambda$ is the set of possible indexes of the BHT that may be used to predict this branch. The index $\lambda_b \in \Lambda_b$ may depend on the state $\pi_b$ of the global history and on the branch instruction address $@_b$:
\[ \Lambda_b = \{ \lambda \mid \exists \pi_b \in \Pi_b,\ \lambda = f(@_b, \pi_b) \} \]
where $\Pi_b \subseteq \Pi$ is the set of possible values of the global history at $b$ ($\Pi$ is the set of all values of the global history). Note that $@_b$ is usually not the whole branch instruction address but only a significant part of it. For a bimodal predictor, the index is computed directly from the branch instruction address: $\Lambda_b = f(@_b, \pi_b) = \{@_b\}$. In the case of gshare, the mapping function $f$ is an exclusive-or: $\Lambda_b = \{ \lambda \mid \exists \pi_b \in \Pi_b,\ \lambda = @_b \oplus \pi_b \}$. The set of equations related to this first question is given in Section 4.1. Note that these equations are not derived for predictors that do not use a history register (e.g. bimodal).

The second question is: Which other branches may access the same BHT entry? Remember that the capacity of the BHT is limited and it is not tagged with branch addresses: some branches may share the same BHT entry. Such branches may update the 2-bit counter according to their own behaviour and thus influence the prediction.
Equations that express the aliasing in the BHT and the order in which competing branches access the shared BHT entry are given in Section 4.2.

The third question is: What may be the 2-bit counter's state and thus the predicted direction? Some equations are required to express the possible behaviours of the 2-bit counter stored in the indexed BHT entry. They are derived in Section 4.3.

Finally, all these equations must be related to build the whole dynamic branch prediction model. This is explained in Section 4.4. The framework being generic, the size of the BHT, the size of the history, and the index function are formally described.

4.1. BHT index

In this section, we show how the evolution of the global history can be modeled by a system of constraints. A global history is an n-bit register. The global history is updated when the branch outcome is known (1 for taken, 0 for not-taken). The update consists of a left shift of the $n-1$ rightmost bits and an insertion of the outcome of the last executed branch at the rightmost position. For instance, $\pi' = 0100$ is the new value of the global history when $\pi = 1010$ was the previous value and the last branch is resolved as not-taken. The update of the history is denoted by $\pi' = \pi.d$: the state $\pi'$ is the successor of the state $\pi$ (left shift and insert $d$ at the rightmost position).

The value of the global history at a given point $p$ in a program depends on the initial value of the global history and on the outcome of each executed branch along the path from the entry node to $p$. We introduce a graph to model the evolution of the global history. It depicts the value of the global history depending on the initial value and the outcome of previous branches.

The Branch History Graph (BHG) expresses the overall evolution of the global history for a program. Each node of the BHG corresponds to a pair $(b_i, \pi)$ where $b_i$ is a basic block of the CFG containing a conditional branch and $\pi$ is one possible value of the global history at this branch. The nodes labelled init target the branches that may be the first branch executed. For the sake of simplicity, end nodes are not shown on this graph and on the following ones. The BHG can be automatically built from the CFG. Figure 4 shows the BHG built for the CFG of Figure 3.

Figure 4. Example: BHG.

From the BHG, we derive a set of equations that will be included in the IPET formulation. The value of the global history depends on the outcome of the last executed branches: we use $p \xrightarrow{d} b$ to denote that $p \in B$ is the last branch to be executed before $b \in B$ and $d$ was the direction of branch $p$ (used to update the global history). One execution of the program under analysis consists of one path in the BHG from a node $(b, \pi)$, where $b$ is the first executed branch and $\pi$ the initial value of the branch history, to a node $(b', \pi')$ where $b'$ is the last executed branch and $\pi'$ the last value of the history. Flow constraints are derived for each node of the BHG: they link the execution counts of blocks with a given history register value ($x_b^{\pi}$) to the execution counts of their in- and out-edges ($x^{\pi}_{b \xrightarrow{d}}$). Furthermore, if $b$ is the first executed branch and $\pi$ the initial value of the global history then $init_b^{\pi} = 1$. Similarly, $end_b^{\pi} = 1$ if $b$ is the last executed branch. The set of predecessors of a node $(b, \pi)$ in the BHG is denoted by $P_b^{\pi}$.
\[ \forall b \in B, \forall \pi \in \Pi_b,\ x_b^{\pi} = \sum_{(p, \pi') \in P_b^{\pi}} x^{\pi'}_{p \xrightarrow{d}} + init_b^{\pi} \quad \text{where } \pi = \pi'.d \text{ and } p \xrightarrow{d} b \]
\[ \forall b \in B, \forall \pi \in \Pi_b,\ x_b^{\pi} = x^{\pi}_{b \xrightarrow{0}} + x^{\pi}_{b \xrightarrow{1}} + end_b^{\pi} \]
\[ \sum_{b \in B} \sum_{\pi \in \Pi_b} end_b^{\pi} \le 1 \qquad \sum_{b \in B} \sum_{\pi \in \Pi_b} init_b^{\pi} \le 1 \]
In [7] the value of the global history is computed for each basic block of the CFG. As this history may only evolve after a conditional branch, we have enhanced this part of the global approach by only taking into account the basic blocks that contain a conditional branch.

4.2. Execution context

The BHT is a $|\Lambda|$-entry table. Each branch $b$ may access $|\Lambda_b|$ 2-bit counters during the execution of the program. For a given $\lambda \in \Lambda_b$, the prediction of the branch depends on all branches sharing this entry. $B^{\lambda}$ denotes the set of branches that may access entry $\lambda$ of the BHT: $B^{\lambda} = \{ b \in B \mid \lambda \in \Lambda_b \}$. The prediction counter at $\lambda$ is updated at each execution of one branch accessing it. The prediction therefore depends on the access sequence to the entry $\lambda$. In other words, the prediction of a branch accessing the entry $\lambda$ depends on the execution sequence of branches in $B^{\lambda}$.

To model the sharing, we introduce a new graph. It depicts the possible execution sequences between branches in $B^{\lambda}$. A Branch Conflict Graph ($BCG^{\lambda}$) can automatically be derived for each BHT entry and contains as many blocks as the number of branches sharing this entry ($|B^{\lambda}|$). Figure 5 shows $BCG^{\lambda=00}$ for the example program ($B^{\lambda=00} = \{b_1, b_3, b_5\}$). The edge from $b_5$ to $b_3$ states that there is a path in the BHG from $(b_5, 00)$ to $(b_3, 00)$ and $b_3$ is the first block with history 00 along this path. Note that the label on this edge represents the direction of branch $b_5$: "0 or 1" is used when both directions may lead to the target block.

Each BCG is described by IPET: one execution of the program under analysis may consist of at most one path in a BCG from an initial node to a final node. The init edges of a $BCG^{\lambda}$ target the possible initial nodes: branches that may be the first executed with the index value $\lambda$ ($init_b^{\lambda} = 1$ if $b$ is the first executed branch with $\lambda$). Similarly, the end edges (not shown) start on the final nodes: branches that may be the last executed with this history value ($end_b^{\lambda} = 1$ if $b$ is the last executed branch with $\lambda$).
The set of constraints corresponding to $BCG^{\lambda}$ links the execution order between branches sharing the $\lambda$-th entry of the BHT. $P_b^{\lambda}$ (resp. $S_b^{\lambda}$) denotes the set of predecessors (resp. successors) of $b$ in $BCG^{\lambda}$.

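\[ \forall b \in B^{\lambda},\ x_b^{\lambda} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda}_{b \xrightarrow{0} s} + x^{\lambda}_{b \xrightarrow{1} s} \right) + end_b^{\lambda} \]
\[ \forall b \in B^{\lambda},\ x_b^{\lambda} = \sum_{p \in P_b^{\lambda}} \left( x^{\lambda}_{p \xrightarrow{0} b} + x^{\lambda}_{p \xrightarrow{1} b} \right) + init_b^{\lambda} \]
\[ \sum_{b \in B^{\lambda}} end_b^{\lambda} \le 1 \qquad \sum_{b \in B^{\lambda}} init_b^{\lambda} \le 1 \]

Figure 5. Example: Execution order between branches sharing the entry 00 of the BHT (Branch Conflict Graph)

4.3. Prediction computation

All branches in $B^{\lambda}$ may be predicted by the counter in the $\lambda$-th entry of the BHT. The prediction is given by the state $c$ of the counter ($c \in C$ and $C = \{00, 01, 10, 11\}$) and the counter is updated as a saturating 2-bit counter. We introduce a new graph representing the evolution of the 2-bit counter depending on the initial state and the possible transitions.

A Prediction Graph ($PG^{\lambda}$) is built for each prediction counter (each entry of the BHT) and expresses the possible evolution of this counter. Figure 6 shows $PG^{\lambda=00}$ for the example program in case the BHT is indexed by the global history. As in the previous steps, this graph is described by a system of constraints. Note that the set of possible transitions to reach (resp. leave) a state depends on the set of possible predecessors (resp. successors) in the related $BCG^{\lambda}$. One execution of the program under analysis consists of at most one path in a PG, from the initial state to the final state. For all $b \in B^{\lambda}$:
\[ x_b^{\lambda,00} = \sum_{p \in P_b^{\lambda}} \left( x^{\lambda,00}_{p \xrightarrow{0} b} + x^{\lambda,01}_{p \xrightarrow{0} b} \right) + init_b^{\lambda,00} \]
\[ x_b^{\lambda,00} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,00}_{b \xrightarrow{0} s} + x^{\lambda,00}_{b \xrightarrow{1} s} \right) + end_b^{\lambda,00} \]
\[ x_b^{\lambda,01} = \sum_{p \in P_b^{\lambda}} \left( x^{\lambda,00}_{p \xrightarrow{1} b} + x^{\lambda,10}_{p \xrightarrow{0} b} \right) + init_b^{\lambda,01} \]
\[ x_b^{\lambda,01} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,01}_{b \xrightarrow{0} s} + x^{\lambda,01}_{b \xrightarrow{1} s} \right) + end_b^{\lambda,01} \]
\[ x_b^{\lambda,10} = \sum_{p \in P_b^{\lambda}} \left( x^{\lambda,01}_{p \xrightarrow{1} b} + x^{\lambda,11}_{p \xrightarrow{0} b} \right) + init_b^{\lambda,10} \]
\[ x_b^{\lambda,10} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,10}_{b \xrightarrow{0} s} + x^{\lambda,10}_{b \xrightarrow{1} s} \right) + end_b^{\lambda,10} \]
\[ x_b^{\lambda,11} = \sum_{p \in P_b^{\lambda}} \left( x^{\lambda,10}_{p \xrightarrow{1} b} + x^{\lambda,11}_{p \xrightarrow{1} b} \right) + init_b^{\lambda,11} \]
\[ x_b^{\lambda,11} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,11}_{b \xrightarrow{0} s} + x^{\lambda,11}_{b \xrightarrow{1} s} \right) + end_b^{\lambda,11} \]
\[ \sum_{b \in B^{\lambda}} \sum_{c \in C} end_b^{\lambda,c} \le 1 \qquad \sum_{b \in B^{\lambda}} \sum_{c \in C} init_b^{\lambda,c} \le 1 \]
A misprediction occurs each time a fired transition does not match the prediction given by its source state, e.g., leaving the state 00 by a transition labeled 1:
\[ m_b^{\lambda,1} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,00}_{b \xrightarrow{1} s} + x^{\lambda,01}_{b \xrightarrow{1} s} \right) \]
\[ m_b^{\lambda,0} = \sum_{s \in S_b^{\lambda}} \left( x^{\lambda,10}_{b \xrightarrow{0} s} + x^{\lambda,11}_{b \xrightarrow{0} s} \right) \]
In [7] Li et al. consider a 1-bit counter.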
In this case, the evolution does not need to be modeled by additional constraints because the counter state is the outcome of the last executed branch. The number of mispredictions is given by:
\[ m_b^{\lambda,1} = \sum_{s \in S_b^{\lambda}} x^{\lambda,0}_{b \xrightarrow{1} s} \qquad m_b^{\lambda,0} = \sum_{s \in S_b^{\lambda}} x^{\lambda,1}_{b \xrightarrow{0} s} \]
In the remainder of this paper, we only focus on 2-bit counters, which are the most frequently used due to their higher accuracy.

4.4. Model of branch predictor

In this subsection we describe the whole framework. First, we focus on the generation of the graphs: branch history graph, branch conflict graphs and prediction graphs. Then, we show how to build the whole system of constraints: it is based on the three subsets of constraints described in the previous subsections and on some additional constraints that are necessary to create a link between the three subsets and the IPET from the CFG. Finally, we summarize how to use the framework: what the parameters are, and an example of configuration.

Graph generation

All the graphs on which the generation of constraints is based are generated from the CFG. The BHG is derived from a subset of the CFG blocks: it contains only the basic blocks with two outgoing edges (conditional branch). Each BCG is built from a part of the BHG or of the CFG. $BCG^{\lambda}$ is built from:
- all BHG blocks $(b, \pi)$ with $\lambda = f(\pi)$, in case a global history is used,
- all CFG blocks ending with a conditional branch whose BHT index is $@_b = \lambda$, for the bimodal predictor.
From each $BCG^{\lambda}$, one $PG^{\lambda}$ is built: for each edge in $BCG^{\lambda}$ the corresponding transitions are generated in the related $PG^{\lambda}$.
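The first two generation steps can be sketched in Python as follows. The code assumes that the CFG has already been reduced to a relation between conditional branches (which branch may execute next after branch b follows direction d); that relation, the branch names and the example values are hypothetical and do not reproduce Figures 3 and 4. The sketch builds the reachable BHG nodes with the history update $\pi' = \pi.d$ and then groups them by BHT index. It only produces the node sets behind each $BCG^{\lambda}$, not the BCG edges or the prediction graphs.

```python
from collections import deque

def shift_in(history: int, outcome: int, bits: int) -> int:
    """History update pi' = pi.d: left shift, insert the outcome on the right."""
    return ((history << 1) | outcome) & ((1 << bits) - 1)

def build_bhg(next_branch, first_branches, initial_history, bits):
    """Return the reachable BHG as {(b, pi): {(d, (b', pi'))}}.

    next_branch[(b, d)] lists the conditional branches that may execute next
    after branch b follows direction d; a missing key means the program may end.
    """
    edges = {}
    todo = deque((b, initial_history) for b in first_branches)
    while todo:
        node = todo.popleft()
        if node in edges:
            continue  # node already expanded
        edges[node] = set()
        b, pi = node
        for d in (0, 1):
            succ_pi = shift_in(pi, d, bits)
            for nb in next_branch.get((b, d), []):
                succ = (nb, succ_pi)
                edges[node].add((d, succ))
                todo.append(succ)
    return edges

def group_by_entry(nodes, index_fn):
    """Group BHG nodes by BHT index: lambda = index_fn(branch, history)."""
    groups = {}
    for b, pi in nodes:
        groups.setdefault(index_fn(b, pi), set()).add((b, pi))
    return groups

# Hypothetical branch-level relation: b1 is a loop test, b3 a test in the loop body.
NEXT = {("b1", 1): ["b3"], ("b3", 0): ["b1"], ("b3", 1): ["b1"]}
bhg = build_bhg(NEXT, first_branches=["b1"], initial_history=0b00, bits=2)
entries = group_by_entry(bhg, lambda b, pi: pi)  # BHT indexed by the global history
```

With a 2-bit history, the example yields five BHG nodes, and entry 11 of the BHT is shared by both branches. Passing a different `index_fn` (for example an exclusive-or of the history with an address taken from a lookup table) would give the grouping for a gshare-like configuration.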

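Figure 6. Example: 2-bit counter evolution for $\lambda = 00$ ($PG^{00}$, history-indexed BHT)

Graph      CFG      BHG          BCG^λ          PG^λ
Variable   $x_b$    $x_b^{\pi}$  $x_b^{\lambda}$  $x_b^{\lambda,c}$

Table 2. Variable used for the number of occurrences of a block in the different graphs.

The whole framework

In the rest of this section, we explain how to build the whole system of constraints and list the additional constraints generated for that purpose. Table 2 recalls the notation for the number of occurrences of a block depending on the corresponding graph.

A BHG is needed when a history is used to index the BHT. If the histories are local, then a BHG is built for each history register (see footnote 3). In case of a global history, one BHG is generated. In a BHG, more than one block $(b, \pi)$ may refer to the same block $b$ in the CFG. The set of constraints generated to link the subsystems related to the CFG and the BHG is:
\[ \forall b \in B,\ x_b = \sum_{\pi \in \Pi_b} x_b^{\pi} \]
\[ \forall b \in B, \forall d \in \{0,1\},\ x_{b \xrightarrow{d}} = \sum_{\pi \in \Pi_b} \left( x^{\pi}_{b \xrightarrow{d}} + end^{\pi}_{b \xrightarrow{d}} \right) \]
\[ \forall b \in B, \forall \pi \in \Pi_b,\ end_b^{\pi} = end^{\pi}_{b \xrightarrow{0}} + end^{\pi}_{b \xrightarrow{1}} \]
Note that accounting for the last branch in the BHG does not include the last outcome of this branch. However, in the CFG this last outcome is needed to obtain the correct number of occurrences of the related edge.

Footnote 3: Note that in case of a pattern history table (PHT) containing local histories, some branches may share a PHT entry: a BHG is derived for each entry of the PHT.

One $BCG^{\lambda}$ is generated for each $\lambda \in \Lambda$. The $|\Lambda|$ subsets of constraints derived from those BCGs are linked to the whole system of constraints by:
\[ \forall b \in B, \forall \lambda \in \Lambda_b,\ x_b^{\lambda} = x_b^{\pi} \quad \text{where } \lambda = f(\pi, @_b) \]
\[ \forall b \in B, \forall d \in \{0,1\},\ x_{b \xrightarrow{d}} = \sum_{\lambda \in \Lambda_b} \left( \sum_{s \in S_b^{\lambda}} x^{\lambda}_{b \xrightarrow{d} s} + end^{\lambda}_{b \xrightarrow{d}} \right) \]
\[ \forall b \in B, \forall \lambda \in \Lambda_b,\ end_b^{\lambda} = end^{\lambda}_{b \xrightarrow{0}} + end^{\lambda}_{b \xrightarrow{1}} \]
Note that in case of a bimodal branch predictor, $x_b^{\pi} = x_b$ and $f(\pi, @_b) = @_b$.

For each $BCG^{\lambda}$ one $PG^{\lambda}$ is generated. The corresponding linking constraints are:
\[ \forall b \in B, \forall \lambda \in \Lambda_b,\ x_b^{\lambda} = \sum_{c \in C} x_b^{\lambda,c} \]
\[ \forall b \in B, \forall \lambda \in \Lambda_b, \forall s \in S_b^{\lambda}, \forall d \in \{0,1\},\ x^{\lambda}_{b \xrightarrow{d} s} = \sum_{c \in C} x^{\lambda,c}_{b \xrightarrow{d} s} \]
\[ \forall b \in B, \forall \lambda \in \Lambda_b,\ end_b^{\lambda} = \sum_{c \in C} end_b^{\lambda,c} \]
\[ \forall b \in B, \forall \lambda \in \Lambda_b,\ init_b^{\lambda} = \sum_{c \in C} init_b^{\lambda,c} \]
\[ \forall b \in B, \forall d \in \{0,1\},\ m_b^{d} = \sum_{\lambda \in \Lambda_b} m_b^{\lambda,d} \]
Note that if $b \in B^{\lambda}$ is the first (resp. last) executed with $\lambda$ then $init_b^{\lambda} = \sum_{c \in C} init_b^{\lambda,c} = 1$ (resp. $end_b^{\lambda} = \sum_{c \in C} end_b^{\lambda,c} = 1$).

Configuration of the framework

The global approach was introduced by Li et al. [7].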
The first model was for a 1-bit global predictor whose BHT is indexed by the branch history. Their model accounts for interferences due to the sharing of BHT entries. This model corresponds to the following configuration of our framework:
- branch history: global
- index of the BHT: λ = f(π,@) = π
- size of the prediction counter: 1 bit

In [4], a model of a branch predictor based on 2-bit counters was introduced and we used it to build the models of two 2-bit branch predictors: a bimodal predictor (PC-based indexing) and a global 2-bit predictor whose BHT is indexed by a global branch history. This model corresponds to the following configuration of our framework:
- branch history: global
- index of the BHT: λ = f(π,@) = π or λ = f(π,@) = @
- size of the prediction counter: 2 bits

In this paper, we present the model in a generic way that eases the understanding of the set of constraints. Furthermore, this framework makes it possible to model any branch predictor that uses a BHT. The configuration of the framework depends on:
- size of the BHT
- size of the history
- branch history: global or local
- index of the BHT: λ = f(π,@)
- size of the prediction counter: 2 bits or 1 bit

Note that the index may be any index that can be generated from the branch instruction address and a branch history. Furthermore, there may be more than one history if a PHT is used.

The framework is implemented in OTAWA, a WCET computation tool [1]. For a given branch predictor, this tool may compute the WCET by integrating the constraints into the whole ILP system of constraints. It may also separately generate the system of constraints for the branch predictor model, which can then be added to any other WCET tool. This only requires using the same variables for the CFG nodes.

5. Experimental results

Remember that the models built by our framework are to be included in the IPET formulation of WCET computation to take into account the modeled branch predictor. In this section, our goal is to give an insight into how complex the generated models can be. In other words, we evaluate the global approach by comparing some models generated by our framework. With this information, the end-user may retain or reject a possible target core with dynamic branch prediction, depending on the resources (including time) they are ready to allocate to timing analysis. This may be useful when the development process relies on very short steps for new code generation and analysis between two test phases. Before this, we report some bounds on the WCET that we evaluated for some benchmark codes.
They show that taking dynamic branch prediction into account when performing WCET analysis is really needed to avoid the overestimation that results from considering every conditional branch as mispredicted.

Using the framework, we have generated models for three branch predictors based on 2-bit counters. They differ in how the BHT is indexed. The first one (BP_@) is the so-called bimodal predictor that selects a BHT entry from the branch address (direct-mapping scheme: λ = f(π,@) = @). The second one (BP_H) uses the global history (built from the outcomes of the last executed branches) as an index to the BHT (λ = f(π,@) = π). The third one (BP_@⊕H), known as gshare, computes the index to the BHT by XOR-ing some bits of the global history and some bits of the branch PC (λ = f(π,@) = π ⊕ @). For the three of them, we assume a 16-entry Branch History Table (choosing such a small size was dictated by the small size of our benchmarks). For the last two, we consider a 4-bit global history. Bits 4 to 7 of the branch instruction address are used to index the BHT when the branch predictor is bimodal or gshare.

Our experiments consider a processor with a 2-way superscalar 4-stage pipeline that implements out-of-order execution. We assume that the system includes SRAM memories for instructions and data, with a short (1-cycle) latency. The models for dynamic branch predictors have been implemented in OTAWA, our WCET computation tool [1]. Implementing a branch predictor model consists in adding specific constraints to the basic IPET formulation.

We consider a set of benchmarks that belong to the SNU-RT collection (http://archi.snu.ac.kr/realtime/benchmark/) and are frequently considered for the evaluation of WCET analysis techniques. They are listed in Table 3.

Benchmark     # basic blocks   # branches   Function
fibcall        7                1           Summing the Fibonacci series
insertsort     8                2           Insertion sort for 10 integer numbers
jfdctint      13                3           JPEG slow-but-accurate integer implementation of the forward DCT (Discrete Cosine Transform)
matmul        19                5           Matrix multiplication
qurt          74               21           Root computation of quadratic equations

Table 3. Benchmarks

5.1. Impact of branch prediction modeling on the WCET

Table 4 shows how modeling dynamic branch prediction improves the accuracy of WCET estimates, compared with considering each branch as mispredicted. Results are provided for each of the three predictors considered in this paper. For most of the benchmarks, taking dynamic branch prediction into account improves the estimated WCET by 5% to 10%, and the improvement even reaches more than 25% for the fibcall program. The weakest results are for the jfdctint program, which contains only three branches executed in sequence.

Benchmark     BP_@      BP_H      BP_@⊕H
fibcall       -27.7%    -25.0%    -27.7%
insertsort    -10.9%    -10.1%     -9.1%
jfdctint       -4.5%     -3.7%     -3.5%
matmul         -8.1%     -7.8%     -6.9%
qurt          -11.2%     -5.4%     -9.8%

Table 4. Impact on the estimated WCET

These results show how important it is to model dynamic branch prediction when the WCET accuracy is critical. However, modeling branch prediction has a cost in terms of WCET analysis time. This will be shown below and should be considered when selecting a processor with dynamic branch prediction.

5.2. Modeling complexity of dynamic branch predictors

In an earlier paper [4], we argued that the ILP modeling complexity of a particular scheme may be quantified by the number of constraints (C), the number of variables (V) and the arity (A), which is defined as the maximum number of variables per constraint in the system. In [4], on a number of benchmarks, we observed that the time to solve the ILP problem (i.e. to determine an upper bound of the WCET considering dynamic branch prediction) was closely related to these three values. The modeling complexity helps in getting information about the complexity of the models independently of the solver that may be used. In [4], we provided formulas that express the three complexity values as a function of the program characteristics (e.g. number of conditional branches, number of possible history values for each branch, etc.). Further details, including the modeling complexity equations of the three branch predictors compared in this section, are given in a technical report [10] which forms an extended version of this paper.

Benchmark     # cons. (C)   # vars (V)   arity (A)
fibcall        18            16          3
insertsort     21            19          3
jfdctint       32            30          3
matmul         46            44          3
qurt          152           170          5

Table 5. Initial complexity of the ILP formulation

Benchmark     BP_@   BP_H   BP_@⊕H
fibcall       1      0.8    0.9
insertsort    1.4    2.6    1.6
jfdctint      1.5    2.3    1.9
matmul        1.1    2.7    1.4
qurt          2.5    6.5    3.8

Table 7. Aliasing in the BHT (number of branches sharing an entry)

In the following, we use the (C,V,A) triplet to evaluate the complexity of the model for a given dynamic branch predictor. Table 6 shows the modeling complexity of three dynamic branch predictors. For all the benchmarks, the bimodal predictor (BP_@) is the one that exhibits the lowest complexity by far. This is due to the absence of constraints that model the changes on the global history (since this predictor does not use it).
On the contrary, the history-indexed predictor (BP_H) has the largest complexity because several branches may share the same BHT entry. This occurs less often with the gshare predictor (BP_@⊕H), which improves the distribution of branches over the table entries. As a result, the complexity of BP_@⊕H is generally lower than that of BP_H. An exception is fibcall, which has a very small number of branches.

Table 7 shows the mean number of branches that share a BHT entry, which quantifies the so-called aliasing phenomenon. The bimodal (BP_@) and gshare (BP_@⊕H) predictors generally exhibit fewer conflicts than the pure global history-based predictor (BP_H). It appears that the program with the highest rate of conflicts (qurt) is the one that exhibits the highest complexity by far. Our measurements also show that this program is the one that requires the longest time to solve the ILP problem (200 ms while the other programs needed 0.2 ms for the bimodal predictor).

Table 8 gives the modeling complexity of a global branch predictor (BP_H) when considering two sizes for the prediction history register: 2 bits and 4 bits. Whatever the metric (C, V or A), the complexity significantly increases with the history register length.

Benchmark     # constraints (C)   # variables (V)   arity (A)
fibcall       1.4                 1.8               1.9
insertsort    3.2                 3.6               3.4
jfdctint      2.7                 3.0               3.3
matmul        3.2                 4.0               4.4
qurt          2.5                 3.1               6.7

Table 8. Impact of the history length on the complexity (4-bit vs 2-bit)

The results presented above highlight the factors that influence the modeling complexity: the indexing scheme to the BHT, the length of the history register, and the average number of branches sharing an entry in the BHT. Indexing the BHT with the global history register (BP_H) generates a high complexity, and this is amplified by a larger history register. Using an XOR operator (BP_@⊕H) to combine the history and the branch PC reduces the aliasing phenomenon and clearly reduces the complexity.
We claim that these results should be kept in mind when selecting a processor with dynamic branch prediction, at least when the computational complexity of WCET analysis is an issue.
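The effect of the indexing scheme on aliasing can be illustrated with a short Python sketch. It uses the configuration of this section (16-entry BHT, 4-bit global history, bits 4 to 7 of the branch address), but everything else is an assumption made for the example: the function names, the bit numbering (bit 0 is taken to be the least significant bit), the (address, history) pairs, and the definition of `mean_sharing` as the average number of distinct branches per used entry. It is one possible reading of the aliasing measure and is not claimed to reproduce Table 7.

```python
from collections import defaultdict

BHT_BITS = 4                   # 16-entry BHT
MASK = (1 << BHT_BITS) - 1


def pc_bits(addr):
    """Bits 4 to 7 of the branch address (bit 0 = least significant)."""
    return (addr >> 4) & MASK


def index_bimodal(history, addr):
    """BP_@: lambda = f(pi, @) = @."""
    return pc_bits(addr)


def index_global(history, addr):
    """BP_H: lambda = f(pi, @) = pi."""
    return history & MASK


def index_gshare(history, addr):
    """BP_@xorH: lambda = f(pi, @) = pi XOR @."""
    return (history ^ pc_bits(addr)) & MASK


def mean_sharing(pairs, index_fn):
    """Average number of distinct branches per used BHT entry."""
    entries = defaultdict(set)
    for addr, history in pairs:
        entries[index_fn(history, addr)].add(addr)
    return sum(len(branches) for branches in entries.values()) / len(entries)


# Hypothetical (branch address, global history) pairs for three branches.
pairs = [(0x100, 0b0011), (0x110, 0b0011), (0x120, 0b0101), (0x120, 0b0011)]

sharing_bimodal = mean_sharing(pairs, index_bimodal)
sharing_global = mean_sharing(pairs, index_global)
sharing_gshare = mean_sharing(pairs, index_gshare)
print(sharing_bimodal, sharing_global, sharing_gshare)
```

With these hypothetical pairs, the three branches collide on entry 0011 under pure history indexing, while the address-based and XOR-based indexes spread them over distinct entries.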

              # constraints (C)                  # variables (V)                    arity (A)
Benchmark     BP_@   BP_@⊕H/BP_@   BP_H/BP_@     BP_@   BP_@⊕H/BP_@   BP_H/BP_@     BP_@   BP_@⊕H/BP_@   BP_H/BP_@
fibcall       2.3     6.4            4.4         2.6     8.6            6.5         3.0     5.4            5.4
insertsort    4.0    17.2           51.2         4.8    27.1           90.2         3.0    13.5           46.8
jfdctint      3.1    10.8           28.5         3.3    14.5           42.6         3.0    10.2           40.2
matmul        3.4     7.8           24.5         4.0     9.6           40.8         3.0     6.9           56.1
qurt          4.4    10.3           28.1         5.2    13.0           45.8         3.4    10.5           81.6
average       3.4    10.5           27.3         4.0    14.6           45.1         3.1     9.3           46.0

Table 6. Modeling complexity of 2-bit dynamic branch predictors

6. Conclusion

In hard real-time systems, the execution of programs must meet deadlines. An upper bound on the worst-case execution time is computed to check whether timing constraints can be guaranteed. Static WCET analysis requires a detailed modeling of the execution of basic blocks in the processor, which has long been restricted to very simple cores. However, performance requirements now lead to the use of more complex processors that implement sophisticated schemes, such as dynamic branch predictors, which we focus on in this paper.

Developing a model for a dynamic branch predictor requires considerable work. Several solutions have been reported in the literature, as mentioned in Section 1, but each of them is applicable to a given scheme (e.g. with 1-bit counters and local history registers) and cannot easily be extended to other ones. In this paper, we introduced a framework that accounts for the most common functionalities of dynamic branch predictors and makes it possible to generate a model of a new predictor in an automatic way. We described each step of the framework in sufficient detail so that readers can easily derive their own models.

In the evaluation section, we use our framework to generate models for three kinds of dynamic branch predictors based on 2-bit counters (by far the most frequently used) and, for some of them, on a global history register, which exacerbates the interactions between branches.
We show that modeling the branch predictor instead of considering each branch as mispredicted leads to tighter WCET estimates. However, our results also show that the complexity of the models is high, especially for the most sophisticated predictors. It generally exceeds the complexity of the integer linear program used to compute the WCET. Since taking branch prediction into account when determining WCETs may not be affordable, our recommendation is to take this study into consideration when selecting a processor to run time-critical software.

As future work, we plan to extend this framework to complementary solutions that focus on the analysis of the Branch Target Buffer, such as [5, 6].

References

[1] C. Ballabriga, H. Cassé, C. Rochange, and P. Sainrat. OTAWA: An Open Toolbox for Adaptive WCET Analysis. In S. Min, R. Pettit, P. Puschner, and T. Ungerer, editors, Software Technologies for Embedded and Ubiquitous Systems, volume 6399 of Lecture Notes in Computer Science, pages 35-46. Springer Berlin / Heidelberg, 2011.
[2] I. Bate and R. Reutemann. Efficient integration of bimodal prediction and pipeline analysis. In 11th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 05), pages 39-44, August 2005.
[3] C. Burguière and C. Rochange. A Case for Static Branch Prediction Modeling in Real-Time Systems. In Proceedings of the 11th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 05), pages 33-38, August 2005.
[4] C. Burguière and C. Rochange. On the Complexity of Modeling Dynamic Branch Predictors when Computing Worst-Case Execution Time. In Proceedings of the ERCIM/DECOS Workshop On Dependable Embedded Systems, August 2007.
[5] A. Colin and I. Puaut. Worst-case execution time analysis for a processor with branch prediction. Journal on Real-Time Systems, 18(2-3):249-274, May 2000.
[6] D. Grund, J. Reineke, and G. Gebhard. Branch target buffers: WCET analysis framework and timing predictability. Journal of Systems Architecture, 2010.
[7] X. Li, T. Mitra, and A. Roychoudhury. Modeling Control Speculation for Timing Analysis. Journal on Real-Time Systems, 29(1):27-58, January 2005.
[8] Y.-T. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the ACM/SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES 95), volume 30, pages 88-98, 1995.
[9] T. Lundqvist and P. Stenström. A method to improve the estimated worst-case performance of data caching. In 6th International Conference on Real-Time Computing Systems and Applications (RTCSA 99), December 1999.
[10] C. Maïza and C. Rochange. A framework for the timing analysis of dynamic branch predictors. Technical Report IRIT/RR 2011-12-FR, IRIT, 2011.
[11] J. Smith. A Study of Branch Prediction Strategies. In Proceedings of the 8th ACM/IEEE International Symposium on Computer Architecture (ISCA 1981), pages 135-148, 1981.
[12] T.-Y. Yeh and Y. N. Patt. Alternative Implementation of the Two-Level Adaptive Branch Prediction. In Proceedings of the 19th ACM/IEEE International Symposium on Computer Architecture (ISCA 1992), pages 124-134, 1992.