An Enhanced Two-Level Adaptive Multiple Branch Prediction for Superscalar Processors

Jong-bok Lee, Soo-Mook Moon and Wonyong Sung
{jblee@mpeg, smoon@altair, wysung@dsp}.snu.ac.kr
School of Electrical Engineering, Seoul National University
San 56-1 ShinLim-Dong, KwanAk-Gu, Seoul 151-742, Korea

Abstract

This paper proposes an enhanced method of multiple branch prediction using a per-primary branch history table. This scheme improves on previous ones based on a single global branch history register by reducing the interference among histories of different branches caused by sharing a single register. The scheme also keeps the prediction of one branch from affecting the predictions of other branches made in the same cycle, allowing independent and parallel prediction of multiple branches. Our experimental results indicate that these features help to achieve higher prediction accuracy than the previous global history scheme (which is already high) at a lower hardware cost (i.e., 96.1% vs. 95.1% for integer code and 95.7% vs. 94.9% for floating-point code including nasa7, for a given hardware budget of 128K bits). Moreover, the increased prediction accuracy yields a better fetch bandwidth for a superscalar machine (i.e., 7.1 vs. 6.9 instructions per clock cycle for integer code and 11.0 vs. 10.9 instructions per cycle for floating-point code).

1 Introduction

In order to increase the fetch bandwidth of a superscalar processor, we need to predict more than one branch and to fetch multiple non-consecutive basic blocks in a single cycle. Yeh and Patt developed a two-level adaptive branch prediction scheme [1, 2, 3] and extended it to predicting multiple branches per cycle [4]. Several variations of the scheme have been introduced in [4], yet all of them use a global history register, which makes every branch share the same space for storing its prediction history. An obvious problem of the global history register is the interference between different branches caused by this sharing. Moreover, the multiple branches that must be predicted simultaneously have dependences in their prediction, which lowers the prediction accuracy as the number of simultaneously predicted branches increases. In order to overcome these shortcomings, this paper proposes an enhanced two-level adaptive multiple branch prediction using a per-primary address branch history table. In this scheme, only those branches that are predicted simultaneously share the same space, thus reducing interference. Moreover, the multiple branches predicted in the same cycle are not constrained by any dependences, so they are predicted independently. We propose several hardware configurations for the per-primary address branch history scheme and compare them with the previous ones through simulation. The performance is evaluated by an empirical study on a subset of the SPEC benchmark suite using trace-driven simulation. Our results indicate that the proposed scheme improves the branch prediction accuracy, and hence the fetch bandwidth of superscalar processors, under the same hardware budget. The rest of this paper is organized as follows. Section 2 briefly reviews two-level adaptive branch prediction and the previous multiple branch prediction schemes. Section 3 describes the per-primary address history scheme. Section 4 presents the simulation environment and results.

Finally, a summary follows in Section 5.

2 Previous Two-Level Adaptive Multiple Branch Prediction

Many branch prediction schemes that utilize the run-time execution history have been proposed [5, 6, 7], yet two-level adaptive branch prediction is known to obtain the highest prediction accuracy [1, 2, 8]. Two-level adaptive branch prediction uses two major data structures, the branch history register (BHR) and the pattern history table (PHT), as shown in Figure 1 (a).

Figure 1: Two-level adaptive multiple branch prediction. (a) The basic structure. (b) The multiple branch prediction.

The BHR is used to record the taken/not-taken history of branches. For each possible pattern in the BHR, a pattern history is recorded in the PHT. When the BHR contains k bits to record the history of the last k branches, there are 2^k possible patterns in the BHR. Hence, the PHT has 2^k entries, each of which contains a 2-bit up-down saturating counter to record the execution history of the corresponding pattern in the BHR. The counter is incremented when the branch is taken; otherwise, it is decremented. A branch prediction is made by interpreting the two pattern history bits, as in Figure 1 (a).
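To make the mechanism concrete, a minimal Python sketch of this basic structure follows. It is our illustration rather than code from the paper; the class name and the counter initialization are assumptions.

    class TwoLevelPredictor:
        """Minimal two-level adaptive predictor: a k-bit BHR indexes a
        PHT of 2-bit up-down saturating counters."""

        def __init__(self, k):
            self.k = k
            self.bhr = 0                      # last k outcomes, 1 = taken
            self.pht = [1] * (2 ** k)         # assumed weakly-not-taken start

        def predict(self):
            # The high bit of the 2-bit counter gives the direction.
            return self.pht[self.bhr] >= 2

        def update(self, taken):
            ctr = self.pht[self.bhr]
            # Saturating increment on taken, decrement on not-taken.
            self.pht[self.bhr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
            # Shift the outcome into the BHR, keeping only k bits.
            self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.k) - 1)

For example, a predictor with k = 4 keeps a 4-bit history and 16 two-bit counters.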

The method of multiple branch prediction proposed by Yeh and Patt employs a simple extension of the above scheme [4]. The extended global history scheme makes the prediction of the immediately following branch and extrapolates the predictions of subsequent branches. As shown in Figure 1 (b), all k bits in the history register are used to index into the PHT to make a primary branch prediction. To predict the secondary branch, the right-most k-1 branch history bits are used to index into the PHT. Since not all k bits are used as an index, the k-1 bits address 2 adjacent entries in the PHT. The primary branch prediction is then used to select one of these entries to make the secondary branch prediction. Similarly, the tertiary prediction uses the right-most k-2 history register bits to address the PHT and accesses 4 adjacent entries. The primary and secondary predictions are used to select one of the 4 entries for the tertiary branch path prediction. Since a global BHR and a global PHT are employed, this scheme is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and a Global Pattern History Table (MGAg). When a separate PHT is employed for each primary branch address, it is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and Per-Primary Address Pattern History Tables (MGAp). MGAp has several disadvantages. As with MGAg, the prediction of a branch is interfered with by the history of other branches due to the use of a single global history register. There is also a problem in the prediction mechanism of multiple branches. For example, when two branches are predicted, the prediction of the secondary branch is based on the as-yet unresolved prediction of the primary branch; if the primary branch is mispredicted, the secondary branch may also be mispredicted. This dependence forces the prediction values of multiple branches to be generated sequentially, which may affect the cycle time since the table lookup for prediction already requires a considerable amount of time.
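The MGAg indexing just described can be sketched as follows. This is our own illustration consistent with the description above; the helper name and the exact bit packing are assumptions. Each earlier prediction fills in a low-order index bit left free by using fewer history bits.

    def mgag_predict(pht, bhr, k):
        """Predict up to three branches from one k-bit global history.
        `pht` holds 2 ** k two-bit counters; taken iff counter >= 2."""
        primary = pht[bhr] >= 2

        # Secondary: the right-most k-1 history bits address 2 adjacent
        # entries; the primary prediction selects which one to use.
        base2 = (bhr & ((1 << (k - 1)) - 1)) << 1
        secondary = pht[base2 | int(primary)] >= 2

        # Tertiary: the right-most k-2 history bits address 4 adjacent
        # entries; the primary and secondary predictions select one.
        base3 = (bhr & ((1 << (k - 2)) - 1)) << 2
        tertiary = pht[base3 | (int(primary) << 1) | int(secondary)] >= 2

        return primary, secondary, tertiary

The chain from primary to secondary to tertiary in this sketch is exactly the sequential dependence discussed above: each later lookup cannot be resolved until the earlier prediction is available.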

Another method of multiple branch prediction has been proposed by Dutta and Franklin, in which a tree-like subgraph of the control flow graph is employed [9]. In this scheme, multiple branches are predicted indirectly by predicting a path in the subgraph. The advantage is that multiple branch predictions are made in a cycle without determining the addresses of those branches. However, instead of a condensed history of all branches, the subgraph history pattern must be stored. Unfortunately, the reported branch prediction accuracy is not higher than that of Yeh and Patt's scheme or of the proposed multiple branch predictor that employs a 2-bit saturating up-down counter for each PHT entry.

3 Multiple Branch Prediction Using a Per-Primary Address Branch History Table

Our scheme of multiple branch prediction is simple and straightforward. In order to reduce interference in the first level of branch histories, one history register is provided for each distinct primary branch, which introduces the Branch History Table (BHT). For two branch predictions with a branch history length of k bits, each entry of the BHT is a BHR of 2k bits. In addition, separate PHTs are employed for each primary branch address. Each entry of the new PHT contains a single 2-bit saturating up-down counter, like the original PHT. Both the primary and secondary branches are predicted by accessing the BHT and the PHT using the primary branch address. When two branches are predicted each cycle, the first k bits and the second k bits of the history register are used to index into the PHT separately for the prediction of the primary and the secondary branch, respectively. When three branches are predicted each cycle, each entry of the BHT is composed of 3k bits, and the last k bits are used

for the prediction of the tertiary branch. Only the primary branch address is used for accessing the BHT and the PHT, since the branch addresses of the secondary and the tertiary branch are not known at the time of prediction. This multiple predictor is referred to as Two-Level Adaptive Multiple Branch Prediction Using a Per-Primary Address Branch History Table and Per-Primary Address Pattern History Tables (MPAp). Figure 2 depicts the prediction mechanism of the MPAp scheme for two branch predictions.

Figure 2: The MPAp scheme for two branch predictions.

In the previous global history scheme, the prediction of a branch is affected by the history of other branches since all branch predictions are based on a single global history register. In our scheme, however, only the two secondary branches share a single BHR associated with each primary branch address in the case of two branch predictions. In the case of three branch predictions, the two secondary branches and the four tertiary branches share it. Figure 3 (a) and (b) show which basic blocks share the same BHR when two and three branches are predicted, respectively. Although there is still interference among the branches that share the same BHR in the same cycle, it is much less than in the global history scheme. Another advantage of this scheme is that it does not create dependences among the simultaneously predicted branches.
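A rough sketch of the MPAp lookup for two branch predictions is given below. It is our own illustration; the class name, the modulo address mapping, and the default table sizes are assumptions rather than details from the paper, and the update path is omitted for brevity.

    class MPApPredictor:
        """Illustrative MPAp sketch for two branch predictions per cycle.
        Each primary branch address maps to a BHT entry holding a 2k-bit
        history and to one of `num_phts` pattern history tables of 2-bit
        saturating counters; the two k-bit halves of the history index
        that PHT independently."""

        def __init__(self, k, bht_entries=1024, num_phts=16):
            self.k = k
            self.mask = (1 << k) - 1
            self.bht = [0] * bht_entries                  # 2k-bit histories
            self.phts = [[1] * (2 ** k) for _ in range(num_phts)]
            self.bht_entries = bht_entries
            self.num_phts = num_phts

        def predict(self, primary_addr):
            # Only the primary branch address is needed, so both lookups
            # are independent and can proceed in parallel.
            hist = self.bht[primary_addr % self.bht_entries]
            pht = self.phts[primary_addr % self.num_phts]
            primary_hist = (hist >> self.k) & self.mask   # first k bits
            secondary_hist = hist & self.mask             # second k bits
            primary = pht[primary_hist] >= 2
            secondary = pht[secondary_hist] >= 2
            return primary, secondary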

Figure 3: Number of basic blocks accessed per primary address. (a) Two branch predictions. (b) Three branch predictions.

Unlike the global history scheme, this method independently performs the prediction of multiple branches once the primary branch address is known. Consequently, multiple branches can be predicted in parallel, which allows a faster prediction.

4 Experimental Results

In order to compare the prediction accuracy of our scheme with that of the previous global history scheme, we have performed a comprehensive empirical study for various hardware configurations, considering the implementation costs. We also compare the impact of prediction accuracy on the fetch bandwidth of a superscalar machine.

4.1 Experimental Environment

We use trace-driven simulation with ten programs from the SPEC benchmarks. The four integer programs are eqntott, espresso, xlisp, and gcc. The six floating-point programs are nasa7, doduc, spice2g6, tomcatv, matrix300, and fpppp. These programs are compiled from C and Fortran 77 with compiler optimizations turned on. The tracing system is based on a SPARCstation 2 [10]. In order to obtain the instruction traces of our benchmark programs, a tool called Shadow is used [11]. Each

benchmark is traced for ten million instructions, which are fed into the multiple branch predictor. In order to sample the trace uniformly over the wide execution range of each benchmark, two million instructions are sampled five times within the first fifty million instructions. The Branch Address Cache (BAC) has 1024 entries with a set associativity of four. The Branch History Table (BHT) is likewise 1024-entry and 4-way set associative, using the LRU (least recently used) replacement algorithm. We vary the BHR length from 4 to 14 bits and the number of PHTs from 1 to 256. Our instruction cache is 32K bytes with a block size of 16 bytes; it is 8-way interleaved with a set associativity of two, and the miss penalty is 4 cycles. We assume that a fetch address can access two banks simultaneously. Consequently, a maximum of 16 instructions can be supplied from the instruction cache to the processing unit in each cycle, which is the maximum fetch bandwidth. We compare our multiple branch predictor, MPAp, with the original multiple branch predictor of Yeh and Patt, MGAp [4]. For clarification, MPAp and MGAp will also be called interchangeably the per-primary address history scheme and the global history scheme, respectively.

4.2 The Results

4.2.1 Opportunities for Multiple Branch Prediction

As described above, a maximum of 16 instructions can be fetched from the instruction cache in each cycle. When basic blocks are large, as in most floating-point benchmarks, the chance of performing multiple branch prediction is low. Figure 4 (a) and (b) show the distribution of execution cycles by the number of branches predicted per cycle for two and three branch predictions, respectively.

Figure 4: Branch prediction utilization when two and three basic blocks are fetched. (a) Two branch predictions per cycle. (b) Three branch predictions per cycle.

Zero-branch prediction occurs when we are fetching a long sequential segment of code or when the fetch address misses in the BAC. Floating-point programs have a relatively high frequency of zero-branch prediction because of their extremely long sequential code segments, which are executed repeatedly. Only spice2g6 offers many opportunities for multiple branch prediction, since its average basic block size in the trace is small (4.3 instructions). For the integer benchmarks, the average percentages of cycles with zero, one, and two branch predictions are 4.5%, 50.0%, and 45.6%, respectively, for two branch predictions. For the floating-point benchmarks, the values are 71.1%, 24.2%, and 21.3%, respectively. For three branch predictions, the distribution of cycles for the integer benchmarks is 4.5% (0), 33.3% (1), 33.1% (2), and 29.2% (3), while that for the floating-point benchmarks is 70.6% (0), 17.5% (1), 16.3% (2), and 12.2% (3).
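As a rough illustration of how such a utilization histogram can be derived from a trace, the following sketch (ours, not the paper's methodology) assumes a simple greedy fetch model driven by a list of dynamic basic-block sizes with a 16-instruction fetch limit, and it ignores BAC misses and cache effects.

    from collections import Counter

    def utilization(block_sizes, max_instr=16, max_blocks=2):
        """Tally fetch cycles by the number of branch predictions used.
        Each cycle fetches up to `max_instr` instructions spanning at most
        `max_blocks` basic blocks; a basic block larger than the fetch
        width spills into later cycles that predict zero branches."""
        tallies = Counter()
        remaining = list(block_sizes)     # instructions left in each block
        i = 0
        while i < len(remaining):
            budget = max_instr
            predicted = 0
            while i < len(remaining) and budget > 0 and predicted < max_blocks:
                take = min(budget, remaining[i])
                budget -= take
                remaining[i] -= take
                if remaining[i] == 0:     # reached the block-ending branch,
                    predicted += 1        # so one branch prediction is used
                    i += 1
                else:
                    break                 # fetch width exhausted mid-block
            tallies[predicted] += 1
        return tallies

    # Example: blocks of 5, 3, 40, and 6 instructions with a 2-branch limit
    # yield cycles predicting 2, 0, 0, and 2 branches, respectively.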

4.2.2 Branch Prediction Accuracy

Figure 5 (a) depicts the average prediction accuracy of the integer benchmarks when MPAp and MGAp are used for two branch predictions, as a function of the BHR length and the number of PHTs. The accuracy of MGAp is sensitive to both the BHR length and the number of PHTs. When the BHR length is only 4 bits and a single PHT is employed, the average prediction accuracy is below 78 percent. A 14-bit BHR and 256 PHTs are needed to obtain the maximum prediction accuracy for MGAp. The average prediction accuracy for the integer programs ranges from 77.7% to 96.5% depending on the size of the hardware.

Figure 5: The prediction accuracies of the MPAp and MGAp schemes for two branch predictions. (a) Integer programs. (b) Floating-point programs.

On the other hand, the prediction accuracy of MPAp is not very sensitive to the BHR length and the number of PHTs. Using a 4-bit BHR and a single PHT, we obtain a prediction accuracy of 92%, which outperforms MGAp by more than 14%. The average prediction accuracy for the integer programs ranges from 92.0% to 96.9% depending on the size of the hardware. Figure 5 (b) shows the average prediction accuracy of the floating-point programs with the same

hardware configurations as above. For the floating-point benchmarks, the accuracy curve of MGAp is less sensitive to the BHR length and the number of PHTs than for the integer programs. This is due to the periodic branch behavior of floating-point programs, which makes their branches easier to predict. The average prediction accuracy of MGAp ranges from 87.4% to 95.5%. (Our prediction accuracy for MGAp appears a little lower than the result of Yeh and Patt because we include nasa7, which is missing in theirs; nasa7 has the lowest prediction accuracy and reduces the average accuracy of the floating-point benchmarks from around 98% down to 95%.) In contrast, the average prediction accuracy of MPAp ranges from 94.8% to 95.8%, again less sensitive to the size of the hardware. Figure 6 shows the branch prediction accuracy for three branch predictions, which exhibits a curve similar to the two-branch results in Figure 5. One thing to note is that the prediction accuracy for three branch predictions is lower than for two branch predictions, which is expected due to the additional interference caused by more sharing. However, the difference is much smaller for MPAp than for MGAp, which indicates the stability of our scheme.

4.2.3 Comparison of Prediction Accuracy under the Same Hardware Budget

We compare the prediction accuracy of MGAp and MPAp under the same hardware budget. Table 1 gives the estimated hardware costs (in bits) of MGAp and MPAp as functions of the BHR length and the number of PHTs. Given a hardware budget of 128K bits, the best prediction accuracies of MGAp and MPAp for two branch predictions are obtained with MGAp(8,256) and MPAp(10,16), respectively. Applying the hardware cost function, the cost of MGAp(8,256) is exactly 128 K bits, whereas that of MPAp(10,16) is only 112 K bits.
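To make the budget comparison concrete, the cost formulas of Table 1 can be evaluated directly. The short sketch below is our own, using b = 1024 and s = 4 from the BHT configuration in Section 4.1, and it reproduces the costs quoted in this section.

    def mgap_cost(h, p):
        """Hardware cost of MGAp(h,p) in bits: one h-bit global BHR plus
        p pattern history tables of 2**h two-bit counters (Table 1)."""
        return h + p * 2 ** h * 2

    def mpap_cost(h, p, m, b=1024, s=4):
        """Hardware cost of MPAp(h,p) in bits: the BHT histories of h*m
        bits plus p pattern history tables of 2**h two-bit counters
        (Table 1, with the BHT parameters of Section 4.1)."""
        return b * s * h * m + p * 2 ** h * 2

    KBIT = 1024
    print(mgap_cost(8, 256) / KBIT)        # ~128 Kbits, independent of m
    print(mpap_cost(10, 16, m=2) / KBIT)   # 112 Kbits  (two branches)
    print(mpap_cost(12, 16, m=2) / KBIT)   # 224 Kbits
    print(mpap_cost(8, 16, m=3) / KBIT)    # 104 Kbits  (three branches)
    print(mpap_cost(12, 16, m=3) / KBIT)   # 272 Kbits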

Figure 7 (a) compares these best prediction accuracies of both configurations for each benchmark.

Figure 6: The prediction accuracies of the MPAp and MGAp schemes for three branch predictions. (a) Integer programs. (b) Floating-point programs.

For most benchmarks, MPAp(10,16) outperforms MGAp(8,256). The average prediction accuracy of MGAp(8,256) is 95.1% (integer) and 94.9% (floating-point), whereas that of MPAp(10,16) is 96.1% (integer) and 95.7% (floating-point). Figure 7 (b) depicts the same comparison for a hardware budget of 512K bits, where the best prediction accuracies are obtained by MGAp(10,256) and MPAp(12,16), respectively. The cost of MGAp(10,256) fits the 512 K bits exactly, whereas that of MPAp(12,16) is only 224 K bits. Nevertheless, the latter shows a similar yet higher prediction accuracy: the average prediction accuracy of MGAp(10,256) is 95.9% (integer) and 95.3% (floating-point), whereas that of MPAp(12,16) is 96.4% (integer) and 95.7% (floating-point). It is very encouraging that our proposed scheme outperforms the previous one with only 44% of the hardware cost. The simulation is repeated for three branch predictions, as shown in Figure 8. Given a 128 K bit hardware budget, MGAp(8,256) and MPAp(8,16) are selected. The cost of MGAp(8,256) does not change with the number of predicted branches, whereas the actual hardware cost of

MPAp(8,16) is only 104 K bits. The average prediction accuracies of MGAp(8,256) are 93.6% and 93.8% for the integer and floating-point programs, respectively. For MPAp(8,16), they increase to 94.2% and 95.1%. For the 512 K bit hardware budget, MGAp(10,256) and MPAp(12,16) are simulated. For MGAp(10,256), whose cost is exactly 512 K bits, the average prediction accuracies are 94.4% and 95.1% for the integer and floating-point programs, whereas the respective accuracies of MPAp(12,16) are enhanced to 94.6% and 95.2% at only 272 K bits, i.e., 53% of the hardware cost. Compared with the results for two branch predictions, the prediction accuracies decrease by 0.2 to 1.8 percentage points owing to the increased sharing.

Table 1: Branch predictor configurations and their estimated costs; b is the number of entries in the BHT, s is the set associativity of the BHT, and m is the number of branches (2 or 3) predicted per cycle.

  scheme       BHR length   number of PHTs   hardware cost (bits)
  MGAp(h,p)    h            p                h + p * 2^h * 2
  MPAp(h,p)    h            p                b*s*h*m + p * 2^h * 2

4.2.4 Fetch Bandwidth of MGAp and MPAp

We evaluate how the increased prediction accuracy of MPAp improves the fetch bandwidth of the superscalar machine. We measure IPC_f, the average number of instructions that can be fetched from the instruction cache in each cycle. Figure 9 (a) compares the IPC_f of MGAp(8,256) and MPAp(10,16) for two branch predictions under the 128K bit budget. For the integer benchmarks, the graph indicates that the increased prediction accuracy of MPAp(10,16) obtains a

Figure 7: The prediction accuracies of MGAp and MPAp with the same implementation cost for two branch predictions. (a) The 128K bit implementation. (b) The 512K bit implementation.

better fetch bandwidth than MGAp(8,256), i.e., 7.12 vs. 6.90. For the floating-point benchmarks, the increase is 11.02 vs. 10.90. Figure 9 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16) for two branch predictions under 512K bits, which indicates a similar result. For the integer benchmarks, the fetch bandwidth of MPAp(12,16) compared with MGAp(10,256) is 7.23 vs. 7.06. For the floating-point benchmarks, the increase is 11.03 vs. 10.98. Figure 10 (a) depicts the simulation results of MGAp(8,256) and MPAp(8,16) for three branch predictions. The fetch bandwidth of MPAp(8,16) again outperforms MGAp(8,256), both for the integer (7.53 vs. 7.52) and the floating-point programs (11.44 vs. 11.30). Finally, Figure 10 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16). The fetch bandwidth increases from 7.78 to 7.91 for the integer benchmarks and from 11.41 to 11.48 for the floating-point benchmarks. Three branch predictions obtain a better fetch bandwidth than two branch predictions, although the prediction

Figure 8: The prediction accuracies of MGAp and MPAp with the same implementation cost for three branch predictions. (a) The 128K bit implementation. (b) The 512K bit implementation.

accuracy is lower. This is because more basic blocks can be fetched simultaneously when the predictions are correct, even if the overall prediction accuracy is slightly lower.

5 Summary

We have proposed an enhanced mechanism of multiple branch prediction in which the interference among branches is reduced and the prediction of subsequent branches does not depend on the unresolved prediction of the preceding branch, thus improving the overall prediction accuracy. The experimental results indicate that our scheme can achieve much better prediction accuracy (by as much as 7% to 14% with a 4-bit BHR and a single PHT) than the previous global history scheme of Yeh and Patt. Even when the hardware budget for multiple branch prediction is kept the same, our scheme still achieves higher prediction accuracy at a lower hardware cost (by as much as 2.7% on some benchmarks with only 44% of the hardware cost). Finally, the increased

Figure 9: IPC_f of MGAp and MPAp for two branch predictions. (a) The 128K bit implementation. (b) The 512K bit implementation.

prediction accuracy results in better fetch bandwidth, which is essential for performance enhancement in superscalar processors.

References

[1] T.-Y. Yeh and Y.N. Patt, Two-level adaptive branch prediction, in: Proc. MICRO-24, (1991), 51-61.
[2] T.-Y. Yeh and Y.N. Patt, Alternative implementations of two-level adaptive branch prediction, in: Proc. ISCA '92, (1992), 124-134.
[3] T.-Y. Yeh and Y.N. Patt, A comparison of dynamic branch predictors that use two levels of branch history, in: Proc. ISCA '93, (1993).
[4] T.-Y. Yeh, D.T. Marr, and Y.N. Patt, Increasing the instruction fetch rate via multiple branch prediction and a branch address cache, in: Proc. ICS '93, (1993), 67-76.

Figure 10: IPC_f of MGAp and MPAp for three branch predictions. (a) The 128K bit implementation. (b) The 512K bit implementation.

[5] J.E. Smith, A study of branch prediction strategies, in: Proc. ISCA '81, (1981), 135-148.
[6] J.K.L. Lee and A.J. Smith, Branch prediction strategies and branch target buffer design, IEEE Computer, 17 (1984), 6-22.
[7] S. McFarling and J. Hennessy, Reducing the cost of branches, in: Proc. ISCA '86, (1986), 396-403.
[8] K. So, S.-T. Pan and J.T. Rahmeh, Improving the accuracy of dynamic branch prediction using branch correlation, in: Proc. ASPLOS-5, (1992), 76-84.
[9] S. Dutta and M. Franklin, Control flow prediction with tree-like subgraphs for superscalar processors, in: Proc. ISCA '95, (1995), 258-263.
[10] Sun Microsystems, The SPARC Architecture Manual, (Prentice-Hall, 1992).
[11] Sun Microsystems, Introduction to SHADOW, (Sun Microsystems, 1989).