

An Enhanced Two-Level Adaptive Multiple Branch Prediction for Superscalar Processors

Jong-bok Lee, Soo-Mook Moon and Wonyong Sung
School of Electrical Engineering, Seoul National University
San 56-1 ShinLim-Dong, KwanAk-Gu, Seoul, Korea

Abstract

This paper proposes an enhanced method of multiple branch prediction using a per-primary branch history table. This scheme improves on previous ones based on a single global branch history register by reducing the interference among the histories of different branches caused by sharing a single register. It also keeps the prediction of one branch from affecting the prediction of the other branches predicted in the same cycle, so that multiple branches can be predicted independently and in parallel. Our experimental results indicate that these features achieve higher prediction accuracy than the previous global history scheme (which is already high) at a lower hardware cost (i.e., 96.1% vs. 95.1% for integer code and 95.7% vs. 94.9% for floating-point code including nasa7, for a given hardware budget of 128K bits). Moreover, the increased prediction accuracy improves the fetch bandwidth of a superscalar machine (i.e., 7.1 vs. 6.9 instructions per clock cycle for integer code and 11.0 vs. [...] instructions per cycle for floating-point code).

1 Introduction

In order to increase the fetch bandwidth of a superscalar processor, we need to predict more than one branch and to fetch multiple non-consecutive basic blocks in a single cycle. Yeh and Patt developed a two-level adaptive branch prediction scheme [1, 2, 3] and extended it to predict multiple branches per cycle [4]. Several variations of the scheme were introduced in [4], yet all of them use a common global history register, so that all branches share the same space for storing their prediction history. An obvious problem with the global history register is the interference between different branches caused by this sharing. Moreover, the branches that must be predicted simultaneously depend on one another's predictions, which lowers the prediction accuracy as the number of simultaneously predicted branches increases.

In order to overcome these shortcomings, this paper proposes an enhanced two-level adaptive multiple branch prediction using a per-primary address branch history table. In this scheme, only those branches that are predicted simultaneously share the same space, which reduces interference. Moreover, the branches that are predicted in the same cycle are not constrained by any dependences and are therefore predicted independently. We propose several hardware configurations for the per-primary address branch history scheme and compare them with the previous ones through simulation. The performance is evaluated by an empirical study on a subset of the SPEC benchmark suite using trace-driven simulation. Our results indicate that the proposed scheme improves the branch prediction accuracy, and hence the fetch bandwidth of superscalar processors, under the same hardware budget.

The rest of this paper is organized as follows. Section 2 briefly reviews two-level adaptive branch prediction and the previous multiple branch prediction schemes. Section 3 describes the per-primary address history scheme. Section 4 presents the simulation environment and results.
Finally, a summary follows in Section 5.

2 Previous Two-Level Adaptive Multiple Branch Prediction

Many branch prediction schemes that utilize the run-time execution history have been proposed [5, 6, 7], yet two-level adaptive branch prediction is known to obtain the highest prediction accuracy [1, 2, 8]. Two-level adaptive branch prediction uses two major data structures, the branch history register (BHR) and the pattern history table (PHT), as shown in Figure 1 (a).

[Figure 1: Two-level adaptive multiple branch prediction. (a) The basic structure. (b) The multiple branch prediction.]

The BHR records the history of taken and not-taken outcomes of branches. For each possible pattern in the BHR, a pattern history is recorded in the PHT. When the BHR contains k bits to record the history of the last k branches, there are 2^k possible patterns in the BHR. Hence, the PHT has 2^k entries, each of which contains a 2-bit up-down saturating counter that records the execution history of the corresponding pattern in the BHR. The counter is incremented when the branch is taken; otherwise, it is decremented. The prediction is made by interpreting the pattern history bits, as in Figure 1 (a).

The method of multiple branch prediction proposed by Yeh and Patt employs a simple extension of the above scheme [4]. The extended global history scheme predicts the immediately following branch and extrapolates the predictions of subsequent branches. As shown in Figure 1 (b), all k bits in the history register are used to index the PHT to make the primary branch prediction. To predict the secondary branch, the right-most k-1 branch history bits are used to index the PHT. Since not all k bits are used as an index, the k-1 bits address 2 adjacent entries in the PHT. The primary branch prediction then selects one of the two entries to make the secondary branch prediction. Similarly, the tertiary prediction uses the right-most k-2 history register bits to address the PHT and accesses 4 adjacent entries. The primary and secondary predictions select one of the 4 entries for the tertiary branch prediction.

Since a global BHR and a global PHT are employed in this scheme, it is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and a Global Pattern History Table (MGAg). When multiple PHTs are employed, one for each primary branch address, it is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and Per-Primary Address Pattern History Tables (MGAp).

MGAp has several disadvantages. As with MGAg, the prediction of a branch suffers interference from the history of other branches because a single global history register is used. There is also a problem in the prediction mechanism for multiple branches. For example, when two branches are predicted, the prediction of the secondary branch is based on the as yet unresolved prediction of the primary branch; if the primary branch is mispredicted, the secondary branch may be mispredicted as well.
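
The global history lookup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of Yeh and Patt: the class and method names are hypothetical, the newest outcome is assumed to be shifted into the least significant bit of the BHR, the counters are assumed to start in the weakly not-taken state, and only a single global PHT (the MGAg case) is modelled.

```python
class GlobalTwoLevelPredictor:
    """Global BHR plus one global PHT of 2-bit counters (MGAg-style sketch)."""

    def __init__(self, k):
        self.k = k
        self.mask = (1 << k) - 1
        self.bhr = 0                  # last k outcomes, newest in the LSB (assumed)
        self.pht = [1] * (1 << k)     # 2-bit counters, weakly not-taken (assumed)

    def predict_two(self):
        """Return (primary, secondary) predictions for one cycle."""
        primary = self.pht[self.bhr] >= 2
        # The right-most k-1 history bits address a pair of adjacent entries;
        # the primary prediction selects one entry of that pair.
        pair_base = (self.bhr << 1) & self.mask
        secondary = self.pht[pair_base | int(primary)] >= 2
        return primary, secondary

    def update(self, taken):
        """Train the counter of the current pattern, then shift in the outcome."""
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(3, c + 1) if taken else max(0, c - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


p = GlobalTwoLevelPredictor(k=2)
for taken in [True, False] * 3:       # train on an alternating branch
    p.update(taken)
print(p.predict_two())                # (True, False)
```

The secondary lookup cannot complete until the primary prediction is available, because that prediction supplies the last index bit.
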
The dependence of the secondary prediction on the primary one forces the prediction values of multiple branches to be generated sequentially, which might affect the cycle time, since the table lookup for prediction already requires a considerable amount of time.

Another method of multiple branch prediction has been proposed by Dutta and Franklin, in which a tree-like subgraph of the control flow graph is employed [9]. In this scheme, multiple branches are predicted indirectly by predicting a path in the subgraph. Its advantage is that multiple branch predictions are performed in a cycle without determining the addresses of these branches. However, instead of the condensed history of all branches, the subgraph history pattern must be stored. Unfortunately, the reported branch prediction accuracy is not higher than that of the multiple branch predictor proposed by Yeh and Patt, which employs a 2-bit saturating up-down counter for each PHT entry.

3 Multiple Branch Prediction Using a Per-Primary Address Branch History Table

Our scheme of multiple branch prediction is simple and straightforward. In order to reduce interference in the first level of branch histories, one history register is provided for each distinct primary branch, which introduces the Branch History Table (BHT). For two branch predictions with a branch history length of k bits, each entry of the BHT is a BHR of 2k bits. In addition, separate PHTs are employed for each primary branch address. Each entry of the new PHT contains a single 2-bit saturating up-down counter, like the original PHT. Both primary and secondary branches are predicted by accessing the BHT and the PHT using the primary branch address. When two branches are predicted each cycle, the first k bits and the second k bits of the history register are used to index the PHT separately for the prediction of the primary and the secondary branch, respectively. When three branches are predicted each cycle, each entry of the BHT is composed of 3k bits, and the last k bits are used for the prediction of the tertiary branch. Only the primary branch address is used for accessing the BHT and the PHT, since the addresses of the secondary and the tertiary branch are not known at the time of prediction. This multiple predictor is referred to as Two-Level Adaptive Multiple Branch Prediction Using a Per-Primary Address Branch History Table and Per-Primary Address Pattern History Tables (MPAp). Figure 2 depicts the prediction mechanism of the MPAp scheme for two branch predictions.

[Figure 2: The MPAp scheme for two branch predictions.]

In the previous global history scheme, the prediction of a branch is affected by the history of other branches, since all branch predictions are based on a single global history register. In our scheme, however, only two secondary branches share a single BHR associated with each primary branch address in the case of two branch predictions. In the case of three branch predictions, the two secondary branches and the four tertiary branches share it. Figure 3 (a) and (b) show which basic blocks share the same BHR when two and three branches are predicted, respectively. Although there is still interference among those branches that share the same BHR in the same cycle, it is much smaller than in the global history scheme.

[Figure 3: Number of accessing basic blocks per each primary address. (a) two branch predictions. (b) three branch predictions.]

Another advantage of this scheme is that it does not cause dependences among simultaneously predicted branches. Unlike the global history scheme, this method performs the prediction of multiple branches independently once the primary branch address is known. Consequently, multiple branches can be predicted in parallel, which allows a faster prediction.

4 Experimental Results

In order to compare the prediction accuracy of our scheme with that of the previous global history scheme, we have performed a comprehensive empirical study for various hardware configurations, considering the implementation costs. We also compare the impact of prediction accuracy on the fetch bandwidth of a superscalar machine.

4.1 Experimental Environment

We use trace-driven simulation with ten programs from the SPEC benchmarks. The four integer programs are eqntott, espresso, xlisp, and gcc. The six floating-point programs are nasa7, doduc, spice2g6, tomcatv, matrix300, and fpppp. These programs are compiled by C and Fortran 77 compilers with optimizations turned on. The tracing system is based on a SPARCstation 2 [10]. In order to obtain the instruction traces of our benchmark programs, a tool called Shadow is used [11]. Each benchmark is traced for ten million instructions, which are fed into the multiple branch predictor. In order to obtain the trace uniformly from a wide execution range of each benchmark, two million instructions are sampled five times, tracing up to the first fifty million instructions.

The Branch Address Cache (BAC) has 1024 entries with a set associativity of four. The Branch History Table (BHT) is also 1024-entry and 4-way set associative, using the LRU (Least-Recently-Used) algorithm for replacement. We vary the BHR length from 4 to 14 bits and the number of PHTs from 1 to 256. The size of our instruction cache is 32K bytes with a block size of 16 bytes, and it is 8-way interleaved with a set associativity of two. The miss penalty is 4 cycles. We assume that a fetch address can access two banks simultaneously. Consequently, a maximum of 16 instructions can be supplied from the instruction cache to the processing unit in each cycle, which is the maximum fetch bandwidth.

We compare our multiple branch predictor, MPAp, with the original multiple branch predictor MGAp of Yeh and Patt [4]. For clarification, MPAp and MGAp will also be called the per-primary address history scheme and the global history scheme, respectively.

4.2 The Results

4.2.1 Opportunities for Multiple Branch Prediction

As described above, a maximum of 16 instructions can be fetched from the instruction cache in each cycle. When basic blocks are large, as in most floating-point benchmarks, the chance of performing multiple branch prediction is low. Figure 4 (a) and (b) show the distribution of execution cycles by the number of branches predicted in each cycle for two and three branch predictions, respectively.

[Figure 4: Branch prediction utilization when two and three basic blocks are fetched. (a) two branch predictions per cycle. (b) three branch predictions per cycle. Horizontal axis: benchmarks (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: branch prediction utilization [%].]

Zero-branch prediction occurs when a long sequential segment of code is being fetched or when the fetch address misses in the BAC. Floating-point programs have a relatively high frequency of zero-branch prediction due to their extremely long sequential code segments, which are executed repeatedly. Only spice2g6 offers many opportunities for multiple branch prediction, since its average basic block size in the trace is small (i.e., 4.3 instructions). For integer benchmarks with two branch predictions, the average percentages of cycles with zero, one, and two branch predictions are 4.5%, 50.0%, and 45.6%, respectively. For floating-point benchmarks, the values are 71.1%, 24.2%, and 21.3%, respectively. For three branch predictions, the distribution of cycles for integer benchmarks is 4.5% (0), 33.3% (1), 33.1% (2), and 29.2% (3), while that of floating-point benchmarks is 70.6% (0), 17.5% (1), 16.3% (2), and 12.2% (3).
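
Before turning to the accuracy results, the MPAp lookup of Section 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator used in the paper: the class and method names are hypothetical, the BHT is an unbounded dictionary rather than a 1024-entry 4-way LRU table, the PHT is selected by the primary address modulo the number of PHTs, and each k-bit field of the BHR is assumed to record the outcomes of the branch in its own position.

```python
class MPApPredictor:
    """Per-primary-address BHT and PHTs (MPAp-style sketch)."""

    def __init__(self, k, num_phts, m=2):
        self.m = m                                # branches predicted per cycle
        self.num_phts = num_phts
        self.mask = (1 << k) - 1
        self.bht = {}                             # primary address -> m k-bit fields
        self.phts = [[1] * (1 << k) for _ in range(num_phts)]

    def _fields(self, addr):
        # One BHR of m*k bits per primary address, kept as m separate fields.
        return self.bht.setdefault(addr, [0] * self.m)

    def predict(self, addr):
        """Predict all m branches using only the primary branch address."""
        pht = self.phts[addr % self.num_phts]     # PHT selection is an assumption
        # Each field indexes the PHT on its own: no prediction waits for another.
        return [pht[field] >= 2 for field in self._fields(addr)]

    def update(self, addr, outcomes):
        """Train with the resolved outcomes (primary first) of one fetch."""
        pht = self.phts[addr % self.num_phts]
        fields = self._fields(addr)
        for i, taken in enumerate(outcomes):
            c = pht[fields[i]]
            pht[fields[i]] = min(3, c + 1) if taken else max(0, c - 1)
            fields[i] = ((fields[i] << 1) | int(taken)) & self.mask


p = MPApPredictor(k=2, num_phts=1)
for _ in range(4):                                # primary taken, secondary not taken
    p.update(0x40, [True, False])
print(p.predict(0x40))                            # [True, False]
```

Unlike the global history lookup, the list comprehension in `predict` has no data dependence between its elements, which is the property that allows the lookups to proceed in parallel.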

4.2.2 Branch Prediction Accuracy

Figure 5 (a) depicts the average prediction accuracy of integer benchmarks when MPAp and MGAp are used for two branch predictions, as a function of the BHR length and the number of PHTs. The accuracy of MGAp is sensitive to both. When the BHR is only 4 bits long and a single PHT is employed, the average prediction accuracy is below 78 percent. A 14-bit BHR and 256 PHTs are needed to obtain the maximum prediction accuracy for MGAp. The average prediction accuracy for integer programs ranges from 77.7% to 96.5% depending on the size of the hardware.

[Figure 5: The prediction accuracies of the MPAp and MGAp schemes for two branch predictions. (a) integer programs. (b) floating-point programs. Horizontal axis: BHR length [bits]; vertical axis: prediction accuracy [%]; curves: MPAp and MGAp for several numbers of PHTs.]

On the other hand, the prediction accuracy of MPAp is much less sensitive to the BHR length and the number of PHTs. Using a 4-bit BHR and a single PHT, we obtain a prediction accuracy of 92%, which outperforms MGAp by more than 14 percentage points. The average prediction accuracy for integer programs ranges from 92.0% to 96.9% depending on the size of the hardware.

Figure 5 (b) shows the average prediction accuracy of floating-point programs with the same hardware configurations as above. For floating-point benchmarks, the accuracy curve of MGAp is less sensitive to the BHR length and the number of PHTs than for integer programs. This is due to the periodic branch behavior of floating-point programs, which makes their branches easier to predict. The average prediction accuracy of MGAp ranges from 87.4% to 95.5% (see footnote 1), whereas that of MPAp ranges from 94.8% to 95.8%, again less sensitive to the size of the hardware.

Figure 6 shows the branch prediction accuracy for three branch predictions, which exhibits curves similar to those for two branch predictions in Figure 5. One thing to note is that the prediction accuracy of three branch predictions is lower than that of two branch predictions, as expected, due to the additional interference caused by more sharing. However, the difference is much smaller for MPAp than for MGAp, which indicates the stability of our scheme.

4.2.3 Comparison of Prediction Accuracy under the Same Hardware Budget

We compare the prediction accuracy of MGAp and MPAp under the same hardware budget. Table 1 gives the estimated hardware costs (number of bits) for MGAp and MPAp as a function of the BHR length and the number of PHTs. Given a hardware budget of 128K bits, the best prediction accuracy of MGAp and MPAp for two branch predictions is obtained with MGAp(8,256) and MPAp(10,16), respectively. Applying the hardware cost function, the cost of MGAp(8,256) is 128K bits, whereas that of MPAp(10,16) is only 112K bits. Figure 7 (a) compares these best prediction accuracies.

Footnote 1: Our prediction accuracy of MGAp appears to be a little lower than the result of Yeh and Patt because we include the result of nasa7, which is missing in theirs. The prediction accuracy of nasa7 is the lowest and reduces the average accuracy of floating-point benchmarks from around 98% down to 95%.
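
The costs quoted in this subsection can be checked against the cost functions of Table 1, read here as h + p * 2^h * 2 bits for MGAp(h,p) and b * s * h * m + p * 2^h * 2 bits for MPAp(h,p). The sketch below is illustrative only: the function names are hypothetical, "K" is taken to mean 1024, and b * s = 1024 * 4 = 4096 history registers is an assumption, chosen because it matches the 1024-entry, 4-way BHT of Section 4.1 (if b counts sets) and reproduces the 112K-bit figure.

```python
K = 1024  # assumed meaning of "K" in "128K bits"


def mgap_cost(h, p):
    """Bits for MGAp(h, p): one h-bit BHR plus p PHTs of 2**h two-bit counters."""
    return h + p * (2 ** h) * 2


def mpap_cost(h, p, m, b=1024, s=4):
    """Bits for MPAp(h, p): b*s BHRs of h*m bits plus p PHTs of 2**h two-bit counters.

    b=1024 and s=4 are assumptions (see the lead-in), not values given with Table 1.
    """
    return b * s * h * m + p * (2 ** h) * 2


for name, bits in [
    ("MGAp(8,256)", mgap_cost(8, 256)),
    ("MPAp(10,16), m=2", mpap_cost(10, 16, m=2)),
]:
    print(f"{name}: {bits} bits = {bits / K:.2f}K bits")
```

Under these assumptions the PHTs of MGAp(8,256) take 128K bits and the single 8-bit history register adds 8 more, while MPAp(10,16) comes to 112K bits.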

[Figure 6: The prediction accuracies of the MPAp and MGAp schemes for three branch predictions. (a) integer programs. (b) floating-point programs. Horizontal axis: BHR length [bits]; vertical axis: prediction accuracy [%]; curves: MPAp and MGAp with 1, 16, and 256 PHTs.]

In Figure 7 (a), which plots both configurations for each benchmark, MPAp(10,16) outperforms MGAp(8,256) for most benchmarks. The average prediction accuracy of MGAp(8,256) is 95.1% (integer) and 94.9% (floating-point), whereas that of MPAp(10,16) is 96.1% (integer) and 95.7% (floating-point).

Figure 7 (b) depicts the same graph for a hardware budget of 512K bits, where the best prediction accuracy is obtained by MGAp(10,256) and MPAp(12,16), respectively. The cost of MGAp(10,256) fills the 512K-bit budget, whereas that of MPAp(12,16) is only 224K bits. Nevertheless, the latter shows a similar yet higher prediction accuracy. The average prediction accuracy of MGAp(10,256) is 95.9% (integer) and 95.3% (floating-point), whereas that of MPAp(12,16) is 96.4% (integer) and 95.7% (floating-point). It is very encouraging that our proposed scheme outperforms the previous one with only 44% of the hardware cost.

[Figure 7: The prediction accuracies of MGAp and MPAp with the same implementation cost for two branch predictions. (a) The implementation of 128K bits: MGAp(8,256) and MPAp(10,16). (b) The implementation of 512K bits: MGAp(10,256) and MPAp(12,16). Horizontal axis: benchmark programs; vertical axis: prediction accuracy [%].]

Table 1: Branch predictor configurations and their estimated costs; b is the number of entries in the BHT; s is the set associativity of the BHT; m is the number of branches (2 or 3) predicted per cycle.

scheme name    BHR length    number of PHTs    hardware cost
MGAp(h,p)      h             p                 h + p * 2^h * 2
MPAp(h,p)      h             p                 b * s * h * m + p * 2^h * 2

The simulation is repeated for three branch predictions, as shown in Figure 8. Given a hardware budget of 128K bits, MGAp(8,256) and MPAp(8,16) are selected. The cost of MGAp(8,256) does not change with the number of predicted branches, whereas the actual hardware cost of MPAp(8,16) is only 104K bits. The average prediction accuracies of MGAp(8,256) are 93.6% and 93.8% for integer and floating-point programs, respectively. For MPAp(8,16), they increase to 94.2% and 95.1%. For the hardware budget of 512K bits, MGAp(10,256) and MPAp(12,16) are simulated. For MGAp(10,256), whose cost fills the 512K-bit budget, the average prediction accuracies are 94.4% and 95.1% for integer and floating-point programs. The respective accuracies of MPAp(12,16) rise to 94.6% and 95.2% with only 272K bits, that is, 53% of the hardware cost. Compared with the results for two branch predictions, the prediction accuracies decrease by 0.2 to 1.8 percentage points because of the increased sharing.

[Figure 8: The prediction accuracies of MGAp and MPAp with the same implementation cost for three branch predictions. (a) The implementation of 128K bits: MGAp(8,256) and MPAp(8,16). (b) The implementation of 512K bits: MGAp(10,256) and MPAp(12,16). Horizontal axis: benchmark programs; vertical axis: prediction accuracy [%].]

4.2.4 Fetch Bandwidth of MGAp and MPAp

We evaluate how the increased prediction accuracy of MPAp increases the fetch bandwidth of the superscalar machine. We measure IPC_f, the average number of instructions that can be fetched from the instruction cache in each cycle. Figure 9 (a) compares the IPC_f of MGAp(8,256) and MPAp(10,16) for two branch predictions under the budget of 128K bits. For integer benchmarks, the graph indicates that the increased prediction accuracy of MPAp(10,16) yields a better fetch bandwidth than MGAp(8,256), i.e., 7.12 vs. [...]. For floating-point benchmarks, the increase is [...] vs. [...]. Figure 9 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16) for two branch predictions under 512K bits, which indicates a similar result. For integer benchmarks, the fetch bandwidth of MPAp(12,16) compared with MGAp(10,256) is 7.23 vs. [...]. For floating-point benchmarks, the increase is [...] vs. [...].

[Figure 9: IPC_f of MGAp and MPAp for two branch predictions. (a) The implementation of 128K bits: MGAp(8,256) and MPAp(10,16). (b) The implementation of 512K bits: MGAp(10,256) and MPAp(12,16). Horizontal axis: benchmarks; vertical axis: instructions per fetch.]

Figure 10 (a) depicts the simulation results of MGAp(8,256) and MPAp(8,16) for three branch predictions. The fetch bandwidth of MPAp(8,16) again exceeds that of MGAp(8,256), both for integer (7.53 vs. 7.52) and floating-point programs (11.44 vs. [...]). Finally, Figure 10 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16). The fetch bandwidth increases from 7.78 to 7.91 for integer benchmarks, and from [...] to [...] for floating-point benchmarks. Three branch predictions obtain a better fetch bandwidth than two branch predictions, although their prediction accuracy is lower. This is because more basic blocks can be fetched simultaneously when the predictions are correct, even if the overall prediction accuracy is slightly lower.

[Figure 10: IPC_f of MGAp and MPAp for three branch predictions. (a) The implementation of 128K bits: MGAp(8,256) and MPAp(8,16). (b) The implementation of 512K bits: MGAp(10,256) and MPAp(12,16). Horizontal axis: benchmarks; vertical axis: instructions per fetch.]

5 Summary

We have proposed an enhanced mechanism of multiple branch prediction in which the interference among branches is reduced and the prediction of subsequent branches does not depend on the unresolved prediction of the preceding branch, thus improving the overall prediction accuracy. The experimental results indicate that our scheme can achieve a much better prediction accuracy (i.e., as much as 7% to 14% with a 4-bit BHR and a single PHT) than the previous global history scheme of Yeh and Patt. Even when the hardware budget for multiple branch prediction is kept the same, our scheme still achieves a higher prediction accuracy with a lower hardware cost (i.e., as much as 2.7% in some benchmarks with only 44% of the hardware cost). Finally, the increased prediction accuracy results in better fetch bandwidth, which is essential for performance enhancement in superscalar processors.

References

[1] T.-Y. Yeh and Y.N. Patt, Two-level adaptive branch prediction, in: Proc. Micro-24, (1991), 51-61.
[2] T.-Y. Yeh and Y.N. Patt, Alternative implementations of two-level adaptive branch prediction, in: Proc. ISCA '92, (1992), 124-134.
[3] T.-Y. Yeh and Y.N. Patt, A comparison of dynamic branch predictors that use two levels of branch history, in: Proc. ISCA '93, (1993).
[4] T.-Y. Yeh, D.T. Marr, and Y.N. Patt, Increasing the instruction fetch rate via multiple branch prediction and a branch address cache, in: ICS '93, (1993), 67-76.
[5] J.E. Smith, A study of branch prediction strategies, in: Proc. ISCA '81, (1981), 135-148.
[6] J.K.L. Lee and A.J. Smith, Branch prediction strategies and branch target buffer design, IEEE Computer, 17(1984), 6-22.
[7] S. McFarling and J. Hennessy, Reducing the cost of branches, in: Proc. ISCA '86, (1986), 396-403.
[8] K. So, S.-T. Pan and J.T. Rahmeh, Improving the accuracy of dynamic branch prediction using branch correlation, in: Proc. ASPLOS-5, (1992), 76-84.
[9] S. Dutta and M. Franklin, Control flow prediction with tree-like subgraphs for superscalar processors, in: Proc. ISCA '95, (1995), 258-263.
[10] Sun Microsystems, The SPARC Architecture Manual, (Prentice-Hall, 1992).
[11] Sun Microsystems, Introduction to SHADOW, (Sun Microsystems, 1989).
