An Enhanced Two-Level Adaptive Multiple Branch Prediction for Superscalar Processors

Jong-bok Lee, Soo-Mook Moon and Wonyong Sung
School of Electrical Engineering, Seoul National University
San 56-1 ShinLim-Dong, KwanAk-Gu, Seoul, Korea

Abstract

This paper proposes an enhanced method of multiple branch prediction using a per-primary branch history table. This scheme improves on previous ones based on a single global branch history register by reducing the interference among the histories of different branches that is caused by sharing a single register. The scheme also keeps the prediction of a branch from affecting the prediction of other branches that are predicted in the same cycle, thus allowing independent and parallel prediction of multiple branches. Our experimental results indicate that these features help to achieve higher prediction accuracy than that of the previous global history scheme (which is already high) at a lower hardware cost (i.e., 96.1% vs. 95.1% for integer code and 95.7% vs. 94.9% for floating-point code including nasa7, for a given hardware budget of 128K bits). Moreover, the increased prediction accuracy yields a higher fetch bandwidth for a superscalar machine (i.e., 7.1 vs. 6.9 instructions per clock cycle for integer code and 11.0 vs. [...] instructions per cycle for floating-point code).

1 Introduction

In order to increase the fetch bandwidth of a superscalar processor, we need to predict more than one branch and to fetch multiple non-consecutive basic blocks in a single cycle. Yeh and Patt developed a two-level adaptive branch prediction scheme [1, 2, 3] and extended it to predict multiple branches per cycle [4]. Several variations of the scheme were introduced in [4], yet all of them use a common global history register, which forces every branch to share the same space for storing its prediction history. An obvious problem with the global history register is the interference between different branches caused by this sharing. Moreover, the multiple branches that need to be predicted simultaneously have a dependence in their prediction, which lowers the prediction accuracy as the number of simultaneously predicted branches increases.

In order to overcome these shortcomings, this paper proposes an enhanced two-level adaptive multiple branch prediction using a per-primary address branch history table. In this scheme, only those branches that are predicted simultaneously share the same space, which reduces interference. Moreover, the multiple branches that are predicted in the same cycle are not constrained by any dependences and are therefore predicted independently. We propose several hardware configurations for the per-primary address branch history scheme and compare them with the previous ones through simulation. The performance is evaluated by an empirical study on a subset of the SPEC benchmark suite using trace-driven simulation. Our results indicate that the proposed scheme improves the branch prediction accuracy, and hence the fetch bandwidth of superscalar processors, under the same hardware budget.

The rest of this paper is organized as follows. Section 2 briefly reviews two-level adaptive branch prediction and the previous multiple branch prediction schemes. Section 3 describes the per-primary address history scheme. Section 4 presents the simulation environment and results. Finally, a summary follows in Section 5.

2 Previous Two-Level Adaptive Multiple Branch Prediction

Many branch prediction schemes that utilize the run-time execution history have been proposed [5, 6, 7], yet two-level adaptive branch prediction is known to obtain the highest prediction accuracy [1, 2, 8]. Two-level adaptive branch prediction uses two major data structures, the branch history register (BHR) and the pattern history table (PHT), as shown in Figure 1 (a).

[Figure 1: Two-level adaptive multiple branch prediction. (a) The basic structure. (b) The multiple branch prediction.]

The BHR is used to record the history of taken and not-taken outcomes of branches. For each possible pattern in the BHR, a pattern history is recorded in the PHT. When the BHR contains k bits to record the history of the last k branches, there are 2^k possible patterns in the BHR. Hence, the PHT has 2^k entries, each of which contains a 2-bit up-down saturating counter that records the execution history of the corresponding pattern in the BHR. The counter is incremented when the result of a branch is taken; otherwise, the counter is decremented. The branch prediction is made by interpreting both pattern history bits, as in Figure 1 (a).

The method of multiple branch prediction proposed by Yeh and Patt employs a simple extension of the above scheme [4]. The extended global history scheme makes the prediction of the immediately following branch and extrapolates the predictions of subsequent branches. As shown in Figure 1 (b), all k bits in the history register are used to index into the PHT to make the primary branch prediction. To predict the secondary branch, the right-most k-1 branch history bits are used to index into the PHT. Since not all k bits are used as an index, the k-1 bits address 2 adjacent entries in the PHT. The primary branch prediction is then used to select one of these entries to make the secondary branch prediction. Similarly, the tertiary prediction uses the right-most k-2 history register bits to address the PHT and accesses 4 adjacent entries. The primary and secondary predictions are used to select one of the 4 entries for the tertiary branch path prediction.

Since a global BHR and a global PHT are employed in this scheme, it is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and a Global Pattern History Table (MGAg). When multiple PHTs are employed, one for each primary branch address, it is called Two-Level Adaptive Multiple Branch Prediction Using a Global History Register and Per-Primary Address Pattern History Tables (MGAp).

The MGAp has several disadvantages. As with MGAg, the prediction of a branch suffers interference from the history of other branches due to the use of a single global history register. There is also a problem in the prediction mechanism for multiple branches. For example, when two branches are predicted, the prediction of the secondary branch is based on the yet unresolved prediction value of the primary branch; if the primary branch is mispredicted, the secondary branch can also be mispredicted.
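The lookup described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [4]: the class and method names are hypothetical, a single global PHT is used (as in MGAg), counter values 2 and 3 are assumed to mean "predict taken", counters are assumed to start at 2, and the history register is assumed to be updated with the resolved outcome of each branch.

```python
class MGAgSketch:
    """Toy two-branch predictor: one global history register, one global PHT."""

    def __init__(self, k, initial_counter=2):
        self.k = k
        self.mask = (1 << k) - 1
        self.ghr = 0                              # k-bit history, newest outcome in bit 0
        self.pht = [initial_counter] * (1 << k)   # 2-bit saturating counters (0..3)

    def _predict_taken(self, index):
        # Assumed interpretation: counter values 2 and 3 mean "predict taken".
        return self.pht[index] >= 2

    def secondary_index(self, primary_taken):
        # The right-most k-1 history bits address two adjacent PHT entries;
        # the primary prediction selects one of them.
        return ((self.ghr << 1) & self.mask) | int(primary_taken)

    def predict_two(self):
        primary = self._predict_taken(self.ghr)   # all k bits index the PHT
        secondary = self._predict_taken(self.secondary_index(primary))
        return primary, secondary

    def update(self, taken):
        # Train the counter selected by the current history, then shift in the outcome.
        counter = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


predictor = MGAgSketch(k=4)
for _ in range(20):
    predictor.update(True)        # a branch that is always taken
print(predictor.predict_two())    # (True, True)
```

Note that the final select of the secondary prediction takes the primary prediction as an input; this is the dependence between the two predictions.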
The dependence of the secondary prediction on the primary one forces the prediction values of multiple branches to be generated sequentially, which might affect the cycle time, since the table lookup for prediction already requires a considerable amount of time.

Another method of multiple branch prediction has been proposed by Dutta and Franklin, in which a tree-like subgraph of the control flow graph is employed [9]. In this scheme, multiple branches are predicted indirectly by predicting a path in the subgraph. Its advantage is that multiple branch predictions are performed in a cycle without determining the addresses of these branches. However, instead of the condensed history of all branches, the subgraph history pattern must be stored. Unfortunately, the reported branch prediction accuracy is not higher than that of Yeh's proposed multiple branch predictor, which employs a 2-bit saturating up-down counter for each PHT entry.

3 Multiple Branch Prediction Using a Per-Primary Address Branch History Table

Our scheme of multiple branch prediction is simple and straightforward. In order to reduce interference in the first level of branch histories, one history register is provided for each distinct primary branch, which introduces the Branch History Table (BHT). For two branch predictions where the branch history length is k bits, each entry of the BHT is composed of a BHR with a length of 2k bits. In addition, separate PHTs are employed for each primary branch address. Each entry of the new PHT contains a single 2-bit saturating up-down counter, like the original PHT. Both the primary and the secondary branch are predicted by accessing the BHT and the PHT using the primary branch address. When two branches are predicted each cycle, the first k bits and the second k bits of the history register are used separately to index into the PHT for the prediction of the primary and the secondary branch, respectively. When three branches are predicted each cycle, each entry of the BHT is composed of 3k bits and the last k bits are used for the prediction of the tertiary branch. Only the primary branch address is used for accessing the BHT and the PHT, since the addresses of the secondary and the tertiary branch are not known at the time of prediction. This multiple predictor is referred to as Two-Level Adaptive Multiple Branch Prediction Using a Per-Primary Address Branch History Table and Per-Primary Address Pattern History Tables (MPAp). Figure 2 depicts the prediction mechanism of the MPAp scheme for two branch predictions.

[Figure 2: The MPAp scheme for two branch predictions.]

In the previous global history scheme, the prediction of a branch is affected by the history of other branches, since all branch predictions are based on a single global history register. In our scheme, however, only the two secondary branches share a single BHR associated with each primary branch address in the case of two branch predictions. In the case of three branch predictions, the BHR is shared by the two secondary branches and the four tertiary branches. Figure 3 (a) and (b) show which basic blocks share the same BHR when two and three branches are predicted, respectively. Although there is still interference among the branches that share the same BHR in the same cycle, it is much smaller than in the global history scheme.

[Figure 3: Number of accessing basic blocks per each primary address. (a) two branch predictions (b) three branch predictions]

Another advantage of this scheme is that it does not cause dependences among the simultaneously predicted branches. Unlike the global history scheme, this method performs the prediction of multiple branches independently once the primary branch address is known. Consequently, multiple branches can be predicted in parallel, which allows a faster prediction.

4 Experimental Results

In order to compare the prediction accuracy of our scheme with that of the previous global history scheme, we have performed a comprehensive empirical study for various hardware configurations, taking the implementation costs into account. We also compare the impact of prediction accuracy on the fetch bandwidth of a superscalar machine.

4.1 Experimental Environment

We use trace-driven simulation with ten programs from the SPEC benchmarks. The four integer programs are eqntott, espresso, xlisp, and gcc. The six floating-point programs are nasa7, doduc, spice2g6, tomcatv, matrix300, and fpppp. These programs are compiled with C and Fortran 77 compilers with the compiler optimizations turned on. The tracing system is based on a SPARCstation 2 [10]. In order to obtain the instruction traces of our benchmark programs, a tool called Shadow is used [11]. Each benchmark is traced for ten million instructions, which are fed into the multiple branch predictor. In order to obtain the trace uniformly from a wide execution range of each benchmark, two million instructions are sampled five times within the first fifty million instructions.

The Branch Address Cache (BAC) has 1024 entries with a set associativity of four. The Branch History Table (BHT) is also 1024-entry and 4-way set associative, using the LRU (Least-Recently-Used) algorithm for replacement. We vary the BHR length from 4 to 14 bits and the number of tables for the PHT from 1 to 256. The size of our instruction cache is 32K bytes with a block size of 16 bytes, and it is 8-way interleaved with a set associativity of two. The miss penalty is 4 cycles. We assume that a fetch address can access two banks simultaneously. Consequently, a maximum of 16 instructions can be supplied from the instruction cache to the processing unit in each cycle, which is the maximum fetch bandwidth.

We compare our multiple branch predictor, MPAp, with the original multiple branch predictor MGAp of Yeh and Patt [4]. For clarity, the MPAp and the MGAp will also be referred to as the per-primary address history scheme and the global history scheme, respectively.

4.2 The Results

4.2.1 Opportunities for Multiple Branch Prediction

As described above, a maximum of 16 instructions can be fetched from the instruction cache in each cycle. When basic blocks are large, as in most floating-point benchmarks, the chance of performing multiple branch prediction is low. Figure 4 (a) and (b) describe the distribution of execution cycles by the number of branches predicted in each cycle for two and three branch predictions, respectively.

[Figure 4: Branch prediction utilization when two and three basic blocks are fetched. (a) two branch predictions (b) three branch predictions. Horizontal axis: benchmarks (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: branch prediction utilization [%].]

Zero-branch prediction occurs when we are fetching a long sequential segment of code or when the fetch address misses in the BAC. Floating-point programs have a relatively high frequency of zero-branch prediction due to their extremely long sequential code segments, which are executed repeatedly. Only spice2g6 offers many opportunities for multiple branch prediction, since its average basic block size on the trace is small (i.e., 4.3 instructions). For integer benchmarks, the average percentages of cycles with zero, one, and two branch predictions are 4.5%, 50.0%, and 45.6%, respectively, for two branch predictions. For floating-point benchmarks, the values are 71.1%, 24.2%, and 21.3%, respectively. For three branch predictions, the distribution of cycles for integer benchmarks is 4.5% (0), 33.3% (1), 33.1% (2), and 29.2% (3), while that of floating-point benchmarks is 70.6% (0), 17.5% (1), 16.3% (2), and 12.2% (3).
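For the integer benchmarks, the cycle distributions quoted above imply an average number of branches predicted per cycle, which is a simple weighted mean. The sketch below is illustrative only: the function name is hypothetical, and the shares are divided by their sum because the reported integer percentages add up to 100.1% after rounding.

```python
def mean_branches_per_cycle(distribution):
    """Weighted mean of branches predicted per cycle.

    `distribution` maps a branch count (0, 1, 2, ...) to the percentage of
    cycles in which that many branches were predicted.
    """
    total = sum(distribution.values())
    return sum(count * share for count, share in distribution.items()) / total


# Integer benchmark averages as quoted in the text.
integer_two_branch = {0: 4.5, 1: 50.0, 2: 45.6}
integer_three_branch = {0: 4.5, 1: 33.3, 2: 33.1, 3: 29.2}

print(f"{mean_branches_per_cycle(integer_two_branch):.2f}")    # 1.41
print(f"{mean_branches_per_cycle(integer_three_branch):.2f}")  # 1.87
```

Under these figures, allowing a third prediction raises the average from about 1.41 to about 1.87 branches predicted per cycle for integer code.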

4.2.2 Branch Prediction Accuracy

Figure 5 (a) depicts the average prediction accuracy of the integer benchmarks when MPAp and MGAp are used for two branch predictions, as a function of the BHR length and the number of tables for the PHT. The accuracy of MGAp is sensitive to both parameters. When the BHR is only 4 bits long and a single PHT is employed, the average prediction accuracy is below 78 percent. A 14-bit BHR and 256 PHTs are needed to obtain the maximum prediction accuracy for MGAp. The average prediction accuracy for integer programs ranges from 77.7% to 96.5% depending on the size of the hardware.

[Figure 5: The prediction accuracies of MPAp and MGAp scheme for two branch predictions. (a) integer programs (b) floating-point programs. Horizontal axis: BHR length [bits]; vertical axis: prediction accuracy [%]; curves: MPAp and MGAp with PHT=1, PHT=16, and PHT=256.]

On the other hand, the prediction accuracy of MPAp is not very sensitive to the BHR length and the number of PHTs. Using a 4-bit BHR and a single PHT, we obtain a prediction accuracy of 92%, which outperforms MGAp by more than 14 percentage points. The average prediction accuracy for integer programs ranges from 92.0% to 96.9% depending on the size of the hardware.

Figure 5 (b) shows the average prediction accuracy of floating-point programs with the same hardware configurations as above. For floating-point benchmarks, the accuracy curve of MGAp is less sensitive to the BHR length and the number of PHTs than for integer programs. This is due to the periodic branch behavior of floating-point programs, which makes their branches easier to predict. The average prediction accuracy of MGAp ranges from 87.4% to 95.5% (see footnote 1), whereas the average prediction accuracy of MPAp ranges from 94.8% to 95.8%, again less sensitive to the size of the hardware.

Footnote 1: Our prediction accuracy of MGAp appears to be a little lower than the result of Yeh and Patt because we include the result of nasa7, which is missing in theirs. The prediction accuracy of nasa7 is the lowest and reduces the average accuracy of floating-point benchmarks from around 98% down to 95%.

Figure 6 shows the branch prediction accuracy for three branch predictions, which exhibits curves similar to those for two branch predictions in Figure 5. One thing to note is that the prediction accuracy of three branch predictions is lower than that of two branch predictions, which is expected given the additional interference caused by more sharing. However, the difference is much smaller for MPAp than for MGAp, which indicates the stability of our scheme.

[Figure 6: The prediction accuracies of MPAp and MGAp scheme for three branch predictions. (a) integer programs (b) floating-point programs. Horizontal axis: BHR length [bits]; vertical axis: prediction accuracy [%]; curves: MPAp and MGAp with PHT=1, PHT=16, and PHT=256.]

4.2.3 Comparison of Prediction Accuracy under the Same Hardware Budget

We compare the prediction accuracy of MGAp and MPAp under the same hardware budget. Table 1 gives the estimated hardware costs (number of bits) for MGAp and MPAp as a function of the BHR length and the number of PHTs.

Table 1: Branch predictor configurations and their estimated costs; b is the number of entries in the BHT; s is the set associativity of the BHT; m is the number of branches (2 or 3) predicted per cycle.

scheme name    BHR length    number of PHTs    hardware cost (bits)
MGAp(h,p)      h             p                 h + p * 2^h * 2
MPAp(h,p)      h             p                 b*s*h*m + p * 2^h * 2

Given 128K bits of hardware budget, the best prediction accuracy of MGAp and MPAp for two branch predictions is obtained with MGAp(8,256) and MPAp(10,16), respectively. Applying the hardware cost function, the cost of MGAp(8,256) is exactly 128K bits, whereas that of MPAp(10,16) is only 112K bits. Figure 7 (a) compares these best prediction accuracies of both configurations for each benchmark. For most benchmarks, MPAp(10,16) outperforms MGAp(8,256). The average prediction accuracy of MGAp(8,256) is 95.1% (integer) and 94.9% (floating-point), whereas that of MPAp(10,16) is 96.1% (integer) and 95.7% (floating-point).

Figure 7 (b) depicts the same graph for a hardware budget of 512K bits, where the best prediction accuracy is obtained by MGAp(10,256) and MPAp(12,16), respectively. The cost of MGAp(10,256) is exactly 512K bits, whereas that of MPAp(12,16) is only 224K bits. Nevertheless, the latter shows a similar yet higher prediction accuracy. The average prediction accuracy of MGAp(10,256) is 95.9% (integer) and 95.3% (floating-point), whereas that of MPAp(12,16) is 96.4% (integer) and 95.7% (floating-point). It is very encouraging that our proposed scheme outperforms the previous one with only 44% of the hardware cost.

[Figure 7: The prediction accuracies of MGAp and MPAp with the same implementation cost for two branch predictions. (a) The implementation of 128K bits: MGAp(8,256) vs. MPAp(10,16) (b) The implementation of 512K bits: MGAp(10,256) vs. MPAp(12,16). Horizontal axis: benchmark programs (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: prediction accuracy [%].]

The simulation is repeated for three branch predictions, as shown in Figure 8. Given 128K bits of hardware budget, MGAp(8,256) and MPAp(8,16) are selected. The cost of MGAp(8,256) does not change with the number of predicted branches, whereas the actual hardware cost of MPAp(8,16) is only 104K bits. The average prediction accuracies of MGAp(8,256) are 93.6% and 93.8% for integer and floating-point programs, respectively. For MPAp(8,16), they increase to 94.2% and 95.1%. For the hardware budget of 512K bits, MGAp(10,256) and MPAp(12,16) are simulated. For MGAp(10,256), whose cost is exactly 512K bits, the average prediction accuracies are 94.4% and 95.1% for integer and floating-point programs. The respective accuracies of MPAp(12,16) are 94.6% and 95.2%, achieved with only 272K bits, i.e., 53% of the hardware cost. Compared with the results for two branch predictions, the prediction accuracies decrease by 0.2 to 1.8 percentage points because of the increased sharing.

[Figure 8: The prediction accuracies of MGAp and MPAp with the same implementation cost for three branch predictions. (a) The implementation of 128K bits: MGAp(8,256) vs. MPAp(8,16) (b) The implementation of 512K bits: MGAp(10,256) vs. MPAp(12,16). Horizontal axis: benchmark programs (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: prediction accuracy [%].]

4.2.4 Fetch Bandwidth of MGAp and MPAp

We evaluate how the increased prediction accuracy of MPAp increases the fetch bandwidth of the superscalar machine. We measure IPC_f, the average number of instructions that can be fetched from the instruction cache in each cycle. Figure 9 (a) compares the IPC_f of MGAp(8,256) and MPAp(10,16) for two branch predictions under the budget of 128K bits. For integer benchmarks, the graph indicates that the increased prediction accuracy of MPAp(10,16) yields a better fetch bandwidth than MGAp(8,256), i.e., 7.12 vs. [...]. For floating-point benchmarks, the increase is [...] vs. [...]. Figure 9 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16) for two branch predictions under 512K bits, which indicates a similar result. For integer benchmarks, the fetch bandwidth of MPAp(12,16) compared with MGAp(10,256) is 7.23 vs. [...]. For floating-point benchmarks, the increase is [...] vs. [...].

[Figure 9: IPC_f of MGAp and MPAp for two branch predictions. (a) The implementation of 128K bits: MGAp(8,256) vs. MPAp(10,16) (b) The implementation of 512K bits: MGAp(10,256) vs. MPAp(12,16). Horizontal axis: benchmarks (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: instructions per fetch.]

Figure 10 (a) depicts the simulation results of MGAp(8,256) and MPAp(8,16) for three branch predictions. The fetch bandwidth of MPAp(8,16) again exceeds that of MGAp(8,256), both for integer (7.53 vs. 7.52) and floating-point programs (11.44 vs. [...]). Finally, Figure 10 (b) compares the IPC_f of MGAp(10,256) and MPAp(12,16). The fetch bandwidth increases from 7.78 to 7.91 for integer benchmarks, and from [...] to [...] for floating-point benchmarks. Three branch predictions obtain a better fetch bandwidth than two branch predictions, although their prediction accuracy is lower. This is because more basic blocks can be fetched simultaneously when the prediction is correct, even if the overall prediction accuracy is slightly lower.

[Figure 10: IPC_f of MGAp and MPAp for three branch predictions. (a) The implementation of 128K bits: MGAp(8,256) vs. MPAp(8,16) (b) The implementation of 512K bits: MGAp(10,256) vs. MPAp(12,16). Horizontal axis: benchmarks (eq, es, li, gc, na, dd, sp, tc, mt, fp); vertical axis: instructions per fetch.]

5 Summary

We have proposed an enhanced mechanism of multiple branch prediction in which the interference among branches is reduced and the prediction of subsequent branches does not depend on the unresolved prediction of the preceding branch, thus improving the overall prediction accuracy. The experimental results indicate that our scheme can achieve a much better prediction accuracy (i.e., by as much as 7% to 14% with a 4-bit BHR and a single PHT) than the previous global history scheme of Yeh and Patt. Even when the hardware budget for multiple branch prediction is kept the same, our scheme still achieves a higher prediction accuracy at a lower hardware cost (i.e., by as much as 2.7% in some benchmarks with only 44% of the hardware cost). Finally, the increased prediction accuracy results in better fetch bandwidth, which is essential for enhancing the performance of superscalar processors.

References

[1] T.-Y. Yeh and Y.N. Patt, Two-level adaptive branch prediction, in: Proc. Micro-24, (1991), 51-61.
[2] T.-Y. Yeh and Y.N. Patt, Alternative implementations of two-level adaptive branch prediction, in: Proc. ISCA '92, (1992), 124-134.
[3] T.-Y. Yeh and Y.N. Patt, A comparison of dynamic branch predictors that use two levels of branch history, in: Proc. ISCA '93, (1993).
[4] T.-Y. Yeh, D.T. Marr, and Y.N. Patt, Increasing the instruction fetch rate via multiple branch prediction and a branch address cache, in: ICS '93, (1993), 67-76.
[5] J.E. Smith, A study of branch prediction strategies, in: Proc. ISCA '81, (1981), 135-148.
[6] J.K.L. Lee and A.J. Smith, Branch prediction strategies and branch target buffer design, IEEE Computer, 17 (1984), 6-22.
[7] S. McFarling and J. Hennessy, Reducing the cost of branches, in: Proc. ISCA '86, (1986), 396-403.
[8] K. So, S.-T. Pan and J.T. Rahmeh, Improving the accuracy of dynamic branch prediction using branch correlation, in: Proc. ASPLOS-5, (1992), 76-84.
[9] S. Dutta and M. Franklin, Control flow prediction with tree-like subgraphs for superscalar processors, in: Proc. ISCA '95, (1995), 258-263.
[10] Sun Microsystems, The SPARC Architecture Manual, (Prentice-Hall, 1992).
[11] Sun Microsystems, Introduction to SHADOW, (Sun Microsystems, 1989).
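As a cross-check of the hardware costs quoted in Section 4.2.3, the cost functions of Table 1 can be evaluated directly. The sketch below is an illustration, not part of the original study: the function names are hypothetical, and the defaults b = 1024 and s = 4 are taken from the 1024-entry, 4-way set-associative BHT described in Section 4.1, on the assumption that these are the values the table's b and s stand for.

```python
K = 1024  # bits per "K bits"


def mgap_cost_bits(h, p):
    """Cost of MGAp(h, p): one h-bit global history register plus
    p pattern history tables of 2**h two-bit counters each."""
    return h + p * (2 ** h) * 2


def mpap_cost_bits(h, p, m, b=1024, s=4):
    """Cost of MPAp(h, p) for m branches per cycle: a BHT holding b*s
    history registers of h*m bits, plus p PHTs of 2**h two-bit counters."""
    return b * s * h * m + p * (2 ** h) * 2


configs = [
    ("MGAp(8,256)", mgap_cost_bits(8, 256)),
    ("MGAp(10,256)", mgap_cost_bits(10, 256)),
    ("MPAp(10,16), m=2", mpap_cost_bits(10, 16, m=2)),
    ("MPAp(12,16), m=2", mpap_cost_bits(12, 16, m=2)),
    ("MPAp(8,16), m=3", mpap_cost_bits(8, 16, m=3)),
    ("MPAp(12,16), m=3", mpap_cost_bits(12, 16, m=3)),
]
for name, bits in configs:
    print(f"{name}: {bits // K}K bits")
```

With these assumptions the sketch gives 128K and 512K bits for the two MGAp configurations and 112K, 224K, 104K, and 272K bits for the four MPAp configurations, which agrees with the figures in the text; 224K and 272K are about 44% and 53% of 512K, respectively.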

Portland State University ECE 587/687. Branch Prediction

Portland State University ECE 587/687. Branch Prediction Portland State University ECE 587/687 Branch Prediction Copyright by Alaa Alameldeen and Haitham Akkary 2015 Branch Penalty Example: Comparing perfect branch prediction to 90%, 95%, 99% prediction accuracy,

More information

Branch Prediction using Advanced Neural Methods

Branch Prediction using Advanced Neural Methods Branch Prediction using Advanced Neural Methods Sunghoon Kim Department of Mechanical Engineering University of California, Berkeley shkim@newton.berkeley.edu Abstract Among the hardware techniques, two-level

More information

A Detailed Study on Phase Predictors

A Detailed Study on Phase Predictors A Detailed Study on Phase Predictors Frederik Vandeputte, Lieven Eeckhout, and Koen De Bosschere Ghent University, Electronics and Information Systems Department Sint-Pietersnieuwstraat 41, B-9000 Gent,

More information

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction

Pipeline no Prediction. Branch Delay Slots A. From before branch B. From branch target C. From fall through. Branch Prediction Pipeline no Prediction Branching completes in 2 cycles We know the target address after the second stage? PC fetch Instruction Memory Decode Check the condition Calculate the branch target address PC+4

More information

Fall 2011 Prof. Hyesoon Kim

Fall 2011 Prof. Hyesoon Kim Fall 2011 Prof. Hyesoon Kim Add: 2 cycles FE_stage add r1, r2, r3 FE L ID L EX L MEM L WB L add add sub r4, r1, r3 sub sub add add mul r5, r2, r3 mul sub sub add add mul sub sub add add mul sub sub add

More information

ICS 233 Computer Architecture & Assembly Language

ICS 233 Computer Architecture & Assembly Language ICS 233 Computer Architecture & Assembly Language Assignment 6 Solution 1. Identify all of the RAW data dependencies in the following code. Which dependencies are data hazards that will be resolved by

More information

A Novel Meta Predictor Design for Hybrid Branch Prediction

A Novel Meta Predictor Design for Hybrid Branch Prediction A Novel Meta Predictor Design for Hybrid Branch Prediction YOUNG JUNG AHN, DAE YON HWANG, YONG SUK LEE, JIN-YOUNG CHOI AND GYUNGHO LEE The Dept. of Computer Science & Engineering Korea University Anam-dong

More information

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE

AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE CARNEGIE MELLON UNIVERSITY AFRAMEWORK FOR STATISTICAL MODELING OF SUPERSCALAR PROCESSOR PERFORMANCE A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for the degree

More information

CPSC 3300 Spring 2017 Exam 2

CPSC 3300 Spring 2017 Exam 2 CPSC 3300 Spring 2017 Exam 2 Name: 1. Matching. Write the correct term from the list into each blank. (2 pts. each) structural hazard EPIC forwarding precise exception hardwired load-use data hazard VLIW

More information
