
CHARACTERIZATION AND CLASSIFICATION OF MODERN MICRO-PROCESSOR BENCHMARKS

BY

KUNXIANG YAN, B.S.

A thesis submitted to the Graduate School in partial fulfillment of the requirements for the degree Master of Science in Electrical Engineering

New Mexico State University
Las Cruces, New Mexico

May 2004

"Characterization and Classification of Modern Micro-Processor Benchmarks," a thesis prepared by Kunxiang Yan in partial fulfillment of the requirements for the degree, Master of Science in Electrical Engineering, has been approved and accepted by the following:

Linda Lacey
Dean of the Graduate School

Jeanine M. Cook
Chair of the Examining Committee

Date

Committee in charge:
Dr. Jeanine M. Cook, Chair
Dr. Javin Taylor
Dr. Richard Oliver

VITA

1976 Born in Changsha (Hunan), China
1994 Graduated from The Second High School, XiangTan, Hunan, China
B.S. Electrical Engineering, Hunan University of Science & Technology, China
Engineer, FOXCONN Ltd, Shenzhen, Guangdong, China
Teaching Assistant, Department of ECE, New Mexico State University

PROFESSIONAL AND HONORARY SOCIETIES

Institute of Electrical and Electronics Engineers (IEEE)

FIELD OF STUDY

Major Field: Electrical & Computer Engineering

ABSTRACT

CHARACTERIZATION AND CLASSIFICATION OF MODERN MICRO-PROCESSOR BENCHMARKS

BY

KUNXIANG YAN, B.S.

Master of Science in Electrical Engineering
New Mexico State University
Las Cruces, New Mexico, 2004
Dr. Jeanine M. Cook, Chair

This work examines non-traditional techniques for workload characterization and classification analysis of different benchmark suites used in computer system performance analysis. In current computer systems research, benchmarks are often used to evaluate performance in support of micro-architecture design. In this work, two benchmark suites, SPEC CPU2000 and Berkeley Multimedia, are simulated using the Simple-Scalar 3.0 processor simulator tool set, and their performance is characterized. We examine IPC behavior and instruction mix composition, perform feature correlation using a large set of performance metrics, and extract the principal components of the metrics dataset. Additionally, C5.0 is used to construct a classifier for the benchmark suites. The objective of this work is to find characteristics that distinguish different classes of workloads. If a processor can automatically recognize the type of workload according to its distinguishing characteristics, the processor can be dynamically reconfigured to optimize performance for that workload according to its class.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 RELATED WORK
3 BACKGROUND
3.1 Metrics
3.2 Superscalar Architectures
4 METHODOLOGY
4.1 Benchmark Suites
4.2 Simple-Scalar Simulators
4.3 Statistical Tool SAS
4.4 C5.0 Decision Tree Classifier
4.5 Sampling and Experimental Methods
5 CHARACTERIZATION ANALYSIS
5.1 IPC Behavior
5.1.1 Initialization
5.1.2 Classification of IPC Behavior
5.2 Feature Correlation
5.2.1 Pearson Correlation Coefficient
5.2.2 Standard Correlation Coefficient
5.2.3 Two Independent t-test
5.2.4 Mean of Pair Correlation (MOPC)
5.2.5 Benchmark IPC Behavior vs MOC
5.3 Principal Components Analysis (PCA)
5.3.1 Calculating Principal Components
5.3.2 First Two Principal Components
5.3.3 Principal Component One (PC1) Time Series Analysis
5.3.4 Principal Component One (PC1) Frequency
5.3.5 Conclusion on Principal Components Analysis
5.4 Instruction Mix and Branch Mix Comparison
5.4.1 Reference Instructions vs Committed Instructions
5.4.2 Instruction Mix and Branch Mix
6 C5.0 DECISION TREE CLASSIFICATION
6.1 Classifier with Eight Features Input
6.2 Justification for Four Important Features
7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Work
A CORRELATION COEFFICIENT TABLE AND STANDARD CORRELATION COEFFICIENT TABLE
B PC1 AND PC2 (PCA PLOT)
C PC1 TIME SERIES
D PC1 FREQUENCY
E FEATURE BEHAVIOR
F C5.0 CLASSIFIER RULES AND MEAN ERROR RATE
REFERENCES

LIST OF TABLES

1 Description of SPEC2000INT Benchmark Suite
2 Description of SPEC2000FP Benchmark Suite
3 Description of Berkeley Multimedia Benchmark Suite
4 Default Micro-architecture Configuration in Simple-Scalar
5 IPC Behavior Classification Table
6 IPC Behavior Classification Rules
7 Correlation Coefficient Table of SPEC2000INT
8 Standard Correlation Coefficient Table of SPEC2000INT
9 Standard Coefficient Table of Highly Correlated Feature Pairs
10 MOPC Comparison
11 Benchmark IPC Behavior vs MOC
12 Percentage of Information Represented by PC1 and PC2
13 PCA Plot Classification
14 PCA Plot Correlation
15 PC1 Time Series Classification
16 PC1 Time Series Correlation
17 PC1 Frequency Classification Rules
18 PC1 Frequency Classification
19 PC1 Frequency Correlation
20 Distinguishing PCA Characteristics of the Benchmark Suites
21 Benchmark Suites Program Behavior from PCA
22 Total Number of Committed and Reference Instructions
23 PDF of Instructions and Branches Composition
24 Standard Correlation Coefficient Table of (1) IFQU, RUUU and LSQU, (2) I, D and L2
25 Mean Error Rate of the Classifiers
26 The Exceptions in Each Benchmark Suite
27 Techniques for Dynamical Workload Classification
A1 Correlation Coefficient Table of SPEC2000INT
A2 Standard Correlation Coefficient Table of SPEC2000INT
A3 Correlation Coefficient Table of SPEC2000FP
A4 Standard Correlation Coefficient Table of SPEC2000FP
A5 Correlation Coefficient Table of Berkeley Multimedia
A6 Standard Correlation Coefficient Table of Berkeley Multimedia

LIST OF FIGURES

1 Five Stages Pipeline of Modern Processor
2 Simple Superscalar Architecture
3 Simple Binary Decision Tree
4 Rules Generated from the Decision Tree in Figure 3
5 Example of Initialization Phase of Swim
6 Periodic IPC Behavior (Swim)
7 Irregular IPC Behavior (Parser)
8 Constant IPC Behavior (Twolf)
9 Con/Per IPC Behavior (Mesa)
10 Con/Irre IPC Behavior (Art)
11 Entire IPC Behavior of Gcc
12 Confidence Ellipse of the IFQU-I (Vortex)
13 Confidence Ellipse of the D-L2 (Vortex)
14 Feature Behavior of GSM Dec (Berkeley Multimedia)
15 Feature Behavior of GSM Enc (Berkeley Multimedia)
16 Simple Example of the Principal Components
17 Variance Represented by the Principal Components of Gzip
18 PCA Plot of Crafty (SPEC2000INT)
19 PCA Plot of Applu (SPEC2000FP)
20 PCA Plot of Ghostscript (Berkeley Multimedia)
21 PCA Plot of Eon (SPEC2000INT)
22 PCA Plot of Bzip2 (SPEC2000INT)
23 PCA Plot of Vortex (SPEC2000INT)
24 PCA Plot of Parser (SPEC2000INT)
25 PCA Plot of Art (SPEC2000FP)
26 PCA Plot of Mesa (SPEC2000FP)
27 PCA Plot of Galgel (SPEC2000FP)
28 PCA Plot of Equake (SPEC2000FP)
29 PCA Plot of Lucas (SPEC2000FP)
30 PCA Plot of Gsm dec (Berkeley Multimedia)
31 PCA Plot of Mpg123 (Berkeley Multimedia)
32 PC1 Time Series of Crafty (SPEC2000INT)
33 PC1 Time Series of Applu (SPEC2000FP)
34 PC1 Time Series of Ghostscript (Berkeley Multimedia)
35 PC1 Time Series of Eon (SPEC2000INT)
36 PC1 Time Series of Bzip2 (SPEC2000INT)
37 PC1 Time Series of Gcc (SPEC2000INT)
38 PC1 Time Series of Vortex (SPEC2000INT)
39 PC1 Time Series of Mcf (SPEC2000INT)
40 PC1 Time Series of Art (SPEC2000FP)
41 PC1 Time Series of Galgel (SPEC2000FP)
42 PC1 Time Series of Mpeg2 DVD En (Berkeley Multimedia)
43 PC1 Frequency of Crafty (SPEC2000INT)
44 PC1 Frequency of Applu (SPEC2000FP)
45 PC1 Frequency of Mpeg2 dec DVD (Berkeley Multimedia)
46 PC1 Frequency of Mpg123 (Berkeley Multimedia)
47 PC1 Frequency of Eon (SPEC2000INT)
48 PC1 Frequency of Bzip2 (SPEC2000INT)
49 PC1 Frequency of Gcc (SPEC2000INT)
50 PC1 Frequency of Vortex (SPEC2000INT)
51 PC1 Frequency of Mcf (SPEC2000INT)
52 PC1 Frequency of Galgel (SPEC2000FP)
53 PC1 Time Series of Parser (SPEC2000INT)
54 PC1 Time Series of Art (SPEC2000FP)
55 PC1 Frequency of Parser (SPEC2000INT)
56 PC1 Frequency of Art (SPEC2000FP)
57 PCA Characteristics Clustering Tree
58 Confidence Ellipse of the IPC-D (Twolf)
59 Total Number of Committed and Reference Instructions
60 Average Number of Committed and Reference Instructions
61 Ratio of Reference Instructions/Committed Instructions
62 Comparison of Benchmark Instruction Mix
63 Comparison of Average Benchmark Instruction Mix
64 Comparison of Benchmark Branch Mix
65 Comparison of Average Benchmark Branch Mix
B1 PCA Plot of Gzip
B2 PCA Plot of Gcc
B3 PCA Plot of Twolf
B4 PCA Plot of Vpr
B5 PCA Plot of Mcf
B6 PCA Plot of Swim
B7 PCA Plot of Mgrid
B8 PCA Plot of Ammp
B9 PCA Plot of Apsi
B10 PCA Plot of Gsm enc
B11 PCA Plot of Mpeg2 dec DVD
B12 PCA Plot of Mpeg2 enc DVD
C1 PC1 Time Series of Gzip
C2 PC1 Time Series of Twolf
C3 PC1 Time Series of Vpr
C4 PC1 Time Series of Swim
C5 PC1 Time Series of Equake
C6 PC1 Time Series of Mgrid
C7 PC1 Time Series of Ammp
C8 PC1 Time Series of Mesa
C9 PC1 Time Series of Lucas
C10 PC1 Time Series of Apsi
C11 PC1 Time Series of Mpg123
C12 PC1 Time Series of Gsm dec
C13 PC1 Time Series of Gsm enc
C14 PC1 Time Series of Mpeg2 dec DVD
D1 PC1 Frequency of Gzip
D2 PC1 Frequency of Twolf
D3 PC1 Frequency of Vpr
D4 PC1 Frequency of Swim
D5 PC1 Frequency of Equake
D6 PC1 Frequency of Mgrid
D7 PC1 Frequency of Mesa
D8 PC1 Frequency of Lucas
D9 PC1 Frequency of Ammp
D10 PC1 Frequency of Apsi
D11 PC1 Frequency of Ghostscript
D12 PC1 Frequency of Gsm dec
D13 PC1 Frequency of Gsm enc
D14 PC1 Frequency of Mpeg2 enc DVD
E1 Feature Behavior of Gzip
E2 Feature Behavior of Gcc
E3 Feature Behavior of Crafty
E4 Feature Behavior of Vortex
E5 Feature Behavior of Vpr
E6 Feature Behavior of Mcf
E7 Feature Behavior of Eon
E8 Feature Behavior of Bzip2
E9 Feature Behavior of Equake
E10 Feature Behavior of Mgrid
E11 Feature Behavior of Applu
E12 Feature Behavior of Ammp
E13 Feature Behavior of Galgel
E14 Feature Behavior of Lucas
E15 Feature Behavior of Apsi
E16 Feature Behavior of Ghostscript
E17 Feature Behavior of Mpg123
E18 Feature Behavior of Mpeg2 dec DVD
E19 Feature Behavior of Mpeg2 enc DVD

1 INTRODUCTION

For many years, modern micro-processors have been widely studied in order to enhance their performance. Processor architecture, cache performance, and branch prediction are some of the components researchers study toward that goal. The characteristics of different benchmarks influence processor performance. We believe that if a processor can dynamically change its hardware configuration to match the type of workload it is running, its performance can be increased. This leads us to search for the distinguishing characteristics that separate different classes of benchmarks.

There are many benchmark suites used in the performance evaluation of modern computer systems. These suites are grouped into categories based on what they do: CPU benchmarks (uniprocessor and parallel processor), Multimedia, Java (client side and server side), Transaction Processing (OLTP and DSS), and Web Server. Researchers use these specific benchmarks to represent all workloads in a particular class. SPEC CPU2000 is one of the uniprocessor benchmark suites commonly used in processor design, analysis and evaluation. Berkeley Multimedia is a newly developed benchmark suite used to study a processor's support for multimedia applications. In this work, we apply various workload characterization techniques to the SPEC CPU2000 and Berkeley Multimedia benchmark suites to identify their similarities and differences.

Workload characterization involves defining and measuring metrics that describe the behavior of an application. To characterize benchmarks, we define metrics that can be measured during native execution or simulation. These metrics describe the behavior of a benchmark and can be applied to all benchmarks. Three benchmark classes are studied in this work: SPEC2000INT (integer benchmarks), SPEC2000FP (floating point benchmarks), and Berkeley Multimedia (video decode/encode, audio decode/encode, and 3D graphics benchmarks). We are looking for a metric or metrics that will enable us to distinguish between benchmarks of different classes. The original metrics collected from the simulations for the above benchmark suites are: IPC, IPB, RUUU, IFQU, LSQU, I (I-cache miss ratio), D (D-cache miss ratio) and L2 (L2-cache miss ratio).

Statistics are used widely in workload characterization [15, 20]. In this work, several statistical techniques are used in an attempt to determine the distinguishing characteristics. These include feature correlation and Principal Component Analysis (PCA). We apply statistical tests to metrics such as IPC (instructions per cycle), and to instruction and branch composition, to determine differences between workload classes. Benchmark classifications based on the distinguishing characteristics are made in each analysis, and the correlations between different classes are also analyzed. Finally, several metrics are selected to construct an accurate classifier for the benchmarks using the C5.0 decision tree classification tool.

The rest of this thesis is organized as follows: Section 2 introduces related work in the field; Section 3 describes pertinent background concepts such as the metrics and the Superscalar architecture used in this work; Section 4 describes the benchmark suites, simulator, statistical analysis tool, C5.0 decision tree classifier, and experimental methods; Section 5 presents the characterization results: IPC behavior, feature correlation, Principal Component Analysis (PCA), and instruction and branch type composition; Section 6 describes how to construct an accurate classifier using the C5.0 classification tool; Section 7 draws conclusions from this work; Section 8 outlines ideas for future work.

2 RELATED WORK

To the best of our knowledge, there is no existing work that determines the specific characteristics which can be used to distinguish different classes of benchmarks. However, we find some efforts in this direction.

In [27], the behavior of the SPEC95 benchmark suite and the correlation between IPC, branch prediction, load value prediction, load address prediction, cache miss ratio and RUU occupancy are presented. Where the simulation should start and how long it should run to capture the total behavior of the SPEC95 benchmarks is discussed. They show that each program undergoes the following stages of execution: Initialization-SteadyState-Finish. They demonstrate that the initialization phase is not representative of the true program behavior, so it should not be included in analysis. Program behavior after the initialization phase is characterized and classified into two categories: (1) visible large-scale cyclic behavior repeated until program completion (Vortex, Applu, Fppp, Mgrid, Tomcatv, Turb3d, Wave5 and Su2cor), and (2) non-cyclic steady-state behavior throughout program execution (Gcc, Go, Li, M88ksim, Apsi, Hydro2d and Swim). This paper gives us a direction on how to classify the IPC behavior of benchmarks.

Commonly, computer micro-architecture designers simulate a small section of a benchmark and assume it to be representative of the whole program. In [26], a methodology is described to find the smallest program subset that represents the entire program's behavior using Basic Block Distribution analysis. We use the information on the initialization phase and cyclic (periodic) behavior of the SPEC CPU2000 programs from this paper.

In [14], the author describes the goals that can be realized by using workload characterization. It points out that workload characterization can be used for: (1) understanding workloads and effectively interpreting simulation results, (2) designing machines to suit particular workloads or adapting the micro-architecture to suit the program, (3) guiding the choice of programs to be contained in benchmark suites, and (4) creating a program behavior model to be used in an analytical performance model of computer systems. It also specifies the micro-architecture-independent metrics: static instruction size, dynamic instruction size, instruction mix, control transfer instruction mix (direct branches, indirect branches, calls, jumps, returns) and consistency of branch targets. Cache hit ratios, cache bus utilization, branch prediction rate and throughput are micro-architecture dependent.

In [13, 20], different techniques used for performance evaluation are introduced. Performance measurement includes on-chip performance monitoring counters, off-chip hardware monitoring, software monitoring and micro-coded instrumentation. Simulation includes trace-driven, execution-driven and statistical simulation. Several benchmarks used in micro-architecture research are also studied. These papers present an overall view of the commonly used techniques and tools in system performance analysis.

In [9], the author presents a method to reduce the time required by simulation by dynamically varying the complexity of the processor model during simulation. He describes how to construct an accurate classifier using C5.0 to distinguish two processor models defined by the author: a no-pipeline model and a full-pipeline model. The author finds the accuracy of the classifier constructed by C5.0 to be good; this prompts us to choose C5.0 for our study.

Nathan T. Slingerland presents the Berkeley Multimedia benchmarks in [18], which were developed for use in computer micro-architecture performance evaluation. The paper describes the function and characterization of the MPEG, Intel Media Benchmark, UCLA MediaBench and Berkeley Multimedia benchmarks, and observes the differences between the Berkeley Multimedia and SPEC95 benchmarks in an instruction mix comparison. He points out that SPEC95FP has more floating point operations, while SPEC95INT executes more branch instructions. Multimedia benchmarks are characterized by a greater number of shift/logical operations than the SPEC95 benchmarks.

The issues pertaining to and methodologies for workload characterization are described in [15]. They state that descriptive statistics of workload parameters and clustering are the two most commonly used approaches in workload characterization analysis. Descriptive statistics (such as mean, maximum, minimum and standard error) are used to summarize the data, and clustering is used to classify the dataset.

In [16], the author presents a survey of methodologies for the construction of workload models of different computer systems (batch, interactive, database, network-based, parallel and supercomputer). This paper identifies five common steps for workload characterization analysis: (1) choice of the set of parameters for describing the workload behavior, (2) choice of the appropriate tools, (3) experimental measurement collection, (4) analysis of workload data, and (5) construction of workload models.

The properties, algorithms and applications of principal component analysis (PCA) are presented in [2, 12, 17]. Principal components analysis is a statistical data analysis technique based on the assumption that many variables in a dataset are correlated and hence measure the same or similar properties of the input data. The correlations between the variables make it hard to find the important characteristics. Therefore, principal components are generated for analyzing the input dataset: all of the principal components are linear combinations of the original variables, and they are uncorrelated. In [2], the author introduces how to find similarities in micro-processor program behavior by applying PCA and clustering to the input dataset. The program behavior of the SPEC2000INT benchmark suite, including reference and train input sets, is studied using PCA in [12]. They conclude that for almost all program-input pairs, the train input sets can be treated as representative of their respective reference program-input pairs.

In this section, we highlighted some important works on processor simulation, workload characterization and classification. Our objective is to find the characteristics that distinguish workloads and classify different benchmark suites. We give the background of our work in the next section.
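The PCA technique surveyed above can be sketched in a few lines. The following is a minimal illustration (not the SAS procedure used in this thesis) that extracts principal components from a metrics matrix via eigendecomposition of the correlation matrix; the input data here is synthetic, with one metric deliberately made perfectly correlated with another:

```python
import numpy as np

def principal_components(X):
    """Return eigenvalues and eigenvectors (principal directions) of the
    correlation matrix of X, sorted by decreasing explained variance.
    X is an (observations x metrics) matrix."""
    # Standardize each metric (zero mean, unit variance), as PCA on the
    # correlation matrix assumes.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]   # largest variance first
    return eigvals[order], eigvecs[:, order]

# Hypothetical per-interval metrics: the columns could stand in for IPC,
# IPB and D-cache miss ratio; the third column is an exact linear function
# of the first, so the two are perfectly (negatively) correlated.
rng = np.random.default_rng(0)
ipc = rng.normal(1.0, 0.2, 500)
X = np.column_stack([ipc, rng.normal(8.0, 1.0, 500), 0.1 - 0.05 * ipc])

eigvals, eigvecs = principal_components(X)
pc_scores = ((X - X.mean(axis=0)) / X.std(axis=0)) @ eigvecs
print(eigvals / eigvals.sum())   # fraction of variance captured per component
```

Because two of the three columns carry the same information, the first component captures roughly two thirds of the total variance, which is exactly the dimensionality reduction PCA is used for in Section 5.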

3 BACKGROUND

In this section, the background knowledge for our work is introduced. We describe the data we use in the workload characterization, and we present the micro-architecture and characteristics of Superscalar processors.

3.1 Metrics

Metrics are measured values that define behavior [21]. They specify what should be measured to compare different objects. The eight representative metrics that we choose to analyze benchmark characteristics are:

IPC: Instructions per cycle.
IPB: Instructions per branch.
RUUU: Register update unit utilization.
IFQU: Instruction fetch queue utilization.
LSQU: Load and store queue utilization.
I: I-cache miss ratio.
D: D-cache miss ratio.
L2: L2-cache miss ratio.

The L1 I-cache is a fast memory that holds fetched instructions; the L1 D-cache is a fast memory that holds the data used by instructions; the unified L2-cache holds both instructions and data, and is larger but slower than the L1 caches. Cache miss ratio is defined as the fraction of the total cache accesses that result in a cache miss. It is a common metric used to measure cache performance. The utilization rates measure the average occupancy of various micro-architecture components: RUUU, IFQU and LSQU measure the utilization of the RUU, IFQ and LSQ, respectively. Their functions are described in Section 3.2; their sizes and configurations are listed in Table 4.

3.2 Superscalar Architectures

Contemporary microprocessors are Superscalar and execute instructions out of order. Superscalar refers to the issue of multiple instructions every cycle; a typical Superscalar processor fetches and decodes several instructions at the same time [7]. Out-of-order execution allows instructions to be executed in an order other than the program order. A modern micro-processor has three special characteristics [9]:

1. Superscalar instruction fetch/issue. Multiple instructions can be fetched and issued every clock cycle. Every processor implements a pipeline, though the number of stages may differ. In a commonly used pipeline, an instruction is executed in the following stages: IF, ID, EX, MEM and WB (Figure 1), so at most five cycles are needed to finish one instruction.
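The eight metrics of Section 3.1 are all simple ratios of raw event counts that a timing simulator reports for each sampling interval. As an illustration only (the counter names below are hypothetical and do not correspond to literal Simple-Scalar output fields), one interval's feature vector could be derived like this:

```python
def interval_metrics(c):
    """Derive the eight features of Section 3.1 from raw event counts for
    one sampling interval. `c` maps hypothetical counter names to counts;
    occupancy counters are summed per-cycle occupancies, so dividing by
    cycles * structure size yields a 0..1 utilization."""
    return {
        "IPC":  c["insts"] / c["cycles"],            # instructions per cycle
        "IPB":  c["insts"] / c["branches"],          # instructions per branch
        "RUUU": c["ruu_occupancy"] / (c["cycles"] * c["ruu_size"]),
        "IFQU": c["ifq_occupancy"] / (c["cycles"] * c["ifq_size"]),
        "LSQU": c["lsq_occupancy"] / (c["cycles"] * c["lsq_size"]),
        "I":    c["il1_misses"] / c["il1_accesses"],   # I-cache miss ratio
        "D":    c["dl1_misses"] / c["dl1_accesses"],   # D-cache miss ratio
        "L2":   c["ul2_misses"] / c["ul2_accesses"],   # L2-cache miss ratio
    }

# One hypothetical 100M-instruction interval.
counts = {"insts": 100_000_000, "cycles": 80_000_000, "branches": 12_500_000,
          "ruu_occupancy": 640_000_000, "ruu_size": 16,
          "ifq_occupancy": 96_000_000, "ifq_size": 4,
          "lsq_occupancy": 256_000_000, "lsq_size": 8,
          "il1_misses": 200_000, "il1_accesses": 100_000_000,
          "dl1_misses": 1_500_000, "dl1_accesses": 30_000_000,
          "ul2_misses": 400_000, "ul2_accesses": 1_700_000}
m = interval_metrics(counts)
print(m["IPC"], m["D"])   # 1.25 0.05
```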

Figure 1: Five Stages Pipeline of Modern Processor (IF: instruction fetch; ID: instruction decode; EX: execution, with multiple parallel execution units; MEM: memory fetch; WB: write back)

A typical CPU contains multiple functional units of the same type in order to exploit instruction-level parallelism (ILP) [3]; there are multiple execution units at the EX stage. Instructions are first fetched from the I-cache and stored in an instruction buffer. One or more instructions are then issued to the reservation stations in the RUU and dispatched from there to the execution units. Finally, the results generated by the execution units are stored into the reorder buffer and written into the registers [3, 11].

A simple Superscalar architecture is described in Figure 2. This processor can execute two integer instructions, two floating point instructions, one branch and one load/store in the same clock cycle (Figure 2). The instruction fetch queue (IFQ) is the instruction buffer used to store the instructions fetched from the I-cache. Decoded load/store instructions are sent from the instruction fetch queue to the load/store queue (LSQ); all others are sent to the register update unit (RUU). The address used in a load/store instruction is calculated in the address generation unit (AGU). The LSQ is used to hold speculative store values and load instructions [4] that will be sent to the memory system. From the memory system, the desired load value is sent to the RUU. The RUU implements a reorder buffer to automatically rename registers and hold the results of pending instructions; it is also used to hold the values loaded from memory [4].

Figure 2: Simple Superscalar Architecture (I-cache: instruction cache; D-cache: data cache; IFQ: instruction fetch queue; LSQ: load/store queue; RUU: register update unit; AGU: address generation unit; Branch Unit: branch prediction unit; Integer Unit: integer execution unit; FP Unit: floating point execution unit; Load/Store Unit: load/store execution unit)

2. Out-of-order execution. Out-of-order execution allows instructions to be issued to the functional units in an order different from program order. The main obstacles to out-of-order execution are instruction dependencies. There are three types of dependencies: data dependency, name dependency and control dependency. Data dependency is the most common. An instruction j is data dependent on instruction i if either of the following holds:

- Instruction i produces a result that may be used by instruction j.
- Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

If two instructions are data dependent, they cannot be executed simultaneously. To sustain instruction-level parallelism, the Superscalar processor schedules independent instructions for execution to avoid hazards caused by data dependency [3].

3. Speculative execution. Speculation enables the processor to execute conditional instructions based on prediction. Conditional instructions include branches and conditional load instructions. Instructions on the speculative path can be fetched and executed ahead of schedule. If the prediction is correct, the speculatively executed results are used and throughput increases. If the prediction is found to be wrong, the speculatively executed instructions are flushed from the pipeline.
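The two-part data-dependency definition above (direct use of a result, plus transitivity through an intermediate instruction) can be made concrete with a small sketch. The three-operand instruction form here is invented purely for illustration; real dependence analysis must also handle name and control dependences:

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read

def data_dependent(prog, j, i):
    """True if prog[j] is data dependent on prog[i], applying the two
    rules from Section 3.2: j reads i's result directly, or j reads the
    result of some k that is itself data dependent on i."""
    if prog[i].dest in prog[j].srcs:
        return True
    return any(prog[k].dest in prog[j].srcs and data_dependent(prog, k, i)
               for k in range(i + 1, j))

prog = [
    Instr("r1", ("r2", "r3")),   # i: r1 = r2 + r3
    Instr("r4", ("r1", "r5")),   # k: r4 = r1 * r5  (depends on i via r1)
    Instr("r6", ("r4", "r7")),   # j: r6 = r4 - r7  (depends on k, hence on i)
]
print(data_dependent(prog, 2, 0))   # True: chain through r4 and r1
```

An out-of-order scheduler may issue instructions in any order consistent with these edges; here the third instruction cannot issue with either of the first two, which is exactly the hazard the text describes.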

4 METHODOLOGY

In this section, we describe the benchmark suites, tools and experimental methods used in this work. The SPEC CPU2000 and Berkeley Multimedia benchmark suites are studied. The Simple-Scalar 3.0 tool set is used as the experimental micro-architecture simulator, SAS V8.0 is used to compute the statistics pertaining to the benchmark metrics, and the C5.0 tool is used to construct classifiers based on the workload metrics. IPC behavior, feature correlation, principal component analysis (PCA), instruction and branch composition, and classification are the experimental metrics and methods examined to find the distinguishing characteristics between the SPEC2000INT, SPEC2000FP and Berkeley Multimedia benchmark suites.

4.1 Benchmark Suites

Benchmarks are programs used to evaluate or study the performance of the processor, memory and compiler on a target system. In this work, two benchmark suites are studied: SPEC CPU2000 and Berkeley Multimedia.

SPEC CPU2000 is an industry-standardized, CPU-intensive benchmark suite used to measure the performance of a processor [25]. It is divided into two categories: CINT2000 and CFP2000. The former are integer benchmarks written in C and C++, and the latter are floating point benchmarks written in C and Fortran. Tables 1 and 2 show the benchmarks that we choose from the SPEC CPU2000 benchmark suite.

Berkeley Multimedia is a newly developed benchmark suite used to study a processor's support for multimedia workloads. It includes video decode and encode benchmarks, audio decode and encode benchmarks, and 3D graphics benchmarks [18], as shown in Table 3.

There are a total of twelve integer and fourteen floating-point benchmarks in SPEC CPU2000. Ten benchmarks each from SPEC2000INT and SPEC2000FP, and six from Berkeley Multimedia, are chosen for this study. In the SPEC CPU2000 benchmark suites, every benchmark has three different input sets: train, test and reference. The train input set is the smallest and the reference input set is the largest. In this work, we choose the reference input set.

Table 1: Description of SPEC2000INT Benchmark Suite [25]

164.Gzip     C     Compression
176.Gcc      C     C Programming Language Compiler
186.Crafty   C     Game Playing: Chess
255.Vortex   C     Object-oriented Database
300.Twolf    C     Place and Route Simulator
175.Vpr      C     FPGA Circuit Placement and Routing
181.Mcf      C     Combinatorial Optimization
197.Parser   C     Word Processing
252.Eon      C++   Computer Visualization
256.Bzip2    C     Compression

Table 2: Description of SPEC2000FP Benchmark Suite [25]

171.Swim     Fortran 77   Shallow Water Modeling
183.Equake   C            Seismic Wave Propagation Simulation
172.Mgrid    Fortran 77   Multi-grid Solver: 3D Potential Field
173.Applu    Fortran 77   Parabolic / Elliptic Partial Differential Equations
188.Ammp     C            Computational Chemistry
177.Mesa     C            3-D Graphics Library
178.Galgel   Fortran 90   Computational Fluid Dynamics
179.Art      C            Image Recognition / Neural Networks
189.Lucas    Fortran 90   Number Theory / Primality Testing
301.Apsi     Fortran 77   Meteorology: Pollutant Distribution

Table 3: Description of Berkeley Multimedia Benchmark Suite [18]

Ghostscript     Postscript document viewing/rendering
Mpg123          MPEG-1 Layer III (MP3) audio decoder
GSM Dec         European GSM speech compression decoding
GSM Enc         European GSM speech compression encoding
MPEG2 Dec DVD   MPEG-2 DVD video decoding
MPEG2 Enc DVD   MPEG-2 DVD video encoding

4.2 Simple-Scalar Simulators

A micro-architecture simulator is a virtual machine used to simulate the functionality and performance of a target microprocessor and its subsystems. The Simple-Scalar 3.0 tool set [4] is an execution-driven simulator suite; the input to an execution-driven simulator is executable code. The tool set consists of functional simulators (sim-safe, sim-fast), a timing simulator (sim-outorder), cache simulators (sim-cache, sim-cheetah) and a profiling simulator (sim-profile). In this work, two simulators are selected from the suite, sim-outorder and sim-profile, to collect execution statistics for the Alpha AXP ISA. We modify them to output statistics every 100 million instructions for SPEC2000 and every one million instructions for Berkeley Multimedia.

Sim-outorder is used to generate the simulation output statistics pertaining to hardware performance. It is an out-of-order, Superscalar-issue simulator that models the timing of instructions as they execute on the simulated processor model. Sim-outorder also supports non-blocking caches and speculative execution. Sim-profile is used to determine the instruction profile (instruction mix) of the benchmarks [4, 9, 13].

We choose the default micro-architecture configuration for sim-outorder in our research. It is a generic Superscalar similar to the MIPS. The default configuration is shown in Table 4.

Table 4: Default Micro-architecture Configuration in Simple-Scalar

Instruction decode/issue B/W      out-of-order issue, 4 instructions/cycle
Architecture registers            32 bits integer/floating point
Branch predictor                  bimod, BTB size 2k
Data cache (L1)                   4k, 32B blocks, 4-way set associative, 1-cycle hit latency
Instruction cache (L1)            16k, 32B blocks, direct mapped, 1-cycle hit latency
Unified L2 cache                  64k, 64B blocks, 4-way set associative, 6-cycle hit latency
Instruction fetch queue (IFQ)     4 entries
Register update unit (RUU)        16 entries
Load/store queue (LSQ)            8 entries
Functional units                  4 INT ALU, 4 FP ALU, 1 integer MULT/DIV unit, 1 FP MULT/DIV unit
Memory access bus width           64 bits
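The per-interval statistics mentioned above (every 100 million instructions for SPEC2000, every one million for Berkeley Multimedia) can be obtained by differencing cumulative counter snapshots at each sampling point, so that every observation reflects only its own interval. A sketch of that post-processing step, with hypothetical counter names rather than literal sim-outorder output fields:

```python
def interval_samples(cumulative):
    """Convert a list of cumulative counter snapshots (dicts recorded
    every N instructions) into per-interval deltas, so each sample
    reflects interval behavior rather than cumulative behavior."""
    samples = []
    prev = {k: 0 for k in cumulative[0]}
    for snap in cumulative:
        samples.append({k: snap[k] - prev[k] for k in snap})
        prev = snap
    return samples

# Two hypothetical snapshots taken 100M instructions apart.
snaps = [
    {"insts": 100_000_000, "cycles": 90_000_000},
    {"insts": 200_000_000, "cycles": 150_000_000},
]
ivals = interval_samples(snaps)
# Interval IPC in the second interval (100M / 60M cycles) differs from
# the cumulative IPC at the same point (200M / 150M cycles):
print(ivals[1]["insts"] / ivals[1]["cycles"])
```

This distinction matters for the phase analysis in Section 5: cumulative averages smooth away exactly the periodic and irregular interval-to-interval behavior the characterization is looking for.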

4.3 Statistical Tool SAS

SAS is a well-known statistical analysis tool [22]. SAS V8.0 is used to statistically analyze the benchmark metrics. The statistics that we examine are:

1. Correlation: a measurement of association between two variables. It measures how strongly the variables are correlated, or change in relation to each other. If two variables increase or decrease together, they are positively correlated; if they change in opposite directions, they are negatively correlated.

2. Principal components analysis (PCA): a mathematical method for transforming a set of correlated variables into a smaller set of uncorrelated variables [5].

3. Basic descriptive statistics and distributions: mean, maximum, minimum, standard error, probability density functions and the like.

The analysis of these statistics is described in Section 5.

4.4 C5.0 Decision Tree Classifier

A classifier is a mechanism that groups data into different classes. C5.0 is a classifier construction tool that uses decision trees to classify data according to their attributes. C5.0 is a supervised machine learning system: supervised means the desired outputs (or classes) of the individual training inputs are provided by the designer.

A supervised learning algorithm requires training to be done using a training dataset that specifies the class to which each set of data belongs. A simple example of a binary decision tree is shown in Figure 3. In this example, there are two attribute inputs: IPC and IPB. At the top of the decision tree, if IPC > 1, the input data will be classified as Class A. If IPC <= 1, the decision is based on IPB: if IPB > 1, the input data will be classified as Class B; if not, it will be classified as Class C. Decision trees can be substituted by decision rules (Figure 4). Rules are generated from the decision tree but are more easily understood by users. They are also more accurate than decision trees [10]. Normally, with more input attributes (features/metrics), C5.0 generates more rules and the classification is more precise. Eight attributes (the metrics described in Section 3.1) are used as inputs to the C5.0 classifier in this work. The classification error is obtained by dividing the number of misclassified cases by the total number of input cases.

Figure 3: Simple Binary Decision Tree (if IPC > 1, Class A; else if IPB > 1, Class B; else Class C)
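The tree of Figure 3 and the error definition above can be written directly in code. This is an illustrative sketch only; C5.0 induces such trees automatically from training data rather than being hand-coded:

```python
def classify(ipc, ipb):
    """Classify one observation using the binary decision tree of Figure 3."""
    if ipc > 1:
        return "A"                       # IPC > 1 -> Class A
    # IPC <= 1: the decision is based on instructions per branch (IPB)
    return "B" if ipb > 1 else "C"

def classification_error(samples, labels):
    """Misclassified cases divided by the total number of input cases."""
    wrong = sum(1 for (ipc, ipb), lab in zip(samples, labels)
                if classify(ipc, ipb) != lab)
    return wrong / len(samples)
```

With more attributes, the same idea extends to deeper trees; C5.0 additionally prunes the tree and converts it to rules.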

Rules:
Rule 1: IPC > 1 -> class A
Rule 2: IPC <= 1, IPB > 1 -> class B
Rule 3: IPC <= 1, IPB <= 1 -> class C

Figure 4: Rules Generated from the Decision Tree in Figure 3

Sampling and Experimental Methods

After examining the IPC behavior of the complete execution for all benchmarks, we set an appropriate sampling size that enables us to clearly observe the steady state behavior. The simulator outputs the metric values after every fixed instruction interval when executing the benchmarks. The values that are recorded reflect the interval rather than the cumulative behavior. An interval is defined to be 100 million instructions for the SPEC2000 benchmarks. For the Berkeley Multimedia benchmarks, we sample at every one million instruction interval until program completion because they execute a relatively small number of instructions. In this work, we use large sample sizes. Almost 500 observations (data points) are collected for each SPEC2000 benchmark simulation. More than 88 observations

are collected for each Multimedia benchmark simulation. Several techniques are used to find the distinguishing characteristics between the SPEC2000INT, SPEC2000FP and Berkeley Multimedia benchmark suites:

1. IPC behavior: IPC time series behaviors are plotted and examined. Five different types of behavior are classified: constant, periodic, irregular, constant/periodic and constant/irregular (see Section 5.1).

2. Feature correlation: the correlation between two different features (metrics) is calculated for all of the benchmark suites and for all metric/feature pairs. Highly correlated feature pairs are identified for each benchmark suite (see Section 5.2).

3. Principal Component Analysis (PCA): principal components are extracted from the eight metrics we collected. The first two principal components are plotted and classified according to their pattern. The first component is analyzed and classified as a time series, and an FFT is applied to the first component to identify distinguishing characteristics (see Section 5.3).

4. Instruction and branch composition: the instruction and branch mix of every benchmark is examined. The PDF of instructions and branches is used to examine the differences between benchmark suites (see Section 5.4).

5. Decision tree classification: the decision tree tool C5.0 is selected to classify the benchmark suites. Four metrics are chosen from the original eight metrics collected during simulation as the inputs to an efficient classifier, and the choice is supported by a series of experiments (see Section 6).
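One way per-interval values like those described above can be derived from a simulator's cumulative counters is by differencing consecutive samples. A minimal sketch with illustrative names (the thesis modifies the simulators to emit interval statistics directly):

```python
def interval_values(cumulative):
    """Convert cumulative per-sample counter values into per-interval deltas.

    `cumulative` holds a counter recorded at every fixed instruction
    interval (100M instructions for SPEC2000, 1M for Berkeley Multimedia).
    """
    return [cur - prev for prev, cur in zip(cumulative, cumulative[1:])]

def interval_ipc(insn_counts, cycle_counts):
    """Per-interval IPC from cumulative instruction and cycle counters."""
    d_insn = interval_values(insn_counts)
    d_cyc = interval_values(cycle_counts)
    return [i / c for i, c in zip(d_insn, d_cyc)]
```

Each resulting value then describes only its own interval, which is what the time-series analyses in Section 5 require.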

5 CHARACTERIZATION ANALYSIS

In this section, the results of the various methods used to characterize the benchmarks are reported and analyzed. First, we classify the workloads according to their IPC behavior. Then we perform the feature correlation analysis for each benchmark suite. We continue with the principal components analysis (PCA) on the collected data. Finally, we describe the differences in instruction and branch mixes between the SPEC2000INT, SPEC2000FP and Berkeley Multimedia benchmark suites.

5.1 IPC Behavior

IPC (instructions committed per cycle) is a composite measurement of processor performance and an important metric in micro-architecture analysis. The higher the IPC, the higher the processor throughput. By examining IPC behavior, three primary execution phases are observed in the benchmarks: Initialization-Steady-Finish. In the Initialization phase, IPC behavior is not stable and should not be used in research [27]. In the Steady phase, the IPC behavior reflects the true benchmark behavior, and it is the part we would like to study. The Finish phase is the final part of the program before program completion. The remaining IPC behavior types are also studied and classified in the following sections. Most of the benchmarks in SPEC2000 commit more than 50 billion (50B) instructions; however, Gcc commits a total of 46B instructions and Art 45B, so the number of sampling points for these two benchmarks is less than

500 (50 billion / 100 million).

Initialization

When the benchmarks are executing on the processor, the processor needs time to warm up at the beginning of benchmark execution; this is called the initialization time (Figure 5). During the Initialization phase, the benchmark behavior is unstable because the components of the micro-architecture are empty at the start of execution. During initialization these structures are filled, and then steady state behavior begins. After the Initialization phase, the program execution remains in a steady state until the Finish phase. The Initialization phase should not be included in the dataset to study because it is not indicative of the true or interesting behavior of the program. In previous work [27], the Start-steady points (beginning of the Steady phase) of the SPEC95 benchmarks are studied. When the IPC shows an observable steady behavior, we define that point as the Start-steady point of that benchmark. The Initialization and descriptive IPC statistics are presented in Table 5.

Classification of IPC Behavior

We classify five types of benchmark IPC behavior in this work: constant, periodic, irregular, constant/periodic and constant/irregular. Short examples of these behaviors are shown in Figures 6-10. The X-axis is the number of instructions; the Y-axis measures the value of the metrics (features). IPC and IPB are normalized based on their

maximum value of all benchmarks. RUUU, IFQU and LSQU are the utilizations of the RUU, IFQ and LSQ, respectively; they are normalized based on their defined queue sizes. I, D and L2 are the cache miss ratios of the I-cache, D-cache and L2 cache, respectively.

Figure 5: Example of the initialization phase of Swim (IPC vs. number of instructions)

These feature behavior plots are used to find the Start-steady point and to study the performance of the benchmarks on the micro-architecture. Some of these plots are truncated for ease of display and some are not. Table 5 lists the five classes of IPC behavior and statistics for all benchmarks. Start-steady is the beginning point of the Steady phase. For example, the Start-steady point of Swim is five. This means that the IPC begins its Steady phase at 500 million

instructions. Behavior shows the type of IPC behavior in the Steady phase. N is the number of sampling points in the Steady phase; IPC SD is the standard deviation of the IPC Mean; 95%CI is the 95% confidence interval of the IPC Mean; (CI/IPC)% is the 95% confidence interval of the IPC Mean divided by the IPC Mean. The units of Start-steady are 100 million instructions for SPEC CPU2000 and one million instructions for Berkeley Multimedia. The bold numbers in the (CI/IPC)% column are those less than ten, which correspond to Constant IPC behavior.

From Table 5, we can observe that the Start-steady point and the Behavior vary across the benchmark suites. To classify the behaviors, we calculate the value of (CI/IPC) because it shows the variability of the IPC. We define a threshold of 5% (CI/IPC) to reflect low IPC variability, and a threshold of 10% (CI/IPC) to reflect high IPC variability. These behavior classification rules are shown in Table 6. We define and describe the classification rules as follows:

If (CI/IPC) < 5%, the IPC behavior is almost a straight line, so it is classified as "Constant". Twolf, Equake, Ghostscript, Mpg123, Gsm dec and Gsm en fall into this category.

If 5% < (CI/IPC) < 10%, the IPC behavior is not a straight line, but the IPC variability is low. It shows a combination of two behaviors: "Con/Per", constant behavior combined with periodic behavior, or "Con/Irre", constant behavior combined with irregular behavior. Eon, Mesa and Art fall into these categories.

Table 5: IPC Behavior Classification Table (the Start-steady, N, IPC Mean, IPC SD, 95%CI and (CI/IPC)% values are omitted)

Program          Class          Behavior
164.Gzip         SPEC2000INT    Periodic
Gcc              SPEC2000INT    Periodic
Vortex           SPEC2000INT    Periodic
Mcf              SPEC2000INT    Periodic
Bzip2            SPEC2000INT    Periodic
Swim             SPEC2000FP     Periodic
Mgrid            SPEC2000FP     Periodic
Applu            SPEC2000FP     Periodic
Ammp             SPEC2000FP     Periodic
Galgel           SPEC2000FP     Periodic
Lucas            SPEC2000FP     Periodic
Apsi             SPEC2000FP     Periodic
MPEG2 Dec DVD    Multimedia     Periodic
MPEG2 Enc DVD    Multimedia     Periodic
Crafty           SPEC2000INT    Irregular
Vpr              SPEC2000INT    Irregular
Parser           SPEC2000INT    Irregular
Twolf            SPEC2000INT    Constant
Equake           SPEC2000FP     Constant
Ghostscript      Multimedia     Constant
Mpg123           Multimedia     Constant
GSM Dec          Multimedia     Constant
GSM Enc          Multimedia     Constant
Eon              SPEC2000INT    Con/Per
Mesa             SPEC2000FP     Con/Per
Art              SPEC2000FP     Con/Irre

If (CI/IPC) > 10%, the IPC variability is too high to be categorized as constant behavior. If there is observable periodic behavior in the plot, we classify it as "Periodic"; if there is observable irregular behavior in the plot, we classify it as "Irregular".

Table 6: IPC Behavior Classification Rules

(CI/IPC)%              Behavior    Observation
(CI/IPC) < 5%          Constant    observable straight-line behavior
5% < (CI/IPC) < 10%    Con/Per     observable periodic behavior, low IPC variability
                       Con/Irre    observable irregular behavior, low IPC variability
(CI/IPC) > 10%         Periodic    observable periodic behavior, high IPC variability
                       Irregular   observable irregular behavior, high IPC variability

As a special case, Gcc looks like Irregular behavior with high IPC variability. However, it is considered Periodic rather than Irregular because it has two large periods in the entire IPC behavior plot, as shown in Figure 11. According to the above analysis of IPC behavior, we observe that the SPEC2000FP benchmarks are mostly periodic with high IPC variability: seven of the ten SPEC2000FP benchmarks are classified as Periodic. Moreover, all of the benchmarks classified as Irregular are SPEC2000INT benchmarks, and four of the six Berkeley Multimedia benchmarks are classified as Constant. These observations reveal interesting characteristics in each benchmark suite and motivate a deeper study of their distinguishing characteristics.
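The rules of Table 6 translate directly into a classification function. In this sketch, the 95% CI half-width is assumed to be the normal approximation 1.96*SD/sqrt(N) (the thesis computes its statistics in SAS), and the periodic-versus-irregular distinction is supplied by visual inspection of the plot:

```python
import math

def ci_over_ipc(samples):
    """(CI/IPC)%: 95% confidence interval of the mean divided by the mean."""
    n = len(samples)
    mean = sum(samples) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    ci = 1.96 * sd / math.sqrt(n)        # assumed normal-approximation half-width
    return 100.0 * ci / mean

def classify_behavior(samples, looks_periodic):
    """Apply the Table 6 rules; `looks_periodic` comes from inspecting the plot."""
    v = ci_over_ipc(samples)
    if v < 5:
        return "Constant"
    if v < 10:
        return "Con/Per" if looks_periodic else "Con/Irre"
    return "Periodic" if looks_periodic else "Irregular"
```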

Figure 6: Periodic IPC Behavior (Swim)

Figure 7: Irregular IPC Behavior (Parser)

Figure 8: Constant IPC Behavior (Twolf)

Figure 9: Con/Per IPC Behavior (Mesa)

Figure 10: Con/Irre IPC Behavior (Art)

Figure 11: Entire IPC behavior of Gcc (Period 1 and Period 2 marked)

5.2 Feature Correlation

This section describes the methodology and analysis of the correlation between the features (or metrics) that are collected from sim-outorder. We present the method used to calculate the Pearson correlation coefficient and continue with an analysis of the standard correlation coefficient. We then perform a statistical analysis, the two independent t-test, on the mean of corr (MOC) of the benchmark suites and the mean of pair corr (MOPC) of the highly correlated feature pairs to find the distinguishing characteristics. Finally, we study the relation between IPC behavior and MOC.

Pearson Correlation Coefficient

According to the behavior noted in the eight features we chose to examine (shown in Figures 6-10), we conclude that there is correlation between these features. Some features increase or decrease together, showing a high positive correlation; some features have opposite behavior (one feature increases while another decreases, or vice-versa), showing a high negative correlation; some features are unaffected by each other, showing low correlation (positive or negative). The Pearson correlation coefficient (r) is calculated in this work, which requires both variables to lie on an ordinal scale. The Pearson correlation coefficient lies in the range -1 <= r <= 1, where -1 indicates the strongest negative correlation and 1 indicates the strongest positive correlation.

Suppose there are two time series variables, x and y. Let $\bar{x}$ be the mean value of x and $\bar{y}$ be the mean value of y. Let n be the number of data points for the variables. The Pearson correlation coefficient of these two variables is computed as [23]:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \qquad (1) $$

$$ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (2) $$

$$ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (3) $$

$$ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (4) $$

All of the correlation coefficients of the feature pairs (28 pairs) are calculated and tabulated. The feature correlation coefficient tables for the SPEC2000INT, SPEC2000FP and Berkeley Multimedia benchmarks are listed in Appendix A. The example correlation coefficient table for the SPEC2000INT benchmark suite is shown in Table 7. This table shows the Pearson correlation coefficients of the different feature pairs of the SPEC2000INT benchmarks we chose to study. From the table, we see that the correlation coefficient of IFQU-I in Vortex indicates a high negative correlation between IFQU and I, while the correlation coefficient of D-L2 in Vortex indicates a low positive correlation between D and L2. The correlation coefficient tables for SPEC2000FP and Berkeley Multimedia are read in a similar manner.
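Equations (1)-(4) translate directly into code; a minimal sketch:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, following Equations (1)-(4)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # Eq. (2)
    s_xx = sum((xi - xbar) ** 2 for xi in x)                        # Eq. (3)
    s_yy = sum((yi - ybar) ** 2 for yi in y)                        # Eq. (4)
    return s_xy / math.sqrt(s_xx * s_yy)                            # Eq. (1)
```

Applying this to every pair of the eight interval time series for one benchmark yields the 28 coefficients of one row of Table 7.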

Table 7: Correlation Coefficient Table of SPEC2000INT

(Table 7 lists the Pearson correlation coefficient of each of the 28 feature pairs, from IPC-IPB through D-L2, for the ten SPEC2000INT benchmarks: 164.Gzip, Gcc, Crafty, Vortex, Twolf, Vpr, Mcf, Parser, Eon and Bzip2. Entries involving the I-cache miss ratio are N/A for Vpr, Mcf and Bzip2; the numeric values are omitted.)

In some benchmark simulations, the I-cache miss ratio is extremely small, so it is formatted as zero by Simple-Scalar. Therefore, in some cases, the correlation coefficients between the I-cache miss ratio and the other features are not computed and are shown as N/A. Each correlation coefficient corresponds to a confidence ellipse which (by specification) contains 95% of the data points. We generate the confidence ellipses in SAS. The smaller the correlation coefficient, the more scattered the data points and the more circular the confidence ellipse; the larger the correlation coefficient, the more concentrated the data points and the more the confidence ellipse approximates a line. For example, in Vortex (shown in Table 7), the absolute value of the IFQU-I correlation coefficient is the maximum among the feature pair correlation coefficients for that benchmark, so its confidence ellipse tends to be linear, as shown in Figure 12. The absolute value of the D-L2 correlation coefficient is the minimum for that benchmark, so its confidence ellipse is more circular, as shown in Figure 13. For highly correlated pairs, we can extract a principal component that represents most of the information contained in that pair. However, for a pair of metrics that are not highly correlated, extracting a principal component results in a loss of information (Principal Components Analysis is described in Section 5.3).

Standard Correlation Coefficient

To simplify the correlation coefficient data in Table 7, it is transformed to standard correlation coefficient data by applying the following rules:

Figure 12: Confidence Ellipse of the IFQU-I (Vortex)

Figure 13: Confidence Ellipse of the D-L2 (Vortex)

If the absolute value of the correlation coefficient satisfies |r| <= 0.3, the correlation is said to be very weak and is marked as 0. If 0.3 <= |r| <= 0.6, the correlation is said to be weak and is marked as 0.5. If |r| >= 0.6, the correlation is said to be strong and is marked as 1. If there is no correlation coefficient between two features, it is marked as N/A. The above thresholds are commonly used in statistics to classify correlation. The standard correlation coefficient table for the SPEC2000INT benchmarks is shown in Table 8. The tables for SPEC2000FP and Berkeley Multimedia can be found in Appendix A. We define the following abbreviations to summarize the data:

The pair sum of corr (PSOC) is the sum of the standard correlation coefficients (0, 0.5 or 1) of a particular feature pair across benchmarks. The higher the PSOC, the higher the correlation of that pair.

The workload sum of corr (WSOC) is the sum of the standard correlation coefficients of a particular benchmark.

The mean of corr (MOC) is the mean of the standard correlation coefficients of a single benchmark or a benchmark suite. The higher the MOC, the higher the correlation between the features in that benchmark or benchmark suite. The MOC of a benchmark is obtained from its WSOC divided by the number of feature pairs; for example, the MOC of Vortex = WSOC / 28. In some

cases, the standard correlation coefficient between the I-cache miss ratio and other features is N/A, so the number of effective feature pairs is 21 instead of 28. The bold numbers in the PSOC row indicate the highly correlated pairs. The bold numbers in the MOC column indicate the maximum and minimum MOC values. For example, Crafty has the lowest MOC, making it the least correlated benchmark in SPEC2000INT, while Vortex has the highest MOC, making it the most correlated benchmark in SPEC2000INT. The MOC of the SPEC2000INT suite is calculated in the same way.

We distinguish highly correlated from low correlated pairs by defining a threshold for the PSOC value: if the PSOC of a specific feature pair is greater than 80% of the total number of benchmarks, then these two features are highly correlated. For example, in SPEC2000INT the total number of benchmarks is ten, and the PSOC of the IPC-RUUU pair is nine, which is greater than eight (80% of 10). Therefore, IPC-RUUU is said to be a highly correlated feature pair. We are only interested in the highly correlated features in this work; features demonstrating low correlation are not studied. The analysis of feature pair correlation for the various benchmark suites is summarized below:

SPEC2000INT: The highly correlated feature pairs are IPC-RUUU, IPC-D and IFQU-RUUU. Vortex (periodic) has the highest MOC and Crafty (irregular) has the lowest MOC.
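The standardization rules and the MOC summary above can be sketched as follows. These are hypothetical helpers; since the stated thresholds overlap at the boundaries, the exact boundary values 0.3 and 0.6 are assigned to the weaker class here:

```python
def standardize(r):
    """Map a Pearson r to the standard correlation coefficient 0, 0.5 or 1."""
    if r is None:                 # N/A: e.g. I-cache miss ratio reported as zero
        return None
    a = abs(r)
    if a <= 0.3:
        return 0.0                # very weak
    if a <= 0.6:
        return 0.5                # weak
    return 1.0                    # strong

def moc(raw_coeffs):
    """Mean of corr for one benchmark: WSOC / number of effective pairs."""
    std = [standardize(r) for r in raw_coeffs]
    eff = [s for s in std if s is not None]   # N/A pairs are excluded
    return sum(eff) / len(eff)
```

Summing `standardize(r)` for one pair across all benchmarks of a suite gives that pair's PSOC.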

Table 8: Standard Correlation Coefficient Table of SPEC2000INT

(Table 8 gives, for each SPEC2000INT benchmark and each of the 28 feature pairs, the standard correlation coefficient 0, 0.5 or 1 obtained from Table 7. The PSOC row sums each pair's coefficients across the ten benchmarks, and the WSOC and MOC columns summarize each benchmark. Entries involving the I-cache miss ratio are N/A for Vpr, Mcf and Bzip2; the individual values are omitted.)

SPEC2000FP: The highly correlated feature pairs are IPC-IPB, IPC-D, IPC-L2, IPB-IFQU, IPB-RUUU, LSQU-L2 and D-L2. Equake (constant) has the highest MOC of 1.0 and Lucas (periodic) has the lowest MOC.

Berkeley Multimedia: The only highly correlated feature pair is IFQU-RUUU. Gsm dec (constant) has the highest MOC and Gsm en (constant) has the lowest MOC.

According to the above analysis, we observe that the number of highly correlated feature pairs in SPEC2000FP is larger than in the other suites. Therefore, we conclude that the features of SPEC2000FP are more correlated than those in the other benchmark suites. The Berkeley Multimedia suite has only one highly correlated feature pair, the fewest among all the suites.

Two Independent t-test

As shown in the standard correlation tables in Appendix A, the MOC values of the three benchmark suites, SPEC2000INT, SPEC2000FP and Berkeley Multimedia, are calculated.

Using the two independent t-test [23] as shown below, we can determine that the MOC of SPEC2000FP is statistically larger than that of Berkeley Multimedia. However, we cannot show that the MOC of SPEC2000FP is statistically larger than that of SPEC2000INT, or that the MOC of SPEC2000INT is statistically larger than that of Berkeley Multimedia.

Let $\mu_1$ be the population MOC of SPEC2000FP and $\mu_2$ be the population MOC of Berkeley Multimedia. We formulate the null hypothesis $H_0: \mu_1 - \mu_2 = 0$ (the MOCs of these two benchmark suites are not statistically different) and the alternative hypothesis $H_a: \mu_1 - \mu_2 \neq 0$ (the MOCs of these two benchmark suites are statistically different). Let $n_1, n_2$ be the number of benchmarks we choose to study in SPEC2000FP and Berkeley Multimedia, respectively; in this case $n_1 = 10$, $n_2 = 6$. Let $y_i$ and $y_j$ be the MOC of a single benchmark in SPEC2000FP and Berkeley Multimedia, respectively. Let $\bar{y}_1$ be the sample MOC of SPEC2000FP and $\bar{y}_2$ be the sample MOC of Berkeley Multimedia.

Pooled standard deviation:

$$ S_p^2 = \frac{\sum_{i=1}^{n_1} (y_i - \bar{y}_1)^2 + \sum_{j=1}^{n_2} (y_j - \bar{y}_2)^2}{(n_1 - 1) + (n_2 - 1)} \qquad (5) $$

Test statistic:

$$ t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\frac{S_p^2}{n_1} + \frac{S_p^2}{n_2}}} \qquad (6) $$

Substituting the values of $n_1$, $n_2$, $\bar{y}_1$, $\bar{y}_2$ and $S_p$ gives the t-value for this case. Let the significance level be 0.05; then $t_{0.05,\,n_1+n_2-2} = 1.761$. Therefore, if t > 1.761, t enters the region of rejection and we can reject the null hypothesis; if t <= 1.761, the null hypothesis cannot be rejected. According to the above computation, the computed t-value is greater than 1.761, so the null hypothesis is rejected. We can conclude that the MOC of SPEC2000FP is statistically larger than that of Berkeley Multimedia with 95% confidence. Thus, the features of SPEC2000FP are more correlated than those of Berkeley Multimedia.

An additional comparison between the MOCs of SPEC2000INT and SPEC2000FP results in a computed t-value of less than 1.761. This implies that we fail to reject the null hypothesis; therefore, we conclude that they are not statistically different. The same result occurs for SPEC2000INT and Berkeley Multimedia.

Mean of Pair Correlation (MOPC)

As described in Section 5.2.2, there are different highly correlated feature pairs in each benchmark suite. We believe that some of these feature pairs can be shown to be statistically more highly correlated than those of other

benchmark suites. In order to compare the means of the standard correlation coefficients of these feature pairs, we calculate the mean of pair corr (MOPC) of the highly correlated feature pairs, as listed in Table 9. The bold numbers indicate the highly correlated pairs in each benchmark suite.

Table 9: Standard Coefficient Table of Highly Correlated Feature Pairs

(Table 9 lists, for every benchmark in SPEC2000INT, SPEC2000FP and Berkeley Multimedia, the standard correlation coefficients of the feature pairs IPC-RUUU, IPC-D, IFQU-RUUU, IPC-L2, IPB-IFQU, IPB-RUUU, LSQU-L2, D-L2 and IPC-IPB, together with the PSOC and MOPC of each pair per suite; the individual values are omitted.)

MOPC is obtained from the PSOC divided by the number of benchmarks in that benchmark suite. Using the two independent t-test, we successfully show with 95% confidence that particular MOPCs of one benchmark suite are statistically larger or smaller than those of another benchmark suite. The MOPC comparisons between the three benchmark suites are listed in Table 10. In the comparison of SPEC2000INT vs SPEC2000FP, the result for IPC-RUUU listed as "Large" indicates that the MOPC of IPC-RUUU for SPEC2000INT is statistically larger than that for SPEC2000FP. Similarly, the result for IPB-RUUU listed as "Small" indicates that the MOPC of IPB-RUUU for SPEC2000INT is statistically smaller than that for SPEC2000FP.

We hypothesize that the MOPC of IPC-RUUU for SPEC2000INT is statistically larger than that for SPEC2000FP because there are more data dependencies in integer workloads than in FP workloads. These dependencies keep the reservation stations in the RUU highly utilized, which makes the IPC more correlated with RUUU than with the other features (metrics). In contrast, FP programs normally operate on large amounts of data, which makes the IPC in FP workloads highly correlated with the D-cache and L2 cache miss ratios. Moreover, FP programs are loop intensive, so they execute more branches, which makes the correlation between IPC and IPB high.
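The two independent t-test of Equations (5) and (6), used for both the MOC and the MOPC comparisons, can be sketched as follows (an illustrative implementation, not the SAS procedure used in the thesis):

```python
import math

def pooled_t(sample1, sample2):
    """Two-sample t statistic with pooled variance, Equations (5)-(6)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ss1 = sum((y - m1) ** 2 for y in sample1)
    ss2 = sum((y - m2) ** 2 for y in sample2)
    sp2 = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))          # pooled variance, Eq. (5)
    return (m1 - m2) / math.sqrt(sp2 / n1 + sp2 / n2)  # test statistic, Eq. (6)
```

The resulting t is compared against the critical value for n1 + n2 - 2 degrees of freedom (1.761 for the 10-vs-6 suite comparison at the 0.05 level).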

Table 10: MOPC Comparison

(Table 10 compares, for the feature pairs IPC-RUUU, IPC-D, IFQU-RUUU, IPC-L2, IPB-IFQU, IPB-RUUU, LSQU-L2, D-L2 and IPC-IPB, the MOPCs of SPEC2000INT vs SPEC2000FP, SPEC2000FP vs Multimedia, and Multimedia vs SPEC2000INT. Each entry is Large or Small where the difference is statistically significant.)

Benchmark IPC Behavior vs MOC

Here we study the relation between IPC behavior and the mean of the standard correlation coefficients (MOC). We observe that the IPC behavior of the benchmarks with the maximum MOC is either constant or periodic, while the IPC behavior of the benchmarks with the minimum MOC can be irregular, constant or periodic, as shown in Table 11. The bold entries indicate the maximum and minimum MOC of each benchmark suite. One might expect the IPC behavior to represent the behavior of the other features as well, so that the correlations between features would differ across IPC behavior classes; in particular, periodic IPC behavior might correspond to higher correlation than irregular or constant behavior. However, since we observe no obvious relationship between the maximum/minimum MOC and the IPC behavior, we conclude that the amount or level of feature correlation cannot be determined from the IPC behavior class. For example, both Gsm dec and Gsm en in the Berkeley Multimedia suite have constant behavior (Figures 14, 15), yet the former has the maximum MOC and the latter the minimum.

Table 11: Benchmark IPC Behavior vs MOC (most MOC values omitted)

Program          Class          Behavior    MOC    Max/Min
164.Gzip         SPEC2000INT    Periodic
Gcc              SPEC2000INT    Periodic
Vortex           SPEC2000INT    Periodic           Max
181.Mcf          SPEC2000INT    Periodic
Bzip2            SPEC2000INT    Periodic
Swim             SPEC2000FP     Periodic
Mgrid            SPEC2000FP     Periodic
Applu            SPEC2000FP     Periodic
Ammp             SPEC2000FP     Periodic
Galgel           SPEC2000FP     Periodic
Lucas            SPEC2000FP     Periodic           Min
301.Apsi         SPEC2000FP     Periodic
MPEG2 Dec DVD    Multimedia     Periodic
MPEG2 Enc DVD    Multimedia     Periodic
Crafty           SPEC2000INT    Irregular          Min
175.Vpr          SPEC2000INT    Irregular
Parser           SPEC2000INT    Irregular
Twolf            SPEC2000INT    Constant
Equake           SPEC2000FP     Constant    1      Max
Ghostscript      Multimedia     Constant    0.5
Mpg123           Multimedia     Constant
GSM Dec          Multimedia     Constant           Max
GSM Enc          Multimedia     Constant           Min
252.Eon          SPEC2000INT    Con/Per
Mesa             SPEC2000FP     Con/Per
Art              SPEC2000FP     Con/Irre

Figure 14: Feature Behavior of GSM Dec (Berkeley Multimedia)

Figure 15: Feature Behavior of GSM Enc (Berkeley Multimedia)

5.3 Principal Components Analysis (PCA)

As described in Section 5.2, correlation exists between the eight metrics measured in our work, which makes it difficult to find the workload characteristics that affect program behavior [2]. Therefore, we calculate and study the principal components, because they are uncorrelated (contain no information overlap) and represent almost all of the variance in the original dataset [12]. We try to find the differences and similarities between the benchmark suites by analyzing the PCA results together with the previous classification results. In this section, the method used to calculate the principal component scores is introduced. The first and second principal components are plotted and common behavior is noted. Then the first component is analyzed as a time series, and its frequency content is studied using the Fast Fourier Transform (FFT). Finally, we form a conclusion with respect to these techniques and their ability to distinguish between benchmark classes.

Calculating Principal Components

Commonly used feature extraction methods include Principal Components Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA) and Nonlinear Discriminant Analysis (NDA) [19]. PCA is one of the feature extraction methods widely used in statistical analysis [5, 6]. It is a mathematical method for transforming a set of

correlated variables into a smaller set of uncorrelated variables. The objectives of PCA are [5]:

1. Reducing the dimension of the dataset by extracting new variables from the correlated variables.

2. Creating new, meaningful variables.

For two highly correlated variables V1 and V2, PCA finds the representative direction and calculates the first principal component to represent V1 and V2. As shown in Figure 16, the first principal component direction indicates the main trend in the data points and contains most of the information; the second principal component direction represents only a minor trend in the data.

Figure 16: Simple Example of the Principal Components (the first and second principal component directions in the V1-V2 plane)

The principal components are the variables extracted from the original dataset that are used to represent the main variance between the original variables without losing

important information. The principal components corresponding to large eigenvalues (λ) represent more information in the dataset than the others; the first principal component has the largest eigenvalue (λ_1). The characteristics of the principal components are [5]:

1. Each principal component is uncorrelated with the others.

2. The first component (component 1) represents the most information in the original dataset.

3. The remaining principal components (component 2, component 3, component 4, ...) represent a decreasing percentage of the information in the original dataset.

The variance represented by the principal components of Gzip is shown in Figure 17. The first four components of Gzip represent 99.26% of the total variance, and the variance represented decreases for each consecutive principal component. The bars show the percentage of the total variance represented by each component, and the thin line shows the cumulative variance represented.

There are two methods for calculating the principal components of a dataset: (i) the covariance matrix method, for variables on an equal footing, and (ii) the correlation matrix method, for variables not on an equal footing. In our experiment, the variables are not on an equal footing; for example, IPC ranges from 0 to 3 while IPB ranges from 0 to 45. It is necessary to use standardized data (Z scores) in this case [5]. Hence, the correlation matrix method is adopted.

Figure 17: Variance Represented by the Principal Components of Gzip

We assume the correlation matrix is X and place it into the eigenvalue equation Xa = λa. The eigenvalues are $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_X$, with corresponding eigenvectors $a_1, a_2, a_3, \ldots, a_X$. Let $x_{ri}$ be the rth observation of the ith original variable. Let $\mu_i$ be the population mean of the ith original variable, $\mu_i = \bar{x}_i$. Let $\sigma_{ii}$ be the population variance of the ith original variable. Let $z_{ri}$ be the Z score of the rth observation of the ith original variable, and let $z_r$ be the Z score vector for the rth observation. The ith principal component score $y_{ri}$ for the rth observation can be calculated from the equations:

y_ri = a_i′ z_r,  for i = 1, 2, …, X; r = 1, 2, …, N   (7)

z_ri = (x_ri − μ_i) / σ_ii,  for i = 1, 2, …, X; r = 1, 2, …, N   (8)

As shown above, the eigenvalues λ1 ≥ λ2 ≥ … ≥ λX correspond to the eigenvectors a1, a2, a3, …, aX. Each eigenvector expresses the relationship between its principal component and all of the original variables: variables whose eigenvector elements are larger in absolute value than the others have a strong relationship with that principal component.

First Two Principal Components

Most of the information in the original data can be represented by the first two principal components, with the first component representing more information than the second. The percentage of information in the original dataset represented by the first two principal components is shown in Table 12. Component1% and Component2% give the percentage of information represented by each component; Sum% gives the cumulative percentage represented by both. The bold numbers in Sum% indicate the maximum and minimum values. In the best case, PC1 and PC2 of Equake represent 99.63% of the information in the original data; in the worst case, PC1 and PC2 of Mpg123 represent 66.28%. The percentages were generated by SAS.
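Equations (7) and (8) can be sketched numerically, as a hypothetical illustration on random data (the actual scores in this work were generated by SAS). The score columns come out mutually uncorrelated, and the variance of the ith score column recovers λ_i:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 4))             # hypothetical N samples of X variables

# Equation (8): Z scores using the population mean and standard deviation.
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Eigen decomposition of the correlation matrix: X a = lambda a.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(x, rowvar=False))
order = np.argsort(eigvals)[::-1]         # sort so lambda_1 >= lambda_2 >= ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Equation (7): principal component scores y_ri = a_i' z_r.
scores = z @ eigvecs

# Column i of the scores has variance lambda_i.
print(np.allclose(scores.var(axis=0), eigvals))
```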

Figures 18-20 show PCA plots of the first (PC1) and second (PC2) principal components of Crafty (SPEC2000INT), Applu (SPEC2000FP), and Ghostscript (Berkeley Multimedia).

Table 12: Percentage of Information Represented by PC1 and PC2

SPEC2000INT          Component1%   Component2%   Sum%
164.Gzip             52.52%        35.89%        88.41%
176.Gcc              48.77%        32.89%        81.66%
186.Crafty           60.77%        14.10%        74.87%
255.Vortex           77.34%        17.37%        94.71%
300.Twolf            52.58%        38.91%        91.49%
175.Vpr              62.37%        28.36%        90.74%
181.Mcf              74.69%        13.04%        87.73%
197.Parser           54.88%        19.96%        74.84%
252.Eon              66.75%        13.97%        80.72%
256.Bzip2            —             36.30%        81.07%

SPEC2000FP
171.Swim             69.57%        18.73%        88.30%
183.Equake           94.62%        5.00%         99.63%
172.Mgrid            54.51%        27.74%        82.25%
173.Applu            49.88%        34.24%        84.13%
188.Ammp             62.08%        28.43%        90.51%
177.Mesa             73.44%        21.65%        95.09%
178.Galgel           46.16%        32.39%        78.55%
179.Art              73.41%        25.22%        98.63%
189.Lucas            46.23%        26.05%        72.28%
301.Apsi             77.63%        13.94%        91.57%

Berkeley Multimedia
Ghostscript          53.88%        23.49%        77.38%
Mpg123               —             18.52%        66.28%
GSM Dec              74.87%        18.13%        93.00%
GSM Enc              38.24%        33.25%        71.50%
MPEG2 Dec DVD        50.73%        21.04%        71.77%
MPEG2 Enc DVD        47.15%        25.34%        72.49%

Similar to the plot shown in Figure 18, most of the SPEC2000INT benchmark PCA plots (seven of ten) are characterized by scattered and separated data points

(these can be seen in Appendix B). Their data points are spread over a large area. However, Mcf, Eon, and Bzip2 (all integer workloads) show a combination of loop and scattered behavior (Figures B5, 21, 22). Both Vortex (maximum Sum%) and Parser (minimum Sum%) show scattered behavior, although the Vortex data points appear clustered, as shown in Figures 23 and 24.

In contrast, most of the SPEC2000FP benchmark PCA plots (six of ten) are characterized by looping data points similar to Figure 19; their data points are concentrated on lines or within a small area. However, Art and Apsi are characterized by scattered data points (Figures 25, B9), and Mesa and Galgel show a combination of loop and scattered behavior (Figures 26, 27). Both Equake (maximum Sum%) and Lucas (minimum Sum%) show loop behavior, as shown in Figures 28 and 29.

The Berkeley Multimedia benchmarks behave similarly to SPEC2000INT, as evidenced by Figure 20: their PCA data points are scattered and spread over a large area. Both GSM Dec (maximum Sum%) and Mpg123 (minimum Sum%) show scattered behavior, as shown in Figures 30 and 31.

As shown in Table 13, the PCA plots of the studied benchmarks are classified into three categories: scattered, loop, and loop/scattered. The IPC behaviors of the benchmarks are also listed in the table. We observe that most of the non-periodic IPC behaviors, such as Irregular, Con/Irreg, and Constant, correspond to the scattered class; this motivates us to look for a correlation between non-periodic IPC behavior and scattered PCA plots. In contrast, almost all of the benchmarks classified as loop or loop/scattered PCA correspond to IPC behaviors that

Figure 18: PCA Plot of Crafty (SPEC2000INT)

Figure 19: PCA Plot of Applu (SPEC2000FP)

Figure 20: PCA Plot of Ghostscript (Berkeley Multimedia)

Figure 21: PCA Plot of Eon (SPEC2000INT)

Figure 22: PCA Plot of Bzip2 (SPEC2000INT)

Figure 23: PCA Plot of Vortex (SPEC2000INT)

Figure 24: PCA Plot of Parser (SPEC2000INT)

Figure 25: PCA Plot of Art (SPEC2000FP)

Figure 26: PCA Plot of Mesa (SPEC2000FP)

Figure 27: PCA Plot of Galgel (SPEC2000FP)

Figure 28: PCA Plot of Equake (SPEC2000FP)

Figure 29: PCA Plot of Lucas (SPEC2000FP)

Figure 30: PCA Plot of GSM Dec (Berkeley Multimedia)

Figure 31: PCA Plot of Mpg123 (Berkeley Multimedia)

Table 13: PCA Plot Classification

Program          Class          PCA Plot    IPC Behavior
164.Gzip         SPEC2000INT    scattered   Periodic
176.Gcc          SPEC2000INT    scattered   Periodic
255.Vortex       SPEC2000INT    scattered   Periodic
301.Apsi         SPEC2000FP     scattered   Periodic
MPEG2 Dec DVD    Multimedia     scattered   Periodic
MPEG2 Enc DVD    Multimedia     scattered   Periodic
186.Crafty       SPEC2000INT    scattered   Irregular
175.Vpr          SPEC2000INT    scattered   Irregular
197.Parser       SPEC2000INT    scattered   Irregular
179.Art          SPEC2000FP     scattered   Con/Irreg
300.Twolf        SPEC2000INT    scattered   Constant
Ghostscript      Multimedia     scattered   Constant
Mpg123           Multimedia     scattered   Constant
GSM Enc          Multimedia     scattered   Constant
GSM Dec          Multimedia     scattered   Constant

171.Swim         SPEC2000FP     loop            Periodic
172.Mgrid        SPEC2000FP     loop            Periodic
173.Applu        SPEC2000FP     loop            Periodic
188.Ammp         SPEC2000FP     loop            Periodic
189.Lucas        SPEC2000FP     loop            Periodic
183.Equake       SPEC2000FP     loop            Constant
256.Bzip2        SPEC2000INT    loop/scattered  Periodic
181.Mcf          SPEC2000INT    loop/scattered  Periodic
178.Galgel       SPEC2000FP     loop/scattered  Periodic
252.Eon          SPEC2000INT    loop/scattered  Con/Per
177.Mesa         SPEC2000FP     loop/scattered  Con/Per

are periodic, such as Periodic and Con/Per. Therefore, we believe that a correlation exists between periodic IPC behavior and loop PCA plots. Table 14 summarizes the correlation between the PCA plot classes and IPC behavior.

Table 14: PCA Plot Correlation

PCA Plot             IPC Behavior
scattered (15)       Periodic (6), Irregular (3), Constant (5), Con/Irreg (1)
loop (6)             Periodic (5), Constant (1)
loop/scattered (5)   Periodic (3), Con/Per (2)

Principal Component One (PC1) Time Series Analysis

One of the purposes of PCA is to extract the main variance in the original dataset: the first principal component represents most of the variance, as described in the previous section. In order to find the distinguishing characteristics that describe the feature behavior, we analyze the time series of the first principal component (PC1) for the studied benchmarks. The PC1 time series of Crafty, Applu, and Ghostscript are shown in Figures 32-34. The PC1 time series of Crafty (SPEC2000INT) exhibits non-periodic behavior, while that of Applu (SPEC2000FP) is highly periodic. The PC1 time series of Ghostscript (Berkeley Multimedia) is similar to Crafty's, showing non-periodic behavior.

Figure 32: PC1 Time Series of Crafty (SPEC2000INT)

Figure 33: PC1 Time Series of Applu (SPEC2000FP)

Figure 34: PC1 Time Series of Ghostscript (Berkeley Multimedia)

The PC1 time series data can be summarized as follows: Five SPEC2000INT benchmarks show non-periodic behavior similar to that of Crafty. The exceptions are Eon and Bzip2, which show approximately periodic behavior (Figures 35, 36); they also have loop/scattered PCA plots (in contrast to most SPEC2000INT benchmarks, which are scattered) and periodic or con/periodic IPC behavior. Gcc, Vortex, and Mcf show periodic behavior, but at low frequencies, and are thus considered low frequency periodic (Figures 37-39).

Eight SPEC2000FP benchmarks show high frequency periodic behavior similar to that of Applu. The exceptions are Art, which shows non-periodic behavior (Figure 40), and Galgel, which shows low frequency periodic behavior (Figure 41). Five Berkeley Multimedia benchmarks show non-periodic behavior similar to that of Ghostscript; Mpeg2 DVD Enc shows periodic behavior at a low frequency and is therefore considered low frequency periodic (Figure 42).

As shown in Table 15, the PC1 time series of the studied benchmarks are classified into four classes: non-periodic, periodic, low frequency periodic, and approximately periodic. The PCA plot and IPC behavior classes are also listed so that their correlations with the PC1 time series can be studied. From Table 15, we can make the following general observations:

1. Most of the benchmarks with non-periodic PC1 time series are also classified as scattered PCA, with non-periodic IPC behavior such as Irregular and Constant.
2. Similarly, the table shows a correlation between the periodic PC1 time series, loop PCA, and periodic IPC behavior classes.
3. The low frequency periodic PC1 time series class correlates with scattered PCA and periodic IPC behavior.
4. The approximately periodic PC1 time series class correlates with loop/scattered PCA and periodic IPC behavior.
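The periodic versus non-periodic labels above were assigned by visual inspection. A hypothetical way to automate the distinction is a normalized autocorrelation test (the 0.6 threshold here is an arbitrary illustrative choice, not a value from this work):

```python
import numpy as np

def looks_periodic(series, threshold=0.6):
    """Return True when the mean-removed series' normalized autocorrelation
    shows a strong positive peak at some nonzero lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[x.size - 1:]   # lags 0 .. N-1
    ac = ac / ac[0]                                     # normalize by zero lag
    return bool(np.max(ac[1: x.size // 2]) > threshold)

t = np.arange(500)
periodic_pc1 = np.sin(2 * np.pi * t / 25)               # Applu-like: repeats every 25
noisy_pc1 = np.random.default_rng(4).normal(size=500)   # Crafty-like: no repetition
print(looks_periodic(periodic_pc1), looks_periodic(noisy_pc1))
```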

The correlations between these classes are listed in Table 16. As described at the beginning of this section, PC1 represents the main variance in the original dataset. Although the IPC behaviors vary among the benchmarks classified as non-periodic PC1 time series, most of them (nine of 11) are non-periodic, such as Irregular and Constant. Therefore, we believe a non-periodic PC1 time series indicates non-periodic characteristics in the program feature behavior. Similarly, the periodic and approximately periodic PC1 time series classes correlate with Periodic IPC behavior and indicate high frequency periodic characteristics, while the low frequency periodic PC1 time series class indicates low frequency periodic characteristics in the overall program feature behavior.

Table 15: PC1 Time Series Classification

Program          Class          PC1 Time Series         PCA Plot        IPC Behavior
164.Gzip         SPEC2000INT    non-periodic            scattered       Periodic
MPEG2 Dec DVD    Multimedia     non-periodic            scattered       Periodic
186.Crafty       SPEC2000INT    non-periodic            scattered       Irregular
175.Vpr          SPEC2000INT    non-periodic            scattered       Irregular
197.Parser       SPEC2000INT    non-periodic            scattered       Irregular
179.Art          SPEC2000FP     non-periodic            scattered       Con/Irreg
300.Twolf        SPEC2000INT    non-periodic            scattered       Constant
Ghostscript      Multimedia     non-periodic            scattered       Constant
Mpg123           Multimedia     non-periodic            scattered       Constant
GSM Enc          Multimedia     non-periodic            scattered       Constant
GSM Dec          Multimedia     non-periodic            scattered       Constant

171.Swim         SPEC2000FP     periodic                loop            Periodic
172.Mgrid        SPEC2000FP     periodic                loop            Periodic
173.Applu        SPEC2000FP     periodic                loop            Periodic
188.Ammp         SPEC2000FP     periodic                loop            Periodic
189.Lucas        SPEC2000FP     periodic                loop            Periodic
301.Apsi         SPEC2000FP     periodic                scattered       Periodic
177.Mesa         SPEC2000FP     periodic                loop/scattered  Con/Per
183.Equake       SPEC2000FP     periodic                loop            Constant
176.Gcc          SPEC2000INT    low frequency periodic  scattered       Periodic
255.Vortex       SPEC2000INT    low frequency periodic  scattered       Periodic
181.Mcf          SPEC2000INT    low frequency periodic  loop/scattered  Periodic
178.Galgel       SPEC2000FP     low frequency periodic  loop/scattered  Periodic
MPEG2 Enc DVD    Multimedia     low frequency periodic  scattered       Periodic
252.Eon          SPEC2000INT    approximately periodic  loop/scattered  Con/Per
256.Bzip2        SPEC2000INT    approximately periodic  loop/scattered  Periodic

Table 16: PC1 Time Series Correlation

PC1 Time Series             PCA Plot                                     IPC Behavior
non-periodic (11)           scattered (11)                               Periodic (2), Irregular (3), Constant (5), Con/Irreg (1)
periodic (8)                loop (6), loop/scattered (1), scattered (1)  Periodic (6), Constant (1), Con/Per (1)
low frequency periodic (5)  scattered (3), loop/scattered (2)            Periodic (5)
approximately periodic (2)  loop/scattered (2)                           Con/Per (1), Periodic (1)

Figure 35: PC1 Time Series of Eon (SPEC2000INT)

Figure 36: PC1 Time Series of Bzip2 (SPEC2000INT)

Figure 37: PC1 Time Series of Gcc (SPEC2000INT)

Figure 38: PC1 Time Series of Vortex (SPEC2000INT)

Figure 39: PC1 Time Series of Mcf (SPEC2000INT)

Figure 40: PC1 Time Series of Art (SPEC2000FP)

Figure 41: PC1 Time Series of Galgel (SPEC2000FP)

Figure 42: PC1 Time Series of Mpeg2 DVD Enc (Berkeley Multimedia)

Principal Component One (PC1) Frequency

The Fourier Transform is often used in data analysis to decompose a signal into an additive combination of scaled sinusoids of different frequencies. The Discrete Fourier Transform (DFT) is the Fourier Transform applied to a discrete data series. For a finite-length discrete sequence x[n], defined for 0 ≤ n ≤ N−1, the DFT X[k] is obtained from the equation:

X[k] = X(e^{jω}) |_{ω = 2πk/N} = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N},  0 ≤ k ≤ N−1   (9)

The Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT of a signal. By applying the FFT to the PC1 time series, we can clearly identify the frequency components that constitute the PC1 time-domain signal; by studying these frequency components in the benchmark feature behavior, we aim to identify distinguishing characteristics. Examples of PC1 frequency plots for each benchmark suite are shown in Figures 43-46.

In this work, the PC1 data points are sampled at fixed instruction intervals rather than over time. Therefore, we cannot attach a frequency unit such as Hz or MHz to the frequency spectrum, because there is no fixed time unit at which the original data points were sampled. The DFT at zero frequency, X[0], is the DC component; its magnitude represents the average of the input series. The highest positive (or negative) frequency in the DFT is called the Nyquist frequency, equal to half of the sampling frequency [24]. For example, the sampling frequency of Crafty is 500, so the Nyquist frequency of Crafty is 250 (Figure 43). If there are no frequencies in the DFT above the Nyquist frequency, the original signal can be exactly reconstructed from the DFT [24].

The high peaks in a PC1 frequency plot, characterized by noticeably larger magnitudes than the rest, indicate the representative frequencies in the PC1 time series. These representative frequency components represent the characteristic signals

that comprise the PC1 time series. To classify the PC1 frequency, we define high peaks below frequency 50 as low frequency; if the high peaks lie beyond 50, the PC1 shows high frequency. The PC1 frequency of each benchmark is classified into one of four categories: low, high, approximately low, and approximately high. The classification rules are defined in Table 17. Obviously low-magnitude (small) peaks in the PC1 frequency are treated as noise relative to the high-peak signal.

Table 17: PC1 Frequency Classification Rules

PC1 Frequency   Class                Observation
freq < 50       low                  characteristic high peaks, low noise
freq < 50       approximately low    no characteristic high peaks, high noise
freq > 50       high                 characteristic high peaks, low noise
freq > 50       approximately high   no characteristic high peaks, high noise

The PC1 frequency of Crafty (Figure 43) is characterized by a high peak close to zero, indicating that the overall feature behavior of Crafty has low frequency; Crafty is therefore classified as low. The PC1 frequency of Applu, however, shows two high peaks, around frequencies 75 and 160 (Figure 44), indicating that high frequency components constitute the overall feature behavior of Applu; Applu is therefore classified as high. We believe the low class indicates non-periodic program feature behavior and the high class indicates periodic program feature behavior. Moreover, comparing the PC1 frequencies of Crafty and Applu, we observe that the PC1

frequency of Applu is smoother than that of Crafty, and Crafty has more small peaks than Applu. These small peaks throughout the frequency spectrum indicate noise in the PC1 of Crafty, and this noise likely reflects non-periodic program behavior. The number and magnitude of the noise peaks affect how characteristic the high peaks are, and hence the classification of the PC1 frequency.

Figure 43: PC1 Frequency of Crafty (SPEC2000INT) (the DC component and Nyquist frequency are marked)
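The DC and Nyquist properties called out in Figure 43 can be checked numerically on a synthetic 500-point series (matching Crafty's sampling frequency of 500, so the Nyquist bin is 250; the dominant bin chosen below is hypothetical):

```python
import numpy as np

n = 500                                    # sampling frequency of Crafty
k_dominant = 75                            # hypothetical dominant frequency bin
rng = np.random.default_rng(3)
pc1 = (2.0 + np.sin(2 * np.pi * k_dominant * np.arange(n) / n)
           + 0.05 * rng.normal(size=n))    # synthetic PC1 stand-in

spectrum = np.abs(np.fft.fft(pc1))

dc_mean = spectrum[0] / n                  # |X[0]| / N recovers the series mean
nyquist_bin = n // 2                       # highest distinct frequency bin: 250
peak_bin = 1 + int(np.argmax(spectrum[1: nyquist_bin + 1]))
print(round(dc_mean, 3), nyquist_bin, peak_bin)
```

The dominant non-DC peak lands at the injected bin, which is how the representative frequencies in plots like Figures 43 and 44 are read off.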

Figure 44: PC1 Frequency of Applu (SPEC2000FP)

Figure 45: PC1 Frequency of Mpeg2 Dec DVD (Berkeley Multimedia)

Figure 46: PC1 Frequency of Mpg123 (Berkeley Multimedia)

The high peaks in the PC1 frequency of Mpeg2 Dec DVD are below 50; however, they are likely not characteristic because of the high noise (Figure 45). Therefore, the PC1 frequency of Mpeg2 Dec DVD is classified as approximately low. Similarly, the PC1 frequency of Mpg123 is classified as approximately high

(Figure 46). The PC1 frequency data can be summarized as follows: Seven SPEC2000INT benchmarks show low frequency similar to Crafty, while Parser shows approximately low frequency. The exceptions are Eon and Bzip2, whose PC1 frequency high peaks lie beyond 50, so they are considered high frequency (Figures 47, 48). Gcc, Vortex, and Mcf show low frequency periodic behavior in their PC1 time series plots, and their PC1 frequency high peaks are below 50 (Figures 49-51). Eight SPEC2000FP benchmarks show high frequency similar to Applu, while Art shows approximately high frequency (Figure 56). The exception is Galgel, whose PC1 shows low frequency (Figure 52). Three Berkeley Multimedia benchmarks (Ghostscript, Mpg123, and GSM Enc) have high peaks beyond 50 (Figures D11, 46, D13) but are considered approximately high frequency because of their high noise. Meanwhile, GSM Dec and Mpeg2 DVD Dec show approximately low frequency (Figures D12, 45), and Mpeg2 DVD Enc shows low frequency (Figure D14).

We observe correlations between the PC1 frequency classes and the previous classifications. As a general observation from Table 18, the PC1 frequency class low corresponds mostly to a non-periodic or low frequency periodic PC1 time series and a scattered PCA plot. The PC1 frequency class high

corresponds mostly to a periodic PC1 time series and a loop PCA plot, and the two classes approximately low/high correspond to non-periodic PC1 time series and scattered PCA plots. These correlations between the PC1 frequency and the previous classes are listed in Table 19. From this class correlation and the previous analysis of the PC1 time series and PCA plots, we conclude that the high PC1 frequency class indicates high frequency periodic characteristics in the program feature behavior, while low, approximately low, and approximately high indicate non-periodic or low frequency periodic characteristics.

Table 18: PC1 Frequency Classification

Program          Class          PC1 Frequency       PC1 Time Series         PCA Plot        IPC Behavior
164.Gzip         SPEC2000INT    low                 non-periodic            scattered       Periodic
186.Crafty       SPEC2000INT    low                 non-periodic            scattered       Irregular
175.Vpr          SPEC2000INT    low                 non-periodic            scattered       Irregular
300.Twolf        SPEC2000INT    low                 non-periodic            scattered       Constant
MPEG2 Enc DVD    Multimedia     low                 low frequency periodic  scattered       Periodic
176.Gcc          SPEC2000INT    low                 low frequency periodic  scattered       Periodic
255.Vortex       SPEC2000INT    low                 low frequency periodic  scattered       Periodic
181.Mcf          SPEC2000INT    low                 low frequency periodic  loop/scattered  Periodic
178.Galgel       SPEC2000FP     low                 low frequency periodic  loop/scattered  Periodic
171.Swim         SPEC2000FP     high                periodic                loop            Periodic
172.Mgrid        SPEC2000FP     high                periodic                loop            Periodic
173.Applu        SPEC2000FP     high                periodic                loop            Periodic
188.Ammp         SPEC2000FP     high                periodic                loop            Periodic
189.Lucas        SPEC2000FP     high                periodic                loop            Periodic
183.Equake       SPEC2000FP     high                periodic                loop            Constant

301.Apsi         SPEC2000FP     high                periodic                scattered       Periodic
177.Mesa         SPEC2000FP     high                periodic                loop/scattered  Con/Per
252.Eon          SPEC2000INT    high                approximately periodic  loop/scattered  Con/Per
256.Bzip2        SPEC2000INT    high                approximately periodic  loop/scattered  Periodic
179.Art          SPEC2000FP     approximately high  non-periodic            scattered       Con/Irreg
Ghostscript      Multimedia     approximately high  non-periodic            scattered       Constant
Mpg123           Multimedia     approximately high  non-periodic            scattered       Constant
GSM Enc          Multimedia     approximately high  non-periodic            scattered       Constant
197.Parser       SPEC2000INT    approximately low   non-periodic            scattered       Irregular
MPEG2 Dec DVD    Multimedia     approximately low   non-periodic            scattered       Periodic
GSM Dec          Multimedia     approximately low   non-periodic            scattered       Constant

Table 19: PC1 Frequency Correlation

PC1 Frequency           PC1 Time Series                               PCA Plot                                     IPC Behavior
low (9)                 non-periodic (4), low frequency periodic (5)  scattered (7), loop/scattered (2)            Periodic (6), Irregular (2), Constant (1)
high (10)               periodic (8), approximately periodic (2)      loop (6), loop/scattered (3), scattered (1)  Periodic (7), Constant (1), Con/Per (2)
approximately high (4)  non-periodic (4)                              scattered (4)                                Constant (3), Con/Irreg (1)
approximately low (3)   non-periodic (3)                              scattered (3)                                Irregular (1), Periodic (1), Constant (1)
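The Table 17 rules could be mechanized along the following lines (a hypothetical sketch: the thresholds for "characteristic" peaks and "high" noise are illustrative choices, not values used in this work):

```python
import numpy as np

def classify_pc1_frequency(mag, peak_frac=0.5, noise_frac=0.25):
    """Apply the Table 17 rules to a one-sided magnitude spectrum
    (bins 1..Nyquist, DC removed). Bins reaching peak_frac of the maximum
    count as high peaks; many bins above noise_frac of the maximum count
    as high noise, demoting the class to 'approximately'."""
    mag = np.asarray(mag, dtype=float)
    peaks = np.flatnonzero(mag >= peak_frac * mag.max()) + 1   # 1-based bins
    noisy = np.count_nonzero(mag >= noise_frac * mag.max()) > 4 * peaks.size
    label = "low" if peaks.max() < 50 else "high"
    return f"approximately {label}" if noisy else label

clean_low = np.zeros(250); clean_low[9] = 10.0      # single peak at bin 10
clean_high = np.zeros(250); clean_high[79] = 10.0   # single peak at bin 80
noisy_high = 3.0 * np.ones(250); noisy_high[79] = 10.0
print(classify_pc1_frequency(clean_low),
      classify_pc1_frequency(clean_high),
      classify_pc1_frequency(noisy_high))
```

The third case mimics a Mpg123-like spectrum: its peak lies beyond 50, but the pervasive noise demotes it to approximately high.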

Figure 47: PC1 Frequency of Eon (SPEC2000INT)

Figure 48: PC1 Frequency of Bzip2 (SPEC2000INT)

Figure 49: PC1 Frequency of Gcc (SPEC2000INT)

Figure 50: PC1 Frequency of Vortex (SPEC2000INT)

Figure 51: PC1 Frequency of Mcf (SPEC2000INT)

Figure 52: PC1 Frequency of Galgel (SPEC2000FP)

Parser (SPEC2000INT) and Art (SPEC2000FP) show interesting behavior. Their IPC behaviors are irregular (Irregular for Parser, Con/Irreg for Art), and their PC1 time series are non-periodic (Figures 53, 54). Although their PC1 frequencies are similar to those of the other benchmarks in their classes (approximately low for Parser, approximately high for Art), their noise is much higher than the others'

(Figures 55, 56). This indicates that these benchmarks are characterized by noisy behavior, which explains why Parser and Art have irregular IPC behavior.

Conclusions for Principal Component Analysis

Based on the above classification and analysis of the benchmark PCA plots, PC1 time series, and PC1 frequencies, we construct the clustering tree shown in Figure 57. We believe the PC1 frequency is the most important characteristic in our classification because it separates the benchmark suites better than the other methods.

Figure 53: PC1 Time Series of Parser (SPEC2000INT)

Figure 54: PC1 Time Series of Art (SPEC2000FP)

Figure 55: PC1 Frequency of Parser (SPEC2000INT)

Figure 56: PC1 Frequency of Art (SPEC2000FP)

The clustering tree therefore begins with the PC1 frequency classes and then expands with the other classes. From the clustering tree we observe that most of the SPEC2000INT benchmarks fall into the low frequency PC1 class, while most of the SPEC2000FP benchmarks fall into the high frequency PC1 class. The PCA characteristics of the three benchmark suites can be summarized as follows:

Seven SPEC2000INT benchmarks are classified as low for PC1 frequency, non-periodic or low frequency periodic for PC1 time series, and scattered or loop/scattered for PCA. The exceptions are Eon and Bzip2, which show high PC1 frequencies similar to the SPEC2000FP benchmarks, and Parser, which shows an approximately low PC1 frequency similar to the Berkeley Multimedia benchmarks.


Eight SPEC2000FP benchmarks are classified as high for PC1 frequency, periodic for PC1 time series, and loop or loop/scattered for PCA. The exceptions are Galgel and Art: Galgel shows a low PC1 frequency similar to the SPEC2000INT benchmarks, and Art shows an approximately high PC1 frequency similar to the Berkeley Multimedia benchmarks. Apsi is not considered an exception because it differs only partially, in its PCA plot.

Five Berkeley Multimedia benchmarks are classified as approximately high or approximately low for PC1 frequency, non-periodic for PC1 time series, and scattered for PCA. The exception is Mpeg2 Enc DVD, which shows a low PC1 frequency similar to the SPEC2000INT benchmarks.

From the above analysis, a summary can be made for the SPEC CPU2000 and Berkeley Multimedia benchmark suites, as shown in Table 20.

Table 20: Distinguishing PCA Characteristics of the Benchmark Suites

Benchmark Suite        PC1 Frequency               PC1 Time Series                               PCA Plot
SPEC2000INT            low (7)                     non-periodic (4), low frequency periodic (3)  scattered (6), loop/scattered (1)
SPEC2000FP             high (8)                    periodic (8)                                  loop (6), loop/scattered (1), scattered (1)
Berkeley Multimedia    approximately high/low (5)  non-periodic (5)                              scattered (5)

As described at the beginning of this section, PC1 represents the main variance in the data describing the program feature behavior. From the above analysis we make the general observation that, for SPEC2000INT, the PC1 frequency is low, the PC1 time series is non-periodic or low frequency periodic, and the PCA plot is scattered. We therefore believe there is sufficient evidence that the SPEC2000INT benchmarks are characterized by non-periodic or low frequency periodic program behavior. Similarly, we conclude that the SPEC2000FP benchmarks are characterized by high frequency periodic program feature behavior. Moreover, whether a Berkeley Multimedia benchmark's PC1 frequency is approximately high or approximately low, it is characterized by approximately non-periodic program behavior because of the high noise in the PC1 frequency. The program behaviors of the benchmark suites are listed in Table 21. From the above analysis we conclude that PC1 frequency is the best factor for distinguishing between the SPEC CPU2000 and Berkeley Multimedia benchmark suites.

Table 21: Benchmark Suites Program Behavior from PCA

Benchmark Suite        Program Behavior
SPEC2000INT            non-periodic, low frequency periodic
SPEC2000FP             high frequency periodic
Berkeley Multimedia    approximately non-periodic

We believe that the non-periodic or low frequency periodic program feature behaviors of the SPEC2000INT benchmarks are probably due to the large number of

function calls that are characteristic of integer programs. Different functions may be called in varying order or repeated only over long cycles; therefore, different functions may execute within a sampling interval, causing the non-periodic or low frequency periodic program feature behavior. In contrast, the high frequency periodic program feature behavior of the SPEC2000FP benchmarks is probably due to the loop-intensive behavior of FP programs: functions may be repeatedly executed within the sampling interval, causing high frequency periodic program feature behavior. This is further discussed in Section 5.4.

Moreover, contrary to our expectation, most of the benchmarks (five of six) with Constant IPC behavior are indicated as non-periodic in the PC1 analysis. We believe this is because PCA extracts and magnifies the variance in the feature behavior, even though that variance is very small in the Constant benchmarks. For example, Twolf (SPEC2000INT) is classified as Constant IPC behavior. In Figure 58, the values of D (the D-cache miss ratio) lie in the range (0.086, 0.091), a width of only 0.005, and the values of IPC also lie in a very narrow range. Both ranges are so small that the feature behaviors of IPC and D are observed as a straight line (Figure 8). However, the correlation coefficient of the pair IPC-D is very high. As described at the beginning of this section, the principal component scores are generated from the correlation matrix. If the feature behaviors of Twolf's IPC and D

were perfectly constant, there would be only a single point on their correlation plot, and hence no correlation between the two features. The principal component scores generated from the correlation matrix would then be zero, giving a constant PC1 time series that exactly represents the constant feature behaviors in the original dataset. In this work, however, the feature behaviors of Twolf are not perfectly constant, and their tiny variances produce non-zero correlation coefficients. Consequently, the correlation coefficients of different feature pairs vary, which causes the non-periodic program feature behavior seen in the PC1 analysis.

Figure 58: Confidence Ellipse of the IPC-D (Twolf)

5.4 Instruction Mix and Branch Mix Comparison

In this section, we compare the dynamic size, instruction mix, and branch mix of each benchmark, and conclude with the distinguishing characteristics. The reference input set of the SPEC CPU2000 benchmark suites is studied in this section.

Reference Instructions vs. Committed Instructions

In an out-of-order processor, instructions are executed out of order but committed in program order. Instruction commit is the final step in the instruction execution sequence, during which the register file or memory is updated [3]. The total number of committed instructions indicates the dynamic size of a benchmark. Instruction references refer to the total number of load and store instructions in the program. In this section we compare the ratio of reference instructions to committed instructions across benchmarks to determine the differences between the benchmark suites. The statistics are presented in Table 22 and Figures 59-61.

As shown in Figures 59 and 60, the Berkeley Multimedia benchmarks commit far fewer instructions than the SPEC2000 benchmarks, and the average number of committed instructions of the SPEC2000FP benchmarks is larger than that of both SPEC2000INT and Berkeley Multimedia. However, as shown in Table 22 and Figure 61, the percentage of references in the execution trace is not statistically different between the three suites. In Table 22, Ref (%) is the ratio of reference instructions to committed instructions; the bold numbers indicate the average Ref (%) in each benchmark suite.
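The Ref (%) column is simply memory references divided by committed instructions. For instance, using Gzip's committed count from Table 22 (the reference count below is hypothetical in its last digits, because the table value is truncated):

```python
# Ref (%) = reference (load/store) instructions / committed instructions.
committed_insn = 84_367_400_631           # Gzip committed count, from Table 22
memory_refs = 24_773_353_000              # hypothetical: table value is truncated
ref_pct = 100.0 * memory_refs / committed_insn
print(f"{ref_pct:.2f}%")
```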

Table 22: Total Number of Committed and Reference Instructions

SPEC2000INT        # committed insn     # memory refs       Ref (%)
164.Gzip           84,367,400,631       24,773,353,         %
176.Gcc            46,918,734,647       24,989,307,         %
186.Crafty         191,882,992,072      70,222,383,         %
255.Vortex         118,968,292,490      48,215,272,         %
300.Twolf          346,485,055,         ,856,166,           %
175.Vpr            84,068,774,803       37,053,998,         %
181.Mcf            61,867,399,707       23,055,555,         %
197.Parser         546,749,946,         ,509,826,           %
252.Eon            80,614,083,009       38,813,996,         %
256.Bzip2          108,878,091,819      40,133,615,         %
Average            167,080,077,099      60,962,347,         %

SPEC2000FP
171.Swim           225,831,349,755      74,341,590,         %
183.Equake         131,518,589,796      58,248,576,         %
172.Mgrid          419,156,002,         ,909,313,           %
173.Applu          223,883,652,171      85,459,067,         %
188.Ammp           326,548,908,         ,189,330,           %
177.Mesa           281,690,637,         ,695,830,           %
178.Galgel         409,367,101,         ,743,041,           %
179.Art            45,032,572,036       15,633,096,         %
189.Lucas          142,398,812,824      31,507,111,         %
301.Apsi           347,922,839,         ,508,010,           %
Average            255,335,046,699      96,123,496,         %

Berkeley Multimedia
Ghostscript        1,046,586,           ,186,               %
Mpg123             ,686,                ,916,               %
GSM Dec            88,246,063           15,724,             %
GSM Enc            218,938,443          75,569,             %
MPEG2 Dec DVD      663,428,             ,926,               %
MPEG2 Enc DVD      10,540,461,982       3,153,390,          %
Average            1,303,134,           ,371,               %

Figure 59: Total Number of Committed and Reference Instructions

Figure 60: Average Number of Committed and Reference Instructions

Figure 61: Ratio of Reference Instructions to Committed Instructions
(Ref (%) per benchmark across SPEC2000INT, SPEC2000FP and Berkeley Multimedia)

Instruction Mix and Branch Mix

In this section, we examine and compare the instruction mix and branch mix of all benchmarks. Table 23 shows the probability density function (PDF) of the instruction mix and branch mix that characterize the benchmarks. The bold numbers in each benchmark suite mark compositions that are observably larger than in the other suites. Figures 62 and 63 show the instruction mix; Figures 64 and 65 show the branch mix. The PDF is a mathematical function that describes the distribution of a random variable. For example, in this work, the PDF of the load instructions in Gzip is f(load) = Pr{X = load} = 33.23%, where the random variable X is the instruction type. From Table 23 and Figures 62 and 63, the following observations pertain to the instruction mix:

SPEC2000INT has a (statistically) larger average percentage of unconditional branches than the other benchmark suites. It also has, as expected, a (statistically) smaller average percentage of FP computations than SPEC2000FP.

SPEC2000FP has a (statistically) larger average percentage of FP computations than SPEC2000INT (as expected), and a (statistically) larger average percentage of conditional branches than Berkeley Multimedia.

Berkeley Multimedia has a (statistically) larger average percentage of loads and integer computations than the other benchmark suites. It also has a (statistically) smaller average percentage of stores and branches (conditional and unconditional) than the others.

From Table 23 and Figures 64 and 65, the following observations pertain to the branch mix:

SPEC2000INT has a (statistically) larger average percentage of unconditional branches (direct and indirect) than the other benchmark suites.

SPEC2000FP has a (statistically) larger average percentage of conditional direct branches than SPEC2000INT.

Berkeley Multimedia has a (statistically) larger average percentage of conditional direct branches than SPEC2000INT.
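The instruction-mix PDF used above is the empirical distribution of instruction types over the committed-instruction stream. A minimal sketch; the toy trace is illustrative, not thesis data:

```python
from collections import Counter

def instruction_mix_pdf(instr_types):
    """Empirical PDF of the instruction mix: f(t) = Pr{X = t} for each type t."""
    counts = Counter(instr_types)
    total = sum(counts.values())
    return {t: count / total for t, count in counts.items()}

# Hypothetical committed-instruction trace (illustrative only).
trace = ["load"] * 3 + ["store"] + ["int_compu"] * 4 + ["cond_branch"] * 2
pdf = instruction_mix_pdf(trace)
print(pdf["load"])  # 0.3
```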

Table 23: PDF of Instructions and Branches Composition
(instruction-mix columns: Load, Store, Uncond Branch, Cond Branch, INT Compu, FP Compu; branch-mix columns: Cond Direct, Uncond Direct, Uncond Indirect; numeric entries did not survive the transcription)

SPEC2000INT: 164.Gzip, 176.Gcc, 186.Crafty, 255.Vortex, 300.Twolf, 175.Vpr, 181.Mcf, 197.Parser, 252.Eon, 256.Bzip2, Average
SPEC2000FP: 171.Swim, 183.Equake, 172.Mgrid, 173.Applu, 188.Ammp, 177.Mesa, 178.Galgel, 179.Art, 189.Lucas, 301.Apsi, Average
Berkeley Multimedia: Ghostscript, Mpg123, GSM Dec, GSM Enc, MPEG2 Dec DVD, MPEG2 Enc DVD, Average



Figure 64: Comparison of Benchmark Branch Mix
(stacked percentages of Cond Direct, Uncond Direct and Uncond Indirect branches per benchmark)

Figure 65: Comparison of Average Benchmark Branch Mix
(per-suite average branch mix for SPEC2000INT, SPEC2000FP and Berkeley Multimedia)

Branches can be separated along two axes [8]: (1) type of transfer (conditional or unconditional): a conditional branch is taken only if its condition is satisfied, while an unconditional branch is always taken; and (2) type of target address generation (direct or indirect): the target address of a direct branch is encoded inside the instruction, whereas an indirect branch must generate its target address at run time.

As described in the Alpha ISA, conditional direct branches ("conditional branches") transfer control flow either to the next instruction (not taken) or to a backward/forward location (taken), based on the evaluation of the condition. Conditional indirect branches are not supported in the Alpha ISA [1, 8]. Unconditional direct branches ("unconditional branches") jump to a single target address without any condition. Unconditional indirect branches ("jumps") typically use the contents of a register, sometimes combined with a fixed offset, to calculate the target address at run time [1, 8]. For example, BEQ R2, Loop (branch to Loop if register R2 equals zero) is a conditional direct branch; JMP R2, R31 (after the updated PC is written into register R2, branch to the target address calculated from register R31) is an unconditional indirect branch; and BR R2, Loop (after the updated PC is written into register R2, branch to Loop) is an unconditional direct branch.
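The two axes above (type of transfer, type of target address generation) yield the three branch classes the Alpha ISA supports, which can be sketched as:

```python
def classify_branch(conditional: bool, indirect: bool) -> str:
    """Classify a branch along the two axes described above.

    The Alpha ISA has no conditional indirect branches, so that
    combination is rejected.
    """
    if conditional and indirect:
        raise ValueError("conditional indirect branches are not supported in Alpha")
    if conditional:
        return "conditional direct"        # e.g. BEQ R2, Loop
    if indirect:
        return "unconditional indirect"    # e.g. JMP R2, R31
    return "unconditional direct"          # e.g. BR R2, Loop

print(classify_branch(conditional=True, indirect=False))  # conditional direct
```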

The unconditional branch (direct and indirect) composition accounts for a larger percentage of the branch mix for SPEC2000INT than for SPEC2000FP, which can be demonstrated statistically by an independent two-sample t-test as described in Section 5.2. In the operation of unconditional branches (direct and indirect) in the Alpha ISA, shown below in simplified form, the PC of the instruction following the unconditional branch (the updated PC) is stored into the return address register (Ra) before the branch is taken [1]. This step is commonly used on function calls to save the return address of the caller. We believe that the large unconditional branch composition in the branch mix is evidence that integer programs are more procedure-call intensive.

Unconditional direct branch (BR):    Ra <- updated PC;  PC <- updated PC + 4 * sign_ext(displacement)
Unconditional indirect branch (JMP): Ra <- updated PC;  PC <- Rb

The conditional direct branch composition accounts for a larger percentage of the branch mix for SPEC2000FP than for SPEC2000INT, which can also be demonstrated statistically by an independent two-sample t-test. Conditional direct branches do not save the return address of the caller function into the return address register (Ra); instead, Ra is used to test whether the required condition is satisfied [1]. Moreover, unlike an indirect target address generated at run time, a direct target address is assigned in the program and will not be changed when the program

executes, which makes program loops possible. We believe that the large conditional direct branch composition in the branch mix is evidence that floating-point programs tend to be loop intensive.

Conditional direct branch (e.g., BEQ): if Ra == 0 then PC <- updated PC + 4 * sign_ext(displacement)

It is generally believed that integer programs have more integer computation than FP computation, and that FP programs have more FP computation than integer computation. However, this statement is not always true according to our observations. From Table 23, we can observe that the percentage of integer computations in the SPEC2000INT benchmarks is not always larger than that of floating-point computations: for example, Parser, Eon and Bzip2 show a low percentage of integer computation. Similarly, the percentage of FP computation in the SPEC2000FP benchmarks is not always larger than that of integer computations: for example, Equake, Applu and Art show a low percentage of FP computations. Therefore, we conclude that the percentages of integer and FP computations are not good metrics to distinguish the SPEC2000INT and SPEC2000FP benchmark suites.
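The suite-level comparisons in this section are supported by independent two-sample t-tests (Section 5.2). A minimal sketch of Welch's t statistic, using hypothetical per-benchmark percentages rather than the thesis data; the significance threshold would come from the t distribution at the chosen confidence level:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

# Hypothetical unconditional-branch percentages per benchmark (not thesis data).
spec_int = [12.0, 10.5, 14.2, 11.8, 13.1]
spec_fp = [3.1, 2.5, 4.0, 2.8, 3.4]

# A large positive t supports the claim that the SPEC2000INT mean is higher.
print(welch_t(spec_int, spec_fp) > 2.0)  # True
```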

The large percentage of conditional direct branches in the branch mix of the SPEC2000FP and Berkeley Multimedia benchmarks (in some cases close to 98%), in conjunction with the large percentage of unconditional branches in the SPEC2000INT benchmarks, may be a characteristic that distinguishes between these benchmark classes. However, not all of the benchmarks follow this rule. For example, Mesa in SPEC2000FP has a larger percentage of unconditional branches than conditional direct branches, while Gcc and Vpr in SPEC2000INT have a larger percentage of conditional direct branches than unconditional branches.

6 C5.0 DECISION TREE CLASSIFICATION

Classification is a statistical method used to categorize data or objects of the same type based on the similarity of their characteristics. Classification tools implement the classification process with particular algorithms or methods; commonly used classification methods include clustering, neural networks, support vector machines and decision trees. In this work, we choose C5.0 to construct our classifier. C5.0 is a decision tree classification tool, as described in Section 4.3. It is used to classify the performance data and to construct a classifier expressed as a decision tree or as sets of rules [10].

There are two broad types of machine learning: supervised and unsupervised. C5.0 constructs a supervised classifier, which needs a teacher to supply the desired output class for each training example. Unsupervised learning (for example, competitive neural networks) automatically classifies data using a winner-takes-all strategy: the weights of the winner are strengthened and those of the losers weakened, so the winner becomes more sensitive to similar inputs.

6.1 Classifier with Eight Input Features

We construct a classifier with eight input attributes, the eight features (IPC, IPB, RUUU, IFQU, LSQU, I, D and L2) sampled from ten SPEC2000INT, ten SPEC2000FP and six Berkeley Multimedia benchmarks. 90% of the dataset is randomly chosen to train the classifier and 10% is used to evaluate it. The decision

rules, which are generated from the decision tree, are used to construct the classifier [10]. The best classifier generated from our dataset has a training error of 0.1% and an evaluation error of 0.1%, where the error rate is the number of misclassified cases divided by the total number of cases in the training or evaluation dataset. The rules and error rates generated for this classifier are shown below (threshold values did not survive the transcription and are shown as …).

Rules:

Rule 1: (9692/5501, lift 1.3)
    D-cache miss rate > …
    -> class int [0.432]

Rule 2: (3276/147, lift 2.8)
    D-cache miss rate <= …
    -> class media [0.955]

Rule 3: (3018/1, lift 3.1)
    IPC <= …
    IPB > …
    RUU occupancy > …
    LSQ occupancy > …
    D-cache miss rate > …
    -> class fp [0.999]

Rule 4: (2720, lift 2.9)
    IPC > …
    IPB > …
    LSQ occupancy <= …
    -> class media [1.000]

Rule 5: (791/1, lift 3.1)
    IPC <= …
    IPB > …
    RUU occupancy > …
    I-cache miss rate > …
    D-cache miss rate > …
    -> class fp [0.997]

Rule 6: (630, lift 2.9)
    IPB > …
    L2-cache miss rate <= …
    -> class media [0.998]

Rule 7: (471, lift 3.1)
    IPB > …
    L2-cache miss rate > …
    -> class fp [0.998]

Rule 8: (272, lift 2.9)
    IPB > …
    LSQ occupancy <= …
    -> class media [0.996]

Rule 9: (94, lift 3.0)
    IPB <= …
    D-cache miss rate <= …
    L2-cache miss rate > …
    -> class int [0.990]

Rule 10: (3010/18, lift 3.0)
    IPB <= …
    D-cache miss rate > …
    L2-cache miss rate <= …
    -> class int [0.994]

Rule 11: (146, lift 3.0)
    RUU occupancy <= …
    D-cache miss rate > …
    -> class int [0.993]

Rule 12: (2637/167, lift 2.9)
    IPC <= …
    IPB > …
    LSQ occupancy > …
    D-cache miss rate > …
    -> class fp [0.936]

Rule 13: (2172, lift 2.9)
    IPC > …
    IPB > …
    RUU occupancy > …
    -> class media [1.000]

Rule 14: (75/9, lift 2.5)
    LSQ occupancy <= …
    -> class media [0.870]

Rule 15: (66, lift 2.9)
    IPB > …
    LSQ occupancy <= …
    -> class media [0.985]

Rule 16: (720/10, lift 3.1)
    IPC <= …
    IPB > …
    LSQ occupancy > …
    D-cache miss rate > …
    D-cache miss rate <= …
    L2-cache miss rate > …
    -> class fp [0.985]

Rule 17: (8, lift 2.7)
    IPC <= …
    IPB > …
    IPB <= …
    RUU occupancy <= …
    I-cache miss rate <= …
    L2-cache miss rate > …
    -> class int [0.900]

Rule 18: (81, lift 2.9)
    IPC > …
    IPB > …
    IPB <= …
    L2-cache miss rate <= …
    -> class media [0.988]

Rule 19: (354/1, lift 3.1)
    IPB <= …
    IFQ occupancy > …
    RUU occupancy <= …
    D-cache miss rate > …
    -> class fp [0.994]

Rule 20: (425/1, lift 2.9)
    IPB > …
    IPB <= …
    RUU occupancy > …
    RUU occupancy <= …
    I-cache miss rate <= …
    -> class media [0.995]

Default class: media

Evaluation on training data (12968 cases):

    Rules: 20    Errors: 19 (0.1%)

    (a)  (b)  (c)    <-classified as
     …    …    …     (a): class int
     …    …    …     (b): class fp
     …    …    …     (c): class media

Rule utility summary:

    Rules …    Errors (8.8%)
    Rules …    Errors (0.9%)
    Rules …    Errors (0.3%)

Evaluation on test data (1441 cases):

    Rules: 20    Errors: 2 (0.1%)

    (a)  (b)  (c)    <-classified as
    434   …    …     (a): class int
     …    …    …     (b): class fp
     …    …    …     (c): class media

Rule utility summary:

    Rules …    Errors (8.2%)
    Rules …    Errors (1.2%)
    Rules …    Errors (0.3%)

As we describe above, when C5.0 constructs a classifier, it randomly chooses 90% of the data for training and 10% for evaluation; the data used for testing is not used to construct the classifier. On subsequent runs using the same data, C5.0 may construct a different classifier with a lower or higher error rate, because the data may be segmented into training and evaluation sets differently each time [10].

To get a more reliable estimate of the classifier accuracy, C5.0 implements a process called cross-validation to obtain the mean error rate over a specified

number of runs. Cross-validation is a method used to estimate the accuracy of a classification or regression model. It divides the dataset into several parts, and each part in turn is used to evaluate a classifier constructed from the remaining parts. In this case the dataset is divided into ten parts of equal size (around 1,500 cases per part). A total of ten classifiers are constructed; each time one part in turn is used for evaluation and the remaining nine parts are used for training to build the decision rules. The mean error rate is the sum of the ten classifier error rates divided by ten. As shown below, the mean error rate (Mean) of the classifiers constructed from our sample dataset is 0.3%, and the standard error (SE) of the estimated mean error rate is 0.0%. We conclude that using C5.0 we can construct a precise classifier for the SPEC2000INT, SPEC2000FP and Berkeley Multimedia benchmark suites.

    Fold    Rules    Errors
    1-10    …        …        (per-fold values did not survive the transcription)
    Mean    …        0.3%
    SE      …        0.0%

    (a)  (b)  (c)    <-classified as
     …    …    …     (a): class int
     …    …    …     (b): class fp
     …    …    …     (c): class media

6.2 Justification for Four Important Features

From the rules, we can extract the important features that distinguish the benchmark suites. Of the eight features we examine, IPC represents the overall performance of the system; IPB represents control-flow frequency; RUU occupancy indicates utilization of the register update unit, which plays an important role in instruction execution; and the D-cache miss ratio measures the average fraction of data accesses that miss. These four features may distinguish between benchmark classes because we believe they represent the main behavior of a program. Moreover, we observe high correlations between IFQU, RUUU and LSQU in the standard correlation coefficient table (Table 24); these high correlations make it possible to choose RUUU to represent IFQU and LSQU. Although the I-cache, D-cache and L2-cache miss ratios are not as highly correlated as IFQU, RUUU and LSQU, we observe that the I-cache and L2-cache miss ratios are extremely small in some benchmarks, effectively zero. Therefore, we believe the I-cache and L2-cache miss ratios are not good choices as inputs to a classifier. To support our hypothesis, we examine the cross-validation mean error rate with a series of experiments:

129 Table 24: Standard Correlation Coefficient Table of (1) IFQU, RUUU and LSQU, (2) I, D and L2 IFQU- IFQU- RUUU- I- I- D- SPEC2000INT RUUU LSQU LSQU D L2 L2 164.Gzip Gcc Crafty Vortex Twolf Vpr N/A N/A Mcf N/A N/A Parser Eon Bzip N/A N/A 0 PSOC SPEC2000FP 171.Swim Equake N/A N/A Mgrid Applu Ammp Mesa Galgel N/A N/A Art N/A N/A Lucas N/A N/A Apsi PSOC Berkeley Multimedia Ghostscript N/A N/A Mpg GSM Dec GSM Enc MPEG2 Dec DVD MPEG2 Enc DVD PSOC
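The pairwise relationships behind Table 24 are Pearson correlation coefficients (the table reports standardized versions of them). A minimal sketch of the raw coefficient; when one series is constant, as with the near-zero I-cache and L2-cache miss ratios in some benchmarks, the denominator is zero and the coefficient is undefined, which presumably accounts for the N/A entries. The sample data below is illustrative, not thesis data:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length series; None if either is constant."""
    mx, my = mean(x), mean(y)
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return None  # zero variance: correlation undefined (the N/A case)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

ruuu = [40.1, 52.3, 47.8, 60.2]        # hypothetical per-interval occupancy samples
lsqu = [20.0, 26.1, 23.9, 30.1]
print(round(pearson(ruuu, lsqu), 2))   # strongly positive, near 1.0
l2 = [0.0, 0.0, 0.0, 0.0]              # constant (zero) miss-ratio series
print(pearson(ruuu, l2))               # None
```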

(1) The first experiment uses IPC, IPB, RUUU and the D-cache miss ratio as the classifier input features. The mean error rate generated from the cross-validation process is 0.3%, the same as the result obtained from C5.0 using all eight features.

Attributes: IPC, IPB, RUUU and D. (Per-fold cross-validation table: values did not survive the transcription; Mean 0.3%.)

The statistics of the most precise classifier that can be constructed from these four features are shown below.

Evaluation on training data (12968 cases):

    Rules: 21    Errors: 21 (0.2%)

    (a)  (b)  (c)    <-classified as
     …    …    …     (a): class int
     …    …    …     (b): class fp
     …    …    …     (c): class media

Rule utility summary:

    Rules …    Errors (6.7%)
    Rules …    Errors (1.4%)
    Rules …    Errors (0.2%)

Evaluation on test data (1441 cases):

    Rules: 21    Errors: 2 (0.1%)

    (a)  (b)  (c)    <-classified as
     …    …    …     (a): class int
     …    …    …     (b): class fp
     …    …    …     (c): class media

Rule utility summary:

    Rules …    Errors (6.0%)
    Rules …    Errors (1.2%)
    Rules …    Errors (0.1%)

(2) The second experiment uses a feature set comprised of IFQU, LSQU, the I-cache miss ratio and the L2-cache miss ratio as the classifier inputs; the mean error rate increases markedly, to 1.0%.

Attributes: IFQU, LSQU, I and L2. (Per-fold cross-validation table: values did not survive the transcription; Mean 1.0%.)

(3) The third experiment substitutes each of IPC, IPB, RUUU and the D-cache miss ratio in turn with other features to examine the cross-validation mean error rates of other four-feature combinations.

Substituting D with I or L2:
    Attributes: IPC, IPB, RUUU and I. Mean error rate: 0.6%
    Attributes: IPC, IPB, RUUU and L2. Mean error rate: 0.6%

These results indicate that using D in combination with IPC, IPB and RUUU yields higher classification accuracy than using I or L2. We test the effect of RUUU on classification accuracy in a similar manner.

Substituting RUUU with LSQU or IFQU:
    Attributes: IPC, IPB, LSQU and D. Mean error rate: 0.5%
    Attributes: IPC, IPB, IFQU and D. Mean error rate: 0.5%

Examining the classifier accuracy without the IPC or IPB input, the mean error rates also increase.

Substituting IPC with one feature chosen from IFQU, LSQU, I and L2:
    Attributes: IPB, RUUU, LSQU and D. Mean error rate: 0.4%

Substituting IPB with one feature chosen from IFQU, LSQU, I and L2:
    Attributes: IPC, RUUU, LSQU and D. Mean error rate: 0.8%

(4) Finally, we examine classifiers built from only three input features by removing one feature in turn from IPC, IPB, RUUU and D:

Without IPC:  Attributes: IPB, RUUU and D. Mean error rate: 0.6%
Without IPB:  Attributes: IPC, RUUU and D. Mean error rate: 1.1%
Without RUUU: Attributes: IPC, IPB and D. Mean error rate: 0.4%
Without D:    Attributes: IPC, IPB and RUUU. Mean error rate: 0.8%

These results indicate that classifier accuracy decreases with only three feature inputs. In general, as the number of input features decreases, classifier accuracy decreases. Therefore, we conclude that four input features is the smallest number needed to construct a precise classifier.

The experiment results are summarized in Table 25. Bold indicates the lowest mean error rate.

Table 25: Mean Error Rate of the Classifiers

    Features selected                                        Mean error rate
    8 features: IPC, IPB, RUUU, IFQU, LSQU, I, D and L2      0.30%
    4 features: IPC, IPB, RUUU and D                         0.30%
                IFQU, LSQU, I and L2                         1.00%
                IPC, IPB, RUUU and I                         0.60%
                IPC, IPB, RUUU and L2                        0.60%
                IPC, IPB, LSQU and D                         0.50%
                IPC, IPB, IFQU and D                         0.50%
                IPB, RUUU, LSQU and D                        0.40%
                IPC, RUUU, LSQU and D                        0.80%
    3 features: IPB, RUUU and D                              0.60%
                IPC, RUUU and D                              1.10%
                IPC, IPB and D                               0.40%
                IPC, IPB and RUUU                            0.80%
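Each experiment above follows the same procedure: select a feature subset, run ten-fold cross-validation, and compare mean error rates. A stdlib sketch of that loop, with a stand-in majority-class learner in place of C5.0's rule induction (the learner, names and data are illustrative assumptions, not the thesis tooling):

```python
import random
from collections import Counter

def cross_val_mean_error(rows, labels, k, train):
    """Mean k-fold CV error: each fold in turn is held out for evaluation
    while a classifier is trained on the remaining folds."""
    idx = list(range(len(rows)))
    random.Random(0).shuffle(idx)          # fixed seed for repeatability
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for fold in folds:
        held = set(fold)
        classify = train([(rows[i], labels[i]) for i in idx if i not in held])
        wrong = sum(classify(rows[i]) != labels[i] for i in fold)
        fold_errors.append(wrong / len(fold))
    return sum(fold_errors) / k

def majority_trainer(training_pairs):
    """Stand-in learner: always predict the most common training class."""
    top = Counter(label for _, label in training_pairs).most_common(1)[0][0]
    return lambda row: top

# Illustrative data: feature vectors (e.g. [IPC, IPB, RUUU, D]) and suite labels.
rows = [[random.random() for _ in range(4)] for _ in range(30)]
labels = ["int"] * 30
print(cross_val_mean_error(rows, labels, 10, majority_trainer))  # 0.0
```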

Based on the above experiments and analysis, we conclude that IPC, IPB, RUUU and the D-cache miss rate can represent the important behavior of the different benchmark classes and can be used to construct an accurate classifier using C5.0.

7 CONCLUSION AND FUTURE WORK

This work examines techniques for the characterization and classification of different benchmarks used in modern computer system performance analysis. The Simple-Scalar performance statistics for the SPEC CPU2000 and Berkeley Multimedia benchmark suites are analyzed using various methods. The results show that through analysis of the IPC behavior, feature correlation, principal components and instruction mix of these benchmark suites, some distinguishing differences can be observed.

7.1 Conclusion

Our primary conclusions regarding this work are the following:

IPC behavior: The benchmarks of the SPEC CPU2000 and Berkeley Multimedia suites show a variety of IPC behaviors. Five types of IPC behavior are identified: periodic, irregular, constant, periodic/constant and irregular/constant. Each benchmark suite includes several of these IPC behavior classes; therefore, IPC behavior does not distinguish the classes.

Feature correlation: The feature correlation in the SPEC2000FP benchmarks is higher than in the other benchmark suites. We demonstrate that the feature

correlation of SPEC2000FP is statistically higher than that of Berkeley Multimedia with 95% confidence using an independent two-sample t-test. The mean of pair correlation (MOPC) of the highly correlated feature pairs in each benchmark suite is statistically different between the benchmark suites. The amount or level of feature correlation cannot be determined from IPC behavior.

Principal Component Analysis (PCA): Principal component one (PC1) frequency is the best factor for distinguishing the SPEC CPU2000 and Berkeley Multimedia benchmark suites. PC1 frequency is classified into four categories representing the PC1 frequency level: low, high, approximately low and approximately high.

The low PC1 frequency of the SPEC2000INT benchmarks indicates that SPEC2000INT is characterized by non-periodic or low-frequency periodic program feature behavior. We believe this is probably due to the function-call intensity of the integer programs.

The high PC1 frequency of the SPEC2000FP benchmarks indicates that SPEC2000FP is characterized by high-frequency periodic program feature behavior. We believe this is probably due to the loop intensity of the FP programs.

The approximately high or approximately low PC1 frequency of the Berkeley Multimedia benchmarks indicates that Berkeley Multimedia is

characterized by approximately non-periodic program feature behavior, due to the high noise in the PC1 frequency.

Instruction and branch mix: SPEC2000INT has a (statistically) larger average percentage of unconditional branches (direct and indirect) than the other benchmark suites in both the instruction and branch mix; we believe this is evidence that integer programs are more procedure-call intensive. SPEC2000FP has a (statistically) larger average percentage of conditional direct branches in the branch mix; we believe this is evidence that floating-point programs tend to be loop intensive. Berkeley Multimedia has a (statistically) larger average percentage of loads and integer computations, and a (statistically) smaller average percentage of stores and branches, than the other benchmark suites in the instruction mix; moreover, it has a (statistically) larger average percentage of conditional direct branches in the branch mix.

C5.0 classification: IPC, IPB, RUUU and the D-cache miss rate can represent the important behavior of the different benchmark suites and can be used to construct an accurate classifier using C5.0. The mean error rate of the classifier constructed from these features is 0.3%.

The characteristics summarized above are consistent with the known behavior of these benchmarks; some benchmarks fit these characterizations and some do not (Table 26). Nevertheless, these characteristics are useful for classifying most of the benchmarks.

For Principal Component Analysis: 1) Most of the SPEC2000INT benchmarks have low PC1 frequencies, indicating that SPEC2000INT is characterized by non-periodic or low-frequency periodic program feature behavior. However, there are three exceptions: Eon and Bzip2 show high PC1 frequencies similar to the SPEC2000FP benchmarks, and Parser shows an approximately low PC1 frequency similar to the Berkeley Multimedia benchmarks. 2) Most of the SPEC2000FP benchmarks have high PC1 frequencies, indicating that SPEC2000FP is characterized by high-frequency periodic program feature behavior. There are two exceptions, Galgel and Art: Galgel has a low PC1 frequency similar to the SPEC2000INT benchmarks, and Art shows an approximately high PC1 frequency similar to the Berkeley Multimedia benchmarks. 3) Most of the Berkeley Multimedia benchmarks have approximately high or approximately low PC1 frequencies. The exception is Mpeg2 enc DVD, which shows a low PC1 frequency similar to the SPEC2000INT

benchmarks.

For the instruction and branch mix: 1) Most of the SPEC2000INT benchmarks have (statistically) larger average percentages of unconditional branches (direct and indirect) than the other benchmarks. The exceptions are Gcc and Vpr, which have a larger average percentage of conditional direct branches than unconditional branches. 2) Most of the SPEC2000FP benchmarks have (statistically) larger average percentages of conditional direct branches in the branch mix. The exception is Mesa, which has a larger average percentage of unconditional branches than conditional direct branches. 3) The Berkeley Multimedia benchmarks have no exceptions in this case.

Table 26: The Exceptions in Each Benchmark Suite

    Benchmark Suite        Exceptions (PCA)       Exceptions (Instruction and Branch Mixes)
    SPEC2000INT            Eon, Bzip2, Parser     Gcc, Vpr
    SPEC2000FP             Galgel, Art            Mesa
    Berkeley Multimedia    Mpeg2 enc DVD          None

As summarized, SPEC2000INT is characterized by non-periodic or low-frequency periodic program behavior and a high unconditional branch percentage.

SPEC2000FP is characterized by high-frequency periodic program behavior and a high conditional direct branch percentage. We conclude that these characteristics of PC1 behavior and branch composition in the SPEC CPU2000 benchmarks are correlated with their program characteristics: integer programs are procedure-call intensive and FP programs are loop intensive. A procedure-call-intensive program needs unconditional branches to save the return address; different functions may be called in varying order, or repeated only over long cycles, which causes the non-periodic or low-frequency periodic program behavior in SPEC2000INT. A loop-intensive program needs conditional direct branches to implement its repeating loops; loop bodies may execute repeatedly within a sampling interval, which causes the high-frequency periodic program behavior in SPEC2000FP. The main differences between SPEC2000INT and SPEC2000FP rest on these program characteristics; therefore, continued effort should be made to find the characteristics of the code inside the programs.

In this work, a total of five different techniques are studied for dynamic workload classification. These techniques are listed in Table 27 in order of accuracy and of feasibility of implementation. According to our experimental results, Decision Tree Classification has the highest accuracy of the techniques and IPC Behavior Classification the lowest. Instruction and Branch Composition has the highest implementation feasibility, because counting instructions and calculating the instruction PDF are easier to implement than the other techniques. In contrast, PCA has the lowest implementation feasibility because it has to generate the

correlation coefficients, which require a large amount of computation to obtain accurately. The calculations for the principal component scores and their time-series frequency are also difficult to implement. As we can observe from Table 27, Decision Tree Classification scores well on both accuracy and feasibility of implementation. If we can simplify the decision rules while keeping an acceptable accuracy, Decision Tree Classification will be the best technique for dynamic workload classification.

Table 27: Techniques for Dynamic Workload Classification

    Rank           Accuracy                              Feasibility of Implementation
    1 (highest)    Decision Tree Classification          Instruction and Branch Composition
    2              Instruction and Branch Composition    Decision Tree Classification
    3              Principal Component Analysis          IPC Behavior Classification
    4              Feature Correlation                   Feature Correlation
    5 (lowest)     IPC Behavior Classification           Principal Component Analysis

7.2 Future Work

Simplifying the decision tree rules to develop an accurate and easily implemented classifier is the first step we should take toward dynamic workload classification. Instruction and Branch Composition is also a good technique if we can make it more accurate. PCA, too, may be used for dynamic classification if we can simplify its calculation.

A few further steps may help us find and understand the distinguishing characteristics of each benchmark suite. Instead of supervised learning, we could use an unsupervised classifier and let the classifier decide what the classes are, how many there are, and how to classify them through the learning process. Using this method, we may be able to see whether the benchmarks studied in this work fall into classes other than SPEC CPU2000 and Berkeley Multimedia. Further, time-series analysis can be used to identify the nature of the phenomenon represented by a series plot and to predict future values. In further research, autocorrelation and ARIMA methodology can be used to find periodic patterns and thereby identify the periodic behavior of benchmarks, instead of the visual observation used here. Finally, we believe that feature behavior is heavily influenced by the manner in which a benchmark is implemented at the code level. Therefore, the code-level implementation of the different benchmarks should be studied to obtain a deeper understanding of the distinguishing characteristics of the benchmark suites.
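The autocorrelation approach suggested above can be sketched directly: for a periodic feature series, the lag-k autocorrelation peaks when k matches the period, so the period can be detected without visual inspection. The IPC samples below are synthetic, with a built-in period of 4 (not thesis data):

```python
from statistics import mean

def autocorrelation(series, lag):
    """Lag-k sample autocorrelation of a time series."""
    m = mean(series)
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    den = sum((x - m) ** 2 for x in series)
    return num / den

ipc = [0.8, 1.2, 1.6, 1.2] * 8          # synthetic IPC trace with period 4
best = max(range(1, 9), key=lambda k: autocorrelation(ipc, k))
print(best)  # 4
```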

APPENDICES

145 A CORRELATION COEFFICIENT TABLE AND STANDARD CORRELATION COEFFICIENT TABLE A.1 Correlation Table of the SPEC2000INT Benchmarks Table A1: Correlation Coefficient Table of SPEC2000INT CINT2000 Program IPC- IPB IPC- IFQU IPC- RUUU IPC- LSQU IPC- I IPC- D IPC- L2 164.Gzip Gcc Crafty Vortex Twolf Vpr N/A Mcf N/A Parser Eon Bzip N/A IPB- IFQU IPB- RUUU IPB- LSQU IPB- I IPB- D IPB- L2 IFQU- RUUU 164.Gzip Gcc Crafty Vortex Twolf Vpr N/A Mcf N/A Parser Eon Bzip N/A IFQU- IFQU- IFQU- IFQU- RUUU- RUUU- RUUU- LSQU I D L2 LSQU I D 127

146 Table A1. (continued) 164.Gzip Gcc Crafty Vortex Twolf Vpr 0.42 N/A N/A Mcf N/A N/A Parser Eon Bzip N/A N/A RUUU- L2 LSQU- I LSQU- D LSQU- L2 I- D I- L2 D- L2 164.Gzip Gcc Crafty Vortex Twolf Vpr 0.96 N/A N/A N/A Mcf 0.70 N/A N/A N/A Parser Eon Bzip N/A N/A N/A 0.04 Table A2: Standard Correlation Coefficient Table of SPEC2000INT CINT2000 Program IPC- IPB IPC- IFQU IPC- RUUU IPC- LSQU IPC- I IPC- D IPC- L2 164.Gzip Gcc Crafty Vortex Twolf

147 Table A2. (continued) 175.Vpr N/A Mcf N/A Parser Eon Bzip N/A PSOC IPB- IFQU IPB- RUUU IPB- LSQU IPB- I IPB- D IPB- L2 IFQU- RUUU 164.Gzip Gcc Crafty Vortex Twolf Vpr N/A Mcf N/A Parser Eon Bzip N/A PSOC IFQU- LSQU IFQU- I IFQU- D IFQU- L2 RUUU- LSQU RUUU- I RUUU- D 164.Gzip Gcc Crafty Vortex Twolf Vpr 0.5 N/A N/A Mcf 1.0 N/A N/A Parser Eon Bzip2 0.5 N/A N/A 0.5 PSOC

148 Table A2. (continued) RUUU- L2 LSQU- I LSQU- D LSQU- L2 I- D I- L2 D- L2 164.Gzip Gcc Crafty Vortex Twolf Vpr 1.0 N/A N/A N/A Mcf 1.0 N/A N/A N/A Parser Eon Bzip2 1.0 N/A N/A N/A 0.0 PSOC WSOC MOC 164.Gzip Gcc Crafty Vortex Twolf Vpr Mcf Parser Eon Bzip PSOC

A.2 Correlation Table of the SPEC2000FP Benchmarks

(The numeric entries of Tables A3 and A4 were lost in transcription; their structure is summarized below.)

Table A3: Correlation Coefficient Table of SPEC2000FP. Rows: the CFP2000 programs 171.Swim, 183.Equake, Mgrid, Applu, Ammp, Mesa, Galgel, Art, 189.Lucas, and Apsi. Columns: the same metric pairs as in Table A1 (IPC-IPB through D-L2), with N/A where a coefficient could not be computed for a benchmark.

Table A4: Standard Correlation Coefficient Table of SPEC2000FP. Same programs and metric pairs as Table A3, with standardized values, a PSOC summary row per metric-pair column, and a closing WSOC/MOC summary per program.

A.3 Correlation Table of the Berkeley Multimedia Benchmarks

(The numeric entries of Tables A5 and A6 were lost in transcription; their structure is summarized below.)

Table A5: Correlation Coefficient Table of Berkeley Multimedia. Rows: the multimedia programs Ghostscript, Mpg123, Gsm dec, Gsm en, Mpeg dec, and Mpeg en. Columns: the same metric pairs as in Table A1 (IPC-IPB through D-L2), with N/A where a coefficient could not be computed for a benchmark.

Table A6: Standard Correlation Coefficient Table of Berkeley Multimedia. Same programs and metric pairs as Table A5, with standardized values, a PSOC summary row per metric-pair column, and a closing WSOC/MOC summary per program.
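The entries in Tables A1-A6 are pairwise correlation coefficients computed over per-interval samples of the micro-architectural metrics; an N/A arises when one of the two metrics never varies, leaving the coefficient undefined. A minimal sketch of the computation (assuming the usual Pearson formulation; the metric names and sample values below are illustrative, not thesis measurements):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sample lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None  # one metric is constant: undefined, reported as N/A
    return cov / (sx * sy)

# Illustrative per-interval samples for one metric pair (not thesis data):
ipc = [1.2, 1.1, 0.9, 1.3, 1.0]
dcache_miss = [0.02, 0.03, 0.05, 0.01, 0.04]
r = pearson(ipc, dcache_miss)
```

Computing this for every metric pair and every benchmark fills a table shaped like Table A1; Tables A2, A4, and A6 appear to map each coefficient onto a standardized level (0.0, 0.5, or 1.0) before summarizing with PSOC, WSOC, and MOC.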

B PC1 AND PC2 (PCA PLOT)

B.1 PCA Plot of the SPEC2000INT Benchmarks (the rest)

Figure B1: PCA plot of Gzip
Figure B2: PCA plot of Gcc
Figure B3: PCA plot of Twolf
Figure B4: PCA plot of Vpr
Figure B5: PCA plot of Mcf

B.2 PCA Plot of the SPEC2000FP Benchmarks (the rest)

Figure B6: PCA plot of Swim
Figure B7: PCA plot of Mgrid
Figure B8: PCA plot of Ammp
Figure B9: PCA plot of Apsi

B.3 PCA Plot of the Berkeley Multimedia Benchmarks (the rest)

Figure B10: PCA plot of Gsm enc
Figure B11: PCA plot of Mpeg2 dec DVD
Figure B12: PCA plot of Mpeg2 enc DVD
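The PC1-versus-PC2 scatter plots in this appendix come from a principal component analysis of the per-interval metric samples. A hedged sketch of how such coordinates can be obtained (the data matrix, the metric choice, and the standardization step are assumptions for illustration, not the thesis pipeline):

```python
import numpy as np

# Illustrative data matrix: rows are sampling intervals, columns are
# micro-architectural metrics (values are made up, not thesis data).
X = np.array([
    [1.2, 0.02, 0.40],
    [1.1, 0.03, 0.42],
    [0.9, 0.05, 0.55],
    [1.3, 0.01, 0.38],
    [1.0, 0.04, 0.50],
])

# Standardize each metric, then eigendecompose the covariance matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)
vals, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(vals)[::-1]        # reorder: largest variance first
pcs = Z @ vecs[:, order]              # projected samples
pc1, pc2 = pcs[:, 0], pcs[:, 1]       # coordinates for a PC1-vs-PC2 plot
```

Each row of pcs gives one sampling interval's (PC1, PC2, ...) coordinates; plotting the first column against the second reproduces the style of Figures B1-B12.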

C PC1 TIME SERIES

(Each figure plots PC1 against the sampling-point index.)

C.1 PC1 Time Series of the SPEC2000INT Benchmarks (the rest)

Figure C1: PC1 Time Series of Gzip
Figure C2: PC1 Time Series of Twolf
Figure C3: PC1 Time Series of Vpr

C.2 PC1 Time Series of the SPEC2000FP Benchmarks (the rest)

Figure C4: PC1 Time Series of Swim
Figure C5: PC1 Time Series of Equake
Figure C6: PC1 Time Series of Mgrid
Figure C7: PC1 Time Series of Ammp
Figure C8: PC1 Time Series of Mesa
Figure C9: PC1 Time Series of Lucas
Figure C10: PC1 Time Series of Apsi

C.3 PC1 Time Series of the Berkeley Multimedia Benchmarks (the rest)

Figure C11: PC1 Time Series of Mpg123
Figure C12: PC1 Time Series of Gsm dec
Figure C13: PC1 Time Series of Gsm en
Figure C14: PC1 Time Series of Mpeg2 dec DVD

D PC1 FREQUENCY

D.1 PC1 Frequency of the SPEC2000INT Benchmarks (the rest)

Figure D1: PC1 Frequency of Gzip
Figure D2: PC1 Frequency of Twolf
Figure D3: PC1 Frequency of Vpr

D.2 PC1 Frequency of the SPEC2000FP Benchmarks (the rest)

Figure D4: PC1 Frequency of Swim
Figure D5: PC1 Frequency of Equake
Figure D6: PC1 Frequency of Mgrid
Figure D7: PC1 Frequency of Mesa
Figure D8: PC1 Frequency of Lucas
Figure D9: PC1 Frequency of Ammp
Figure D10: PC1 Frequency of Apsi

D.3 PC1 Frequency of the Berkeley Multimedia Benchmarks (the rest)

Figure D11: PC1 Frequency of Ghostscript
Figure D12: PC1 Frequency of Gsm dec
Figure D13: PC1 Frequency of Gsm enc
Figure D14: PC1 Frequency of Mpeg2 enc DVD
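The PC1 frequency figures show the spectral content of each benchmark's PC1 time series, which makes periodic phase behavior visible as peaks. One common way to obtain such a spectrum is an FFT of the series; the sketch below uses a synthetic signal (a sine of period 16 plus noise) rather than thesis data:

```python
import numpy as np

# Synthetic PC1 time series: a periodic phase component plus noise.
# (Illustrative only; the thesis series come from simulation intervals.)
n = 256
t = np.arange(n)
pc1 = np.sin(2 * np.pi * t / 16) + 0.1 * np.random.default_rng(0).standard_normal(n)

spectrum = np.abs(np.fft.rfft(pc1))        # magnitude spectrum
freqs = np.fft.rfftfreq(n, d=1.0)          # cycles per sampling interval
peak = freqs[np.argmax(spectrum[1:]) + 1]  # dominant nonzero frequency
```

For this synthetic series the dominant peak lands at 1/16 cycles per sampling interval, matching the injected period; in the figures above, such peaks correspond to repeating program phases.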

E FEATURE BEHAVIOR

(Each figure plots the feature metrics against the number of instructions executed.)

E.1 Feature Behavior of the SPEC2000INT Benchmarks (the rest)

Figure E1: Feature Behavior of Gzip
Figure E2: Feature Behavior of Gcc
Figure E3: Feature Behavior of Crafty
Figure E4: Feature Behavior of Vortex
Figure E5: Feature Behavior of Vpr
Figure E6: Feature Behavior of Mcf
Figure E7: Feature Behavior of Eon
Figure E8: Feature Behavior of Bzip2

E.2 Feature Behavior of the SPEC2000FP Benchmarks (the rest)

Figure E9: Feature Behavior of Equake
Figure E10: Feature Behavior of Mgrid
Figure E11: Feature Behavior of Applu
Figure E12: Feature Behavior of Ammp
Figure E13: Feature Behavior of Galgel
Figure E14: Feature Behavior of Lucas
Figure E15: Feature Behavior of Apsi

E.3 Feature Behavior of the Berkeley Multimedia Benchmarks (the rest)

Figure E16: Feature Behavior of Ghostscript
Figure E17: Feature Behavior of Mpg123
Figure E18: Feature Behavior of Mpeg2 dec DVD
Figure E19: Feature Behavior of Mpeg2 enc DVD


More information

Transposition Mechanism for Sparse Matrices on Vector Processors

Transposition Mechanism for Sparse Matrices on Vector Processors Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

More information

Opleiding Informatica

Opleiding Informatica Opleiding Informatica Energy Efficiency across Programming Languages Revisited Emiel Beinema Supervisors: Kristian Rietveld & Erik van der Kouwe BACHELOR THESIS Leiden Institute of Advanced Computer Science

More information

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT CSE 560 Practice Problem Set 4 Solution 1. In this question, you will examine several different schemes for branch prediction, using the following code sequence for a simple load store ISA with no branch

More information

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &

More information

EEC 686/785 Modeling & Performance Evaluation of Computer Systems. Lecture 11

EEC 686/785 Modeling & Performance Evaluation of Computer Systems. Lecture 11 EEC 686/785 Modeling & Performance Evaluation of Computer Systems Lecture Department of Electrical and Computer Engineering Cleveland State University wenbing@ieee.org (based on Dr. Raj Jain s lecture

More information

Fast Path-Based Neural Branch Prediction

Fast Path-Based Neural Branch Prediction Fast Path-Based Neral Branch Prediction Daniel A. Jiménez http://camino.rtgers.ed Department of Compter Science Rtgers, The State University of New Jersey Overview The context: microarchitectre Branch

More information

BACHELOR OF TECHNOLOGY DEGREE PROGRAM IN COMPUTER SCIENCE AND ENGINEERING B.TECH (COMPUTER SCIENCE AND ENGINEERING) Program,

BACHELOR OF TECHNOLOGY DEGREE PROGRAM IN COMPUTER SCIENCE AND ENGINEERING B.TECH (COMPUTER SCIENCE AND ENGINEERING) Program, BACHELOR OF TECHNOLOGY DEGREE PROGRAM IN COMPUTER SCIENCE AND ENGINEERING B.TECH (COMPUTER SCIENCE AND ENGINEERING) Program, 2018-2022 3.1 PROGRAM CURRICULUM 3.1.1 Mandatory Courses and Credits The B.Tech

More information

How to deal with uncertainties and dynamicity?

How to deal with uncertainties and dynamicity? How to deal with uncertainties and dynamicity? http://graal.ens-lyon.fr/ lmarchal/scheduling/ 19 novembre 2012 1/ 37 Outline 1 Sensitivity and Robustness 2 Analyzing the sensitivity : the case of Backfilling

More information

Summarizing Measured Data

Summarizing Measured Data Summarizing Measured Data Dr. John Mellor-Crummey Department of Computer Science Rice University johnmc@cs.rice.edu COMP 528 Lecture 7 3 February 2005 Goals for Today Finish discussion of Normal Distribution

More information

Lecture 3, Performance

Lecture 3, Performance Lecture 3, Performance Repeating some definitions: CPI Clocks Per Instruction MHz megahertz, millions of cycles per second MIPS Millions of Instructions Per Second = MHz / CPI MOPS Millions of Operations

More information

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2.

Computer Science. Questions for discussion Part II. Computer Science COMPUTER SCIENCE. Section 4.2. COMPUTER SCIENCE S E D G E W I C K / W A Y N E PA R T I I : A L G O R I T H M S, T H E O R Y, A N D M A C H I N E S Computer Science Computer Science An Interdisciplinary Approach Section 4.2 ROBERT SEDGEWICK

More information

CSE370: Introduction to Digital Design

CSE370: Introduction to Digital Design CSE370: Introduction to Digital Design Course staff Gaetano Borriello, Brian DeRenzi, Firat Kiyak Course web www.cs.washington.edu/370/ Make sure to subscribe to class mailing list (cse370@cs) Course text

More information

Lecture 13: Sequential Circuits, FSM

Lecture 13: Sequential Circuits, FSM Lecture 13: Sequential Circuits, FSM Today s topics: Sequential circuits Finite state machines 1 Clocks A microprocessor is composed of many different circuits that are operating simultaneously if each

More information

4. (3) What do we mean when we say something is an N-operand machine?

4. (3) What do we mean when we say something is an N-operand machine? 1. (2) What are the two main ways to define performance? 2. (2) When dealing with control hazards, a prediction is not enough - what else is necessary in order to eliminate stalls? 3. (3) What is an "unbalanced"

More information

Embedded Systems 23 BF - ES

Embedded Systems 23 BF - ES Embedded Systems 23-1 - Measurement vs. Analysis REVIEW Probability Best Case Execution Time Unsafe: Execution Time Measurement Worst Case Execution Time Upper bound Execution Time typically huge variations

More information

Lecture 5: Performance (Sequential) James C. Hoe Department of ECE Carnegie Mellon University

Lecture 5: Performance (Sequential) James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 5: Performance (Sequential) James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L05 S1, James C. Hoe, CMU/ECE/CALCM, 2018 18 447 S18 L05 S2, James C. Hoe, CMU/ECE/CALCM,

More information

COVER SHEET: Problem#: Points

COVER SHEET: Problem#: Points EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)

More information

Introduction The Nature of High-Performance Computation

Introduction The Nature of High-Performance Computation 1 Introduction The Nature of High-Performance Computation The need for speed. Since the beginning of the era of the modern digital computer in the early 1940s, computing power has increased at an exponential

More information

A Novel Meta Predictor Design for Hybrid Branch Prediction

A Novel Meta Predictor Design for Hybrid Branch Prediction A Novel Meta Predictor Design for Hybrid Branch Prediction YOUNG JUNG AHN, DAE YON HWANG, YONG SUK LEE, JIN-YOUNG CHOI AND GYUNGHO LEE The Dept. of Computer Science & Engineering Korea University Anam-dong

More information

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory

More information

Simple Neural Nets For Pattern Classification

Simple Neural Nets For Pattern Classification CHAPTER 2 Simple Neural Nets For Pattern Classification Neural Networks General Discussion One of the simplest tasks that neural nets can be trained to perform is pattern classification. In pattern classification

More information

CHIP POWER consumption is expected to increase with

CHIP POWER consumption is expected to increase with IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 28, NO. 10, OCTOBER 2009 1503 Utilizing Predictors for Efficient Thermal Management in Multiprocessor SoCs Ayşe Kıvılcım

More information

Computer Architecture. ESE 345 Computer Architecture. Performance and Energy Consumption. CA: Performance and Energy

Computer Architecture. ESE 345 Computer Architecture. Performance and Energy Consumption. CA: Performance and Energy Computer Architecture ESE 345 Computer Architecture Performance and Energy Consumption 1 Two Notions of Performance Plane Boeing 747 DC to Paris 6.5 hours Top Speed 610 mph Passengers Throughput (pmph)

More information

Introduction to Computer Engineering. CS/ECE 252, Fall 2012 Prof. Guri Sohi Computer Sciences Department University of Wisconsin Madison

Introduction to Computer Engineering. CS/ECE 252, Fall 2012 Prof. Guri Sohi Computer Sciences Department University of Wisconsin Madison Introduction to Computer Engineering CS/ECE 252, Fall 2012 Prof. Guri Sohi Computer Sciences Department University of Wisconsin Madison Chapter 3 Digital Logic Structures Slides based on set prepared by

More information