
Chapter 1 Exercises

1. Look up the definition of parallel in your favorite dictionary. How does it compare to the definition of parallelism given in this chapter?

2. Give reasons why 10 bricklayers would not in real life build a wall 10 times faster than one bricklayer.

3. Ten volunteers are formed into a bucket brigade between a small pond and a cabin on fire. Why is this better than each volunteer working individually, carrying water to the fire? Analyze the bucket brigade as a pipeline. How are the buckets returned?

4. Using an assembly line, one wants the conveyor belt to move as fast as possible in order to produce the most widgets per unit time. What determines the maximum speed of the conveyor belt?

5. Assume a conveyor belt assembly line of five tasks. Each task takes T units of time. What is the speedup for manufacturing a) 10 units? b) 100 units? c) units?

6. Plot the graph of the results of problem 5. What is the shape of the curve?

7. Given the assembly line of problem 5, what are the speedups if one of the tasks takes 2T units of time to complete?

8. Assume a conveyor belt assembly line of five tasks. One task takes 2T units of time and the other four each take T units of time. How could the differences in task times be accommodated?

9. Simple Simon has learned that the asymptotic speedup of an n-station pipeline is n. Given a 5-station pipeline, Simple Simon figures he can make it twice as fast if he makes it into a 10-station pipeline by adding 5 do-nothing stages. Is Simple Simon right in his thinking? Explain why or why not.

10. Select a parallel computer or parallel programming language and write a paper on its history.
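Exercises 5 through 8 rest on a simple timing model: in a k-stage line where every stage takes T units, the first unit emerges after k*T and each later unit after one more T, so n units take (k + n - 1)*T instead of the k*n*T needed one at a time. The short FORTRAN 77 program below is a minimal sketch of that model (illustrative only, not part of the original text); it tabulates the resulting speedup for a five-stage line. Running it shows the speedup approaching the five-stage asymptote as the number of units grows.

* Speedup of a K-stage assembly line (pipeline) producing N units.
* Illustrative sketch: every stage is assumed to take T time units
* and the line is never starved, so pipelined time = (K + N - 1)*T.
      PROGRAM PIPESP
      INTEGER K, N, I
      INTEGER UNITS(3)
      REAL SERIAL, PIPED, SPEEDU
      DATA UNITS /10, 100, 1000/
      K = 5
      DO 10 I = 1, 3
         N = UNITS(I)
         SERIAL = REAL(K * N)
         PIPED  = REAL(K + N - 1)
         SPEEDU = SERIAL / PIPED
         PRINT *, 'N = ', N, ' SPEEDUP = ', SPEEDU
   10 CONTINUE
      STOP
      END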


Chapter 2 Measuring Performance

This chapter focuses on measuring the performance of parallel computers. Many customers buy a parallel computer primarily for increased performance. Therefore, measuring performance accurately and in a meaningful manner is important. In this chapter, we will explore several measures of performance assuming a scientific computing environment. After selecting a good measure, we discuss the use of benchmarks to gather performance data. Since the performance of parallel architectures is harder to characterize than scalar ones, a performance model by Hockney is introduced. The performance model is used to identify performance issues with vector processors and SIMD machines. Next, we discuss several special problems associated with the performance of MIMD machines. Also, we discuss how to measure the performance of the new massively parallel machines. Lastly, we explore the physical and algorithmic limitations to increasing performance on parallel computers.

2.1 Measures of Performance

First, we will consider measures of performance. Many measures of performance could be suggested, for example, instructions per second, disk reads and writes per second, memory accesses per second, or bus accesses per second. Before we can decide on a measure, we must ask ourselves what we are measuring. With parallel computers in a scientific computing environment, we are mostly concerned with CPU computing speed in performing numerical calculations. Therefore, a potential measure might be CPU instructions per second. However, in the next section we will find that this is a poor measure.

MIPS as a Performance Measure

We all have seen advertisements claiming that such and such company has an X MIPS machine, where X is 50, 100 or whatever. The measure MIPS (Millions of Instructions Per Second) sounds impressive. However, it is a poor measure of performance, since processors have widely varying instruction sets. Consider, for example, the following data for a CISC (Complex Instruction Set Computer) Motorola MC68000 microprocessor and a RISC (Reduced Instruction Set Computer) Inmos T424 microprocessor.

[Table: total number of instructions and time in seconds for the MC68000 and the T424]

Fig. 2.1 Performance Data for the Sieve of Eratosthenes Benchmark [10]

Both microprocessors are solving the same problem, i.e., a widely used benchmark for evaluating microprocessor performance called the Sieve of Eratosthenes, which finds all the prime numbers up to 10,000. Notice that the T424 with its simpler instruction set must perform almost five times as many instructions as the MC68000.

rate = (total number of instructions) / (time to solve the problem)

rate_T424 = 18.0 MIPS

rate_MC68000 = 1.0 MIPS

10. Inmos Technical Note 3: "IMS T424 - MC68000 Performance Comparison"
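The MIPS arithmetic above is easily mechanized. The sketch below (illustrative only, not part of the original text) computes both rates; the instruction counts and times are placeholder values chosen to be consistent with the rates and ratios quoted in the text, since the actual Figure 2.1 entries are not reproduced here.

* MIPS = (total instructions) / (time in seconds) / 1.0E6
* Hedged sketch: the counts and times below are placeholders
* consistent with the quoted 18.0 and 1.0 MIPS ratings, not
* the actual Figure 2.1 measurements.
      PROGRAM MIPS
      REAL INSTT4, TIMET4, INSTMC, TIMEMC
      REAL MIPST4, MIPSMC
      INSTT4 = 9.0E6
      TIMET4 = 0.5
      INSTMC = 1.8E6
      TIMEMC = 1.8
      MIPST4 = INSTT4 / TIMET4 / 1.0E6
      MIPSMC = INSTMC / TIMEMC / 1.0E6
      PRINT *, 'T424:    ', MIPST4, ' MIPS'
      PRINT *, 'MC68000: ', MIPSMC, ' MIPS'
      PRINT *, 'T424/MC68000 SPEED RATIO: ', TIMEMC / TIMET4
      STOP
      END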

The T424 running the Sieve program executes instructions at the rate of 18.0 MIPS. In contrast, the MC68000 running the same Sieve program executes instructions at the rate of 1.0 MIPS. Although the T424's MIPS rating is 18 times the MC68000's MIPS rating, the T424 is only 3.6 times faster than the MC68000. We conclude that the MIPS rating is not a good indicator of speed. We must be suspicious when we see performance comparisons stated in MIPS. If MIPS is a poor measure, what is a good measure?

MFLOPS as a Performance Measure

A reasonable measure for scientific computations is Millions of FLoating-point Operations Per Second (MFLOPS or Mega FLOPS). Since a typical scientific or engineering program contains a high percentage of floating-point operations, a good candidate for a performance measure is MFLOPS. Most of the time spent executing scientific programs is calculating floating-point values inside of nested loops. Clearly, not all work in a scientific environment is floating-point intensive, e.g., compiling a FORTRAN program. However, the computing industry has found MFLOPS to be a useful measure. Of course, some applications such as expert systems do very few floating-point calculations, and an MFLOPS rating is rather meaningless. A possible measure for expert systems might be the number of logical inferences per second.

2.2 MFLOPS Performance of Supercomputers Over Time

To demonstrate the increase in MFLOPS over the last two decades, the chart below shows some representative parallel machines and their theoretical peak MFLOPS rating. Each was the fastest machine in its day. The chart also includes the number of processors contained in the machine and the year the first machine was shipped to a customer.

[Table: year of first customer shipment, peak MFLOPS, and number of processors for the CDC 6600, ILLIAC IV, Cray-1, CDC 205, Cray X-MP, Cray Y-MP, Cray Y-MP C90, and NEC SX-3]

Fig. 2.2 Peak MFLOPS for the Fastest Computer in That Year

From the chart, we see that the MFLOPS rating has risen at a phenomenal rate in the last two decades. To see if there are any trends, we plot the peak MFLOPS on a logarithmic scale versus the year. The result is almost a straight line! This means the performance increases tenfold about every five years. Can the computer industry continue at this rate? The indications are they can for at least another decade.
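A tenfold increase every five years is the same as a constant annual growth factor of 10**(1/5), roughly 58 percent per year. The short sketch below (illustrative, not part of the original text) computes that factor and shows that the same rate compounds to about a thousandfold gain over fifteen years.

* Convert "tenfold every five years" into an annual growth factor
* and compound it over fifteen years.  Illustrative sketch only.
      PROGRAM TREND
      REAL GROWTH, FACTOR
      INTEGER YEAR
      GROWTH = 10.0 ** (1.0 / 5.0)
      PRINT *, 'ANNUAL GROWTH FACTOR = ', GROWTH
      FACTOR = 1.0
      DO 10 YEAR = 1, 15
         FACTOR = FACTOR * GROWTH
   10 CONTINUE
      PRINT *, 'COMPOUND GROWTH OVER 15 YEARS = ', FACTOR
      STOP
      END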

[Figure: log plot of peak MFLOPS versus year for the machines in Fig. 2.2]

Fig. 2.3 Log Plot of Peak MFLOPS in the Last Two Decades

One caveat: the chart and graph use a machine's theoretical peak performance in MFLOPS. This is not the performance measured in a typical user's program. In a later section, we will explore the differences between "peak" and "useful" performance. Building a GFLOPS (Giga FLOPS or 1000 MFLOPS) machine is a major accomplishment. Do we need faster machines? Yes! In the next section, we will discuss why we need significantly higher performance.

2.3 The Need for Higher Performance Computers

We saw in the last section that supercomputers have grown in performance at a phenomenal rate. Fast computers are in high demand in many scientific, engineering, energy, medical, military and basic research areas. In this section, we will focus on several applications which need enormous amounts of computing power. The first of these is numerical weather forecasting. Hwang and Briggs [Hwang, 1984] is the primary source of the information for this example.

Considering the great benefits of accurate weather forecasting to navigation at sea and in the air, to food production and to the quality of life, it is not surprising that considerable effort has been expended in perfecting the art of forecasting. The weatherman's latest tool is the supercomputer, which is used to predict the weather based on a simulation of an atmospheric model. For the prediction, the weather analyst needs to solve a general circulation model. The atmospheric state is represented by the surface pressure, the wind field, the temperature and the water vapor mixing ratio. These state variables are governed by the Navier-Stokes fluid dynamics equations in a spherical coordinate system.

To solve the continuous Navier-Stokes equations, we discretize both the variables and the equations. That is, we divide the atmosphere into three-dimensional subregions, associate a grid point with each subregion and replace the partial differential equations (defined at infinitely many points in space) with difference equations relating the discretized variables (defined on only the finitely many grid points). We initialize the state variables of each grid point based on the current weather at weather stations around the country. The computation is carried out on this three-dimensional grid that partitions the atmosphere vertically into K levels and horizontally into M intervals of longitude and N intervals of latitude. It is necessary to add a fourth dimension: the number of time steps used in the simulation.

Using a grid size of 270 miles on a side, an appropriate number of vertical levels and time step, a 24-hour forecast for the United States would need to perform about 100 billion data operations. This forecast can be done on a Cray-1 supercomputer in about 100 minutes. However, a grid of 270 miles on a side is very coarse. If one grid point was Washington, DC, then 270 miles north is Rochester, New York on Lake Ontario and 270 miles south is Raleigh, North Carolina. The weather can vary drastically in between these three cities! Therefore, we desire a finer grid for a better forecast. If we halve the distance on each side to 135 miles, we also need to halve the vertical level interval and the time step. Halving each of the four dimensions requires at least 16 times more data operations.

time_135-mile grid = 100 minutes x 16 = 1600 minutes = 26.7 hours

Therefore, a Cray-1 would take over 26 hours to compute a 24-hour forecast. We would receive the prediction after the fact; clearly not acceptable! If we want the forecast in 100 minutes, we will need a machine 16 times as fast. If we desire a grid size of 67 miles on a side, we will need a computer 256 times faster. Since weather experts would like to model individual cloud systems, which are much smaller, for example, six miles across, weather and climate researchers will never run out of their need for faster computers.

From Figure 2.2, we observe that the Cray-1 is a 1976 machine and has a peak performance of 160 MFLOPS. Today's machines are a factor of 100 faster and do provide a better forecast. However, reliable long-range forecasts require an even finer grid for a lot more time steps, which is why climate modeling is a Grand Challenge Problem. In 1991, the United States Office of Science and Technology proposed a series of Grand Challenge Problems, i.e., computational areas which require a million MFLOPS (TeraFLOPS). The U.S. Government feels that effective solutions in these areas are critically important to its national economy and well-being. The Grand Challenge Problems are listed in Figure 2.4.

Climate Modeling - weather forecasting; global models.
Fluid Turbulence - air flow over an airplane; reentry dynamics for spacecraft.
Pollution Dispersion - acid rain; air and water pollution.
Human Genome - mapping the human genetic material in DNA.
Ocean Circulation - long-term effects; global warming.
Quantum Chromodynamics - particle interaction in high-energy physics.
Semiconductor Modeling - routing of wires on an Integrated Circuit chip.
Superconductor Modeling - special properties of materials.
Combustion Systems - rocket engines.
Vision and Cognition - remote driverless vehicle.

Fig. 2.4 The Grand Challenge Problems that Require a TeraFLOPS Computer

The U.S. computer industry hopes to provide an effective TeraFLOPS computer by the mid-1990s. Other areas that require extensive computing are structural biology and pharmaceutical design of new drugs, for example, a cure for AIDS.

Returning to the weather forecasting example, how close to peak performance did the Cray-1 achieve on this problem? To compute the actual floating-point operations per second (FLOPS), we divide the number of operations by the time spent.

rate_Cray-1 = 100 billion operations / 100 minutes = 16.7 MFLOPS
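The two calculations above, the sustained rate and the cost of refining the grid, can be checked with a few lines of FORTRAN. The sketch below is illustrative only; it simply re-runs the arithmetic quoted in the text.

* Weather-model arithmetic from the text: 100 billion operations
* in 100 minutes on a Cray-1, and a factor of 16 more work each
* time the grid spacing, vertical interval and time step are halved.
* Illustrative sketch, not part of the original chapter.
      PROGRAM WEATHR
      REAL OPS, MINUTE, MFLOPS, HOURS
      INTEGER LEVEL
      OPS    = 100.0E9
      MINUTE = 100.0
      MFLOPS = OPS / (MINUTE * 60.0) / 1.0E6
      PRINT *, 'SUSTAINED RATE = ', MFLOPS, ' MFLOPS'
* each halving of the 270-mile grid multiplies the time by 16
      DO 10 LEVEL = 1, 2
         MINUTE = MINUTE * 16.0
         HOURS  = MINUTE / 60.0
         PRINT *, 'HALVINGS = ', LEVEL, ' TIME = ', HOURS, ' HOURS'
   10 CONTINUE
      STOP
      END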

Surprisingly, the Cray-1, a 160 MFLOPS machine, only performed 16.7 MFLOPS on the weather problem! Why the large discrepancy? First, the FORTRAN compiler can't utilize the machine fully. Second, the pipelined arithmetic functional units are not kept busy. We will explore this issue fully in Chapter Three when we discuss vector processors such as the Cray-1. Also, in Chapter Three we will derive the Cray-1's 160 MFLOPS rating and discuss why it rarely achieved anywhere near peak performance. At the moment, we need only understand that sustained MFLOPS on real programs, and not peak MFLOPS, is what is important. Many purchasers of supercomputers have been disappointed when their application programs have run at only a small fraction of the salesman's quoted peak performance. One way to measure the practical MFLOPS available in a computer is to use benchmark programs.

2.4 Benchmarks as a Measurement Tool

A benchmark is a computer program run on several computers to compare the computers' characteristics. A benchmark might be an often-run application program which typifies the work load at a company. Using the benchmark, the company can obtain a measure of how well a new computer will perform in its environment. The computing industry uses standard benchmarks, for example, the Sieve of Eratosthenes program used in Section 2.1, to evaluate their products. Performance of a computer is based on many aspects including the CPU speed, the memory speed, the I/O speed and the compiler's effectiveness. To incorporate these other effects, we measure the CPU's overall performance by a benchmark program rather than directly, say with a hardware probe.

Devising a benchmark for parallel computers is a little harder because of the wide variety of architectures. However, the computer industry has settled on several standard benchmarks, including the Livermore Loops and LINPACK, for measuring the performance of parallel computers. Here, we will discuss the LINPACK benchmark. Jack J. Dongarra of Oak Ridge National Laboratory compiles the performance of hundreds of computers using the standard LINPACK benchmark [Dongarra, 1992]. The LINPACK software solves dense systems of linear equations. The LINPACK programs can be characterized as having a high percentage of floating-point arithmetic operations and, therefore, are appropriate as benchmarks measuring performance in a scientific computing environment.

The table in Figure 2.5 reports three numbers for each machine listed (in some cases, the numbers are missing because of lack of data). All performance numbers reflect arithmetic performed in full precision (64 bits). The third column lists the LINPACK benchmark for a matrix of order 100 in a FORTRAN environment. No changes are allowed to this code. The fourth column lists the results of solving a system of equations of order 1000, with no restrictions on the method or its implementation. The last column is the theoretical peak performance of the machine, which is based not on an actual program run, but on a paper computation. This is the number manufacturers often cite; the theoretical peak MFLOPS rate represents an upper bound on performance. As Dongarra states, "... the manufacturer guarantees that programs will not exceed this rate -- sort of a speed of light for a given computer."
11. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (64-bit precision) that can be completed during a period of time, usually the cycle time of the machine. For example, the Cray Y-MP/8 has a cycle time of 6 nanoseconds in which the results of both an addition and a multiplication can be completed on a single processor.

Dongarra, Jack J., "LINPACK Benchmark: Performance of Various Computers Using Standard Linear Equations Software," Supercomputing Review, Vol. 5, No. 3, March 1992, p. 55.
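That paper calculation is easy to mechanize. The sketch below (illustrative, not part of the original text) applies the footnote's recipe to the figures just quoted: two floating-point results per 6 ns clock on each of four processors.

* Theoretical peak MFLOPS = (floating-point results per clock)
* divided by (clock period in seconds), summed over processors.
* Hedged sketch using the Cray Y-MP figures quoted in the text.
      PROGRAM PEAK
      REAL CYCLE, OPS, PERCPU, TOTAL
      INTEGER NPROC
      CYCLE  = 6.0E-9
      OPS    = 2.0
      NPROC  = 4
      PERCPU = OPS / CYCLE / 1.0E6
      TOTAL  = PERCPU * REAL(NPROC)
      PRINT *, 'PEAK PER PROCESSOR = ', PERCPU, ' MFLOPS'
      PRINT *, 'PEAK FOR ', NPROC, ' PROCESSORS = ', TOTAL, ' MFLOPS'
      STOP
      END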

(2 operations / 1 cycle) * (1 cycle / 6 ns) = 333 MFLOPS

Since the Cray Y-MP/8 could have up to four processors, the peak performance is 1,333 MFLOPS. The column labeled Computer gives the name of the computer hardware and indicates the number of processors and the cycle time of a processor in nanoseconds.

Computer                             Year    Standard LINPACK    Best Effort LINPACK    Theoretical Peak Performance
CDC 6600 (100 ns)                     -            -                    -                       -
Cray-1S (12.5 ns)                     -            -                    -                       -
CDC 205 (4-pipe, 20 ns)               -            -                    -                       -
Cray X-MP/416 (2 procs., 8.5 ns)      -            -                    -                       -
Cray Y-MP/832 (4 procs., 6 ns)        -            -                    -                     1,333
Cray Y-MP C90 (16 procs., 4.2 ns)     -            -                    -                    16,000
NEC SX3/44 (4 procs., 2.9 ns)         -            -                    -                    22,000

Fig. 2.5 LINPACK Benchmarks in MFLOPS for Some Supercomputers

Glancing at the table, we observe that the Standard LINPACK measurements are a small percentage of the theoretical peak MFLOPS. The Standard LINPACK rating is a good approximation of the performance we would expect from an application written in FORTRAN and directly ported (no changes in the code) to the new machine. Notice that the 12 MFLOPS Standard LINPACK rating of the Cray-1 is comparable to the 16.7 MFLOPS we calculated for the weather code. For FORTRAN code, 10 to 20 MFLOPS was typical on the Cray-1.

The improvement of the Best Effort column over the Standard LINPACK column reflects two effects. First, the problem size is larger (a matrix of order 1000), which provides the hardware, especially the arithmetic pipelines, more opportunity for reaching near-asymptotic rates. Second, modification or replacement of the algorithm and software were permitted to achieve as high an execution rate as possible. For example, a critical part of the code might be carefully hand coded in assembly language to match the architecture of the machine.

As one might expect, manufacturers have worked hard to improve their LINPACK benchmark ratings. To improve the Standard LINPACK rating for a fixed machine, one enhances the FORTRAN compiler by providing optimizations which better utilize the machine. To illustrate the possible improvement in compiler technology: Cray Research raised the Standard LINPACK rating on the Cray-1S from 12 MFLOPS in 1983 with Cray's CFT FORTRAN compiler (version 1.12) to 27 MFLOPS on a current run with their cf77 compiler (version 2.1). The LINPACK and other benchmarks provide a valuable way to compare computers in their performance on floating-point operations per second.

2.5 Hockney's Parameters r∞ and n1/2

Roger Hockney has developed a performance model [Hockney, 1988] which attempts to characterize the effective parallelism of a computer. His original model focused on the performance of vector computers, e.g., the Cray-1, but he has expanded his model to include SIMD and MIMD machines as well.

12. No data is available for the Cray-1 cited in Figure 2.2. The Cray-1S was an upgrade of the I/O and memory systems that appeared three years later.

13. The Cray X-MP/416 is an upgrade of the earlier Cray X-MP/2. The system clock was sped up from 9.5 ns to 8.5 ns, which accounts for the increase in peak performance from 420 to 470 MFLOPS.

First we will explore his original model; then we will explore his extensions.

In the last section, we observed a large disparity between the Standard LINPACK rating and the theoretical peak performance on supercomputers, for example, 12 MFLOPS versus 160 MFLOPS on the Cray-1. For a vector computer, e.g., the Cray-1, the disparity is attributed partly to the FORTRAN compiler's inability to fully utilize the machine, especially the vector hardware. However, the major source of slowdown is the lack of work to keep the pipelined arithmetic functional units busy. Recall the pipelined floating-point adder of Chapter 1. In a vector processor (see Chapter Three), special machine instructions route vectors of floating-point values through the pipelined functional units. Only after a pipeline is full do we obtain an asymptotic speedup equal to the number of stages in the pipeline. Therefore, supercomputer floating-point performance depends heavily on the percentage of code with vector operations and on the lengths of those vectors. Scalar code (no vectors available) runs at about 12 MFLOPS on the Cray-1, while highly vectorizable code achieves performance close to the theoretical peak of 160 MFLOPS. To characterize the effects of this vectorization, Hockney's performance model will be derived in the next section.

Deriving Hockney's Performance Model

In his performance model, Hockney wants to distinguish between the effects of technology, e.g., the clock speed, and the effects of parallelism. All the effects of technology are lumped together into one parameter we will call Q. The effects of parallelism are lumped into a parameter we call P. Let t be the time of a single arithmetic operation, e.g., a vector multiply, on a vector of length n. We assume t is some function of n, Q and P. We call this function F:

t = F(n, Q, P)

First, we consider Q, the effects of technology. For example, if we double the clock speed of a processor, we would expect to halve the time t. We observe that Q appears as a multiplicative factor of another function which depends on n and P. We will call this new function G:

t = (1/r) * G(n, P)

The performance or rate is related to the reciprocal of the time. In the above equation, r is the rate, or the results per unit of time, e.g., results per second. Now we consider P, the effects of parallelism. If the machine is serial, i.e., with no architectural parallelism (P = 0), the time to compute a vector of n elements should be the time of one element multiplied by n. That is, when the machine is serial, P should have little or no effect.

t_serial = (1/r) * n    when P = 0

If the machine is very parallel, P should dominate n in the function G. One of the many possible equations that fits this behavior is the following simple equation:

t = (1/r) * (n + P)

Rewriting the last equation in terms of performance by taking the reciprocal of each side:

performance = 1/t = r / (n + P)

The maximum rate in a parallel computer occurs asymptotically for vectors of infinite length, hence Hockney gives r the subscript of ∞.

performance = 1/t = r∞ / (n + P)

If we assign P equal to n, then performance is one half of the maximum performance.

half performance = 1/t = r∞ / (2n)    when n = P

Hockney names our P as n1/2 to recall the one-half factor of performance. Hockney's performance model is derived by substituting r∞ for r and n1/2 for P in the equation:

t = (1/r∞) * (n + n1/2)    Hockney's Performance Equation

He claims that the two parameters r∞ and n1/2 completely describe the hardware performance of his idealized generic computer and give a first-order description of any real computer. These characteristic parameters are called:

r∞ - the maximum or asymptotic performance - the maximum rate of computation in floating-point operations performed per second. This occurs asymptotically for vectors of infinite length, hence the subscript ∞.

n1/2 - the half-performance length - the vector length required to achieve half the maximum performance.

For a particular machine, r∞ and n1/2 are constants. We can compare different machines by measuring and comparing r∞ and n1/2. The next section discusses how to measure the two parameters.

Measuring r∞ and n1/2

The maximum performance r∞ and the half-performance length n1/2 of a computer are best regarded as experimentally determined quantities, found by timing the performance of a computer on a test program which Hockney calls the (r∞, n1/2) benchmark. Before looking at the benchmark program, we will explore the behavior of Hockney's performance equation. To explore Hockney's performance equation, we plot t, the time for the vector operation, versus n, the vector length, on a graph.

[Figure: plot of Hockney's equation t = (1/r∞) * (n + n1/2), time in seconds versus vector length n; the line has slope 1/r∞ and crosses the n-axis at -n1/2]

Fig. 2.6 Plot of Hockney's Performance Equation
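The equation is also easy to explore numerically. The sketch below (illustrative only; the parameter values are made up and are not measurements of any machine) evaluates the achieved rate r∞ * n / (n + n1/2) for a range of vector lengths, showing half of r∞ at n = n1/2 and a slow approach to r∞ beyond it.

* Hockney's model: t = (n + NHALF) / RINF, so the achieved rate is
* r(n) = RINF * n / (n + NHALF), reaching RINF / 2 when n = NHALF.
* Hedged sketch with illustrative values (RINF = 100 MFLOPS,
* NHALF = 100); they are not measurements of any real machine.
      PROGRAM HOCKNY
      REAL RINF, XNHALF, RATE
      INTEGER N
      RINF   = 100.0
      XNHALF = 100.0
      DO 10 N = 100, 1600, 300
         RATE = RINF * REAL(N) / (REAL(N) + XNHALF)
         PRINT *, 'N = ', N, ' RATE = ', RATE, ' MFLOPS'
   10 CONTINUE
      STOP
      END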

Notice that the negative of the intercept of the line with the n-axis gives the value of n1/2, and the reciprocal of the slope of the line gives the value of r∞. To determine r∞ and n1/2, we collect data points by running a FORTRAN program, plot the points on a graph and draw the best-fit line through them. We can either eyeball the best-fit line through the points or use a linear least-squares approximation. The parameter r∞ is 1/slope of the line and n1/2 is the negative of the n-axis intercept.

The following FORTRAN program (see Figure 2.7) will vary n over one hundred values and print out the CPU times. The program is designed for FORTRAN 77 on a UNIX-based system. On other systems, you may have to replace the ETIME routine with a routine that returns a REAL value in seconds of the elapsed CPU time (not wall time!). Depending on the speed of a computer, you should adjust the constant NMAX (the maximum range of N). If NMAX is set too low on a very fast machine, e.g., a Cray Y-MP, the machine will finish most or all of the calculation before the CPU clock advances. The timings will be close to the resolution of the system clock and the values will be very noisy and meaningless. If NMAX is set too high on a slow computer, e.g., an IBM PC XT, the program will run for many hours. We adjust NMAX to give a reasonably straight line without having to wait too long for the results. The program measures the CPU time to perform the FORTRAN code for a vector multiplication as follows:

      DO 10 I = 1, N
         A(I) = B(I) * C(I)
   10 CONTINUE

In a pipelined vector processor, e.g., the Cray-1, the DO 10 loop in the above code would be replaced with a vector instruction by the vectorizing compiler. This vector instruction will utilize the pipelined multiplication unit to give a significant increase in MFLOPS performance over a serial processor.

* Performance Measurement Program for UNIX-based systems
* Computes 32 bit floating point r[infinity] and n[1/2]
* Hockney's performance parameters
* By Dan Hyde, March 18, 1992
* Adjust the NMAX constant for a particular machine.
* NMAX should be large enough to obtain meaningful times.
      PARAMETER (NMAX = )
      INTEGER I, N
      REAL T0, T1, T2, T
      REAL A(NMAX), B(NMAX), C(NMAX)
      REAL TARRAY(2)
* initialize B and C to some realistic values (non zero!)
      DO 5 I = 1, NMAX
         B(I) = 12.3
         C(I) = 4.5
    5 CONTINUE
* find overhead to call ETIME routine
* ETIME returns elapsed execution time since start of program
      T1 = ETIME(TARRAY)
      T2 = ETIME(TARRAY)
      T0 = T2 - T1
* vary N for 100 times
      DO 20 N = (NMAX / 100), NMAX, (NMAX / 100)
         T1 = ETIME(TARRAY)
* start of computation to time
         DO 10 I = 1, N
            A(I) = B(I) * C(I)
   10    CONTINUE
* end of computation
         T2 = ETIME(TARRAY)
         T = T2 - T1 - T0
         PRINT *, 'N = ', N, ' TIME = ', T, ' seconds'
   20 CONTINUE
      STOP
      END

Fig. 2.7 FORTRAN Code to Collect Data for Hockney's r∞ and n1/2 Parameters
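Once the (n, time) pairs have been collected, the best-fit line can be computed rather than eyeballed. The sketch below (illustrative and not part of the original text) applies ordinary least squares to a handful of made-up data points; with real measurements from the program above, r∞ is the reciprocal of the fitted slope and n1/2 is the fitted intercept divided by the slope, i.e., the negative of the n-axis intercept.

* Fit a straight line t = A + B*n to the (n, t) pairs produced by
* the program in Fig. 2.7, using ordinary least squares.  Then
* RINF = 1/B and NHALF = A/B (the negative of the n-axis intercept).
* Hedged sketch; the DATA statements hold made-up sample points.
      PROGRAM FITRN
      INTEGER NPTS, I
      PARAMETER (NPTS = 5)
      REAL XN(NPTS), T(NPTS)
      REAL SX, SY, SXX, SXY, A, B, RINF, XNHALF
      DATA XN /100.0, 200.0, 300.0, 400.0, 500.0/
      DATA T  /2.0E-6, 3.0E-6, 4.0E-6, 5.0E-6, 6.0E-6/
      SX  = 0.0
      SY  = 0.0
      SXX = 0.0
      SXY = 0.0
      DO 10 I = 1, NPTS
         SX  = SX  + XN(I)
         SY  = SY  + T(I)
         SXX = SXX + XN(I) * XN(I)
         SXY = SXY + XN(I) * T(I)
   10 CONTINUE
      B = (REAL(NPTS) * SXY - SX * SY) / (REAL(NPTS) * SXX - SX * SX)
      A = (SY - B * SX) / REAL(NPTS)
      RINF   = 1.0 / B / 1.0E6
      XNHALF = A / B
      PRINT *, 'R-INFINITY = ', RINF, ' MFLOPS'
      PRINT *, 'N-HALF     = ', XNHALF
      STOP
      END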

Using r∞ and n1/2

In the table in Figure 2.8, the r∞ and n1/2 parameters were measured by a program similar to the one in Figure 2.7. The Crays and the CDC supercomputers are vector processors. The ICL DAP is an SIMD processor array of 4096 simple processors.

[Table: measured r∞ in MFLOPS, n1/2, and theoretical peak MFLOPS for the Cray-1, the ICL DAP (4096 processors), the CDC 205 (2-pipes), and the Cray X-MP/22 (1 processor); the ICL DAP values are for 32-bit floating-point precision, the rest are for 64-bit]

Fig. 2.8 r∞ and n1/2 Measurements for Some Supercomputers

First, observe that the maximum performance parameter r∞ is not the same as the theoretical peak MFLOPS. The parameter r∞ is based on a program run, while the theoretical peak performance is a paper calculation (see Section 2.4). Also, this r∞ is for a 64-bit floating-point vector multiply and not 32-bit. Other precisions and vector operators have a different r∞. For example, the r∞ is 200 MFLOPS for a 32-bit vector multiply on a CDC 205, or double the 64-bit value. The theoretical peak performance counts the number of 64-bit additions as well as multiplies. For example, the Cray-1 can do an add and a multiply every clock period. Therefore, with a 12.5 nanosecond clock, the peak is 160 MFLOPS.

In the table of Figure 2.8, notice the large range in the n1/2 parameter. Recall that n1/2 is a measure of parallelism, or the length of the vector needed for half the maximum performance. Do we desire a large n1/2? Or a low n1/2? A more parallel machine should be faster because of the speedup, which implies we want a large n1/2. However, a computer solves efficiently only those problems with a vector length greater than its n1/2. Therefore, the higher the value of n1/2, the more limited is the set of problems that the computer may solve efficiently. A large n1/2 implies a more special-purpose machine. Solving a problem with small vectors on a large-n1/2 machine is wasteful of resources, much like carrying one passenger on a city bus. A low-n1/2 computer can solve more problems effectively, or is more general purpose. In conclusion, we want a high r∞ and a low n1/2 together.

Of the machines in the table of Figure 2.8, which one is best? This depends on the application area, the cost of the machine, the r∞ of the machine and other considerations. If your application area has mostly short vectors or is scalar, the Cray-1 with a small n1/2 would be the best choice. If your application has mostly long vectors, the CDC 205 would be the best choice. With the knowledge of the n1/2 of a particular machine, we can select or design an algorithm which better matches the machine. Hockney claims the two parameters r∞ and n1/2 provide us with a quantitative means of comparing the parallelism and maximum performance of all computers. In the next sections, we will explore how effective Hockney's performance model is on computers other than vector processors.

Extending Hockney's Performance Model

Recall that SIMD machines have one instruction unit which issues the same instruction to P processors all executing in lockstep. The ICL DAP in the table of Figure 2.8 is of this class. How well does Hockney's performance model fit SIMD machines? Assume P is the number of processors and n is the length of a vector. If n ≤ P, then we have more processors than we need, and the total time is the time to do one calculation, which is independent of the vector length n. For the more common case when n > P, let t_pe be the time for one processor to compute the vector operation, e.g., a floating-point multiply. Hockney derives the following for SIMD machines:

r∞ = P / t_pe

n1/2 = P / 2

For half performance, we need a vector length of P/2, which makes sense as half of the processors would be used. Notice that, as we increase P, the number of processors in the processor array, both r∞ and n1/2 increase linearly. Therefore, an SIMD machine with a large number of processors has a large n1/2 and tends to be a special-purpose machine. For example, the DAP with its 4096 processors has an n1/2 of 2048. Therefore, Hockney's parameters are useful for SIMD machines as well as vector processors. What about MIMD machines?

2.6 Performance of MIMD Computers

MIMD computers have multiple instruction units issuing instructions to multiple processing units. Recall that the two main subclasses are shared memory and message passing MIMD. Assuming homogeneous processors, replicating a processor P times multiplies the n1/2 and r∞ of an individual processor by P. A real serial processor will have a small but non-zero n1/2 due to loop inefficiencies and other factors. Consequently, a large number of processors implies a large n1/2. However, parameters such as n1/2 and r∞ are only part of the story for MIMD computing. Other problems may dominate for MIMD and produce poor performance. According to Hockney, the three main areas of performance problems (overheads) for MIMD computing are the following [Hockney, 1988]:

1) scheduling work among the available processors (or instruction streams) in such a way as to reduce the idle time of the processors waiting for others to finish.

2) synchronizing the processors so that arithmetic operations take place in the correct order. In most MIMD algorithms the results of one portion of the algorithm are required before starting another portion. If many processors are working on the first portion, the processors must synchronize before they can proceed on to the second portion.

3) accessing arguments from memory or other processors. There are many ways to access arguments or data. For example, overhead costs are associated with memory cell conflicts in a shared memory machine; communication between message passing processors; and misses in a cache-primary memory hierarchy.

In the above three areas, the accessing of arguments is a potential bottleneck for all computers. For example, the slowness of moving vectors of data from memory to the fast pipelined arithmetic functional units is the main cause of the difference between the peak performance rates stated for vector processors and the average performance rates found in realistic user programs. However, the other two areas, scheduling and synchronizing, are new problems introduced by MIMD computing. A vector computer of one processor has no need to synchronize and schedule itself. Synchronization on an SIMD computer is automatic since every instruction is synchronized by the instruction unit. In SIMD computing, all the processors are trivially scheduled to perform the same instruction. Reducing the idle time by scheduling work on MIMD processors is non-trivial. Researchers have studied the problem extensively. We will discuss several techniques later in the book. The effort to balance the work or load across the processors is called load balancing.

Synchronization of MIMD processors may be accomplished by special hardware, e.g., a semaphore on a shared memory machine, or by the arrival of a message on a message passing MIMD. Processors waiting for synchronization events may be a major source of overhead. Hockney derives performance models and efficiency parameters for scheduling, communication, and synchronization for MIMD computing, much like n1/2 and r∞. The interested reader should consult his work [Hockney, 1988]. Notice that communication between processors is involved in both the accessing of data and synchronization. Many of the first MIMD machines had relatively slow communication mechanisms and experienced performance poor enough to incite controversy.

The MIMD Performance Controversy

In the mid 1980s, the computer architecture research community was debating the practicality of large scale MIMD systems with more than a dozen processors. At conferences, some researchers were presenting papers on thousands of processors. Others were arguing that such systems would have dismal performance and weren't worth building.

Minsky's Conjecture

Many researchers in the debates were influenced by a 1971 paper authored by Marvin Minsky of MIT [Minsky, 1971]. Basing his analysis on models of performance, Minsky conjectured that realizable speedups on MIMD machines were on the order of log2 P, where P is the number of processors. Using Minsky's conjecture, one could expect a speedup of only about 4 from a system of 16 processors. Experience with existing multiprocessors during this time tended to support Minsky's point. Several researchers felt that Minsky's conjecture was overly pessimistic and presented their own estimates. Basing his analysis on statistical modeling, Hwang estimates the realizable speedup as P/ln P, where ln is the natural logarithm [Hwang, 1984].

[Table: number of processors P, Minsky's predicted speedup log2 P, and Hwang's estimate P/ln P]

Fig. 2.9 Minsky and Hwang's Predicted Speedup for MIMD Computers

The above analysis explains why in the mid-1980s many computer vendors built multiprocessors consisting of only two or four processors, e.g., the Cray X-MP.
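The two estimates are easy to tabulate. The sketch below (illustrative, not part of the original text) prints Minsky's log2 P and Hwang's P/ln P for machine sizes from 2 to 1024 processors, the kind of entries Figure 2.9 summarized; at P = 16 it reproduces the speedup of about 4 quoted above.

* Tabulate Minsky's conjecture (speedup about LOG2(P)) and Hwang's
* estimate (P / LN(P)) for P = 2, 4, ..., 1024.  Illustrative sketch.
      PROGRAM MIMDSP
      INTEGER K, P
      REAL RP, SMINSK, SHWANG
      PRINT *, '      P     MINSKY      HWANG'
      DO 10 K = 1, 10
         P  = 2 ** K
         RP = REAL(P)
         SMINSK = LOG(RP) / LOG(2.0)
         SHWANG = RP / LOG(RP)
         PRINT *, P, SMINSK, SHWANG
   10 CONTINUE
      STOP
      END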

Amdahl's Law

In 1967, Gene Amdahl [Amdahl, 1967] argued convincingly that one wants fast scalar machines, not MIMD machines. He argued that a small number of sequential operations can effectively limit the speedup of a parallel algorithm. Let f be the fraction of operations in a computation that must be performed sequentially. The maximum speedup achievable by a parallel computer with P processors is the following:

maximum speedup = 1 / (f + (1 - f)/P)    Amdahl's Law

To demonstrate Amdahl's law, consider a program where 10 percent of the operations, i.e., f = 0.1, must be performed sequentially. His law gives the following:

maximum speedup = 1 / (0.1 + 0.9/P)

As the number of processors P increases, the term 0.9/P goes to zero. Therefore, Amdahl's law states that the maximum speedup is 10, no matter how many processors are available.

Karp's Wager

In 1986, Alan Karp [Karp, 1986] proposed his famous wager:

I have just returned from the Second SIAM Conference on Parallel Processing for Scientific Computing in Norfolk, Virginia. There I heard about 1,000 processor systems, 4,000 processor systems, and even a proposed 1,000,000 processor system. Since I wonder if such systems are the best way to do general-purpose scientific computing, I am making the following offer. I will pay $100 to the first person to demonstrate a speedup of at least 200 on a general-purpose, MIMD computer used for scientific computing. This offer will be withdrawn at 11:59 p.m. on December 31, 1995.

With Minsky's conjecture, Amdahl's law, and his own experience, Karp felt his money was safe. If Minsky is right, one would need 2^200 processors (that's a lot of processors!) for a speedup of 200. If Hwang is correct, one would need over 2000 processors. Amdahl's law requires less than 0.5% of the code to be sequential for a speedup of 200. To sweeten the pot, C. Gordon Bell, chief architect of the DEC VAX machines, proposed an additional $1000 prize. This has become known as the Gordon Bell Award, which recognizes the best contributions to parallel processing, either speedup or throughput, for practical, full-scale problems. These prizes are still awarded annually.

To the surprise of many, in March 1988, three researchers, John L. Gustafson, Gary R. Montry and Robert E. Benner, at Sandia National Laboratory demonstrated speedups of 1009 to 1020 on an nCUBE hypercube machine with 1024 processors [Gustafson, 1988]. However, the Sandia group altered the definition of speedup slightly. But even with the accepted definition, they still had speedups of 502 to 637, well over the 200 required for the prizes. The Sandia group argued that the accepted definition was unfair to massively parallel machines.

15. Karp, Alan, "What Price Multiplicity?", Communications of the ACM, Vol. 29, No. 2, Feb. 1986, p. 87.

accepted definition of speedup = (time for fastest serial algorithm) / (time for parallel algorithm)

For massively parallel machines (typically over a thousand processors), they argued one should scale the problem size as one increases the number of processors. Sandia's scaled speedup is the same as the accepted definition except that the problem size per processor is fixed. In the accepted definition, the same fixed problem size is run on both the serial processor and the P processors. Many have accepted the Sandia group's argument that for massively parallel machines the speedup should be defined with the problem size per processor fixed. The Sandia group's paper is well worth reading.

Performance of Massively Parallel Machines

With the arrival of massively parallel machines in the last couple of years, there is a need to evaluate such machines by benchmarks on problems that make sense. The problem size and rules for the Standard LINPACK benchmark we discussed before do not permit massively parallel computers to demonstrate their potential performance. The basic flaw with the Standard LINPACK benchmark is that solving 100 equations is too small a problem. To provide a forum for comparing such machines, Jack Dongarra [Dongarra, 1992] proposed a new benchmark. This benchmark involves solving a system of linear equations, as did LINPACK, where the problem size is allowed to increase (as argued by the Sandia group) and the performance numbers reflect the largest problem run on the machine. Dongarra's parameters are based on Hockney's r∞ and n1/2 parameters, but they are actually defined and measured differently. Instead of the vector length, Dongarra's n1/2 is based on the problem size which gives half performance. The definitions for the column headings in Figure 2.10 are the following:

r_max - the performance in GFLOPS for the largest problem run on the machine.
n_max - the size of the largest problem run on the machine.
n1/2 - the problem size where half of the n_max execution rate is achieved.
r_peak - the theoretical peak performance in GFLOPS.

[Table: cycle time, number of processors, r_max, n_max, n1/2, and r_peak for the NEC SX-3/44 (2.9 ns), Intel Delta (40 MHz), Cray Y-MP C90 (4.2 ns), TMC CM-200, TMC CM-2, Alliant Campus, Intel iPSC, nCUBE 2 (20 MHz), and MasPar MP-1]

Fig. 2.10 Results from Dongarra's Benchmark for Massively Parallel Computers [21]

16. Thinking Machines, Co. (TMC) CM-200 is an SIMD machine.
17. The CM-200 really has 65,536 one-bit processors. However, 32 processors can access a Weitek floating-point processor. Therefore, one way to view the machine is as 2048 floating-point processors.
18. Thinking Machines, Co. (TMC) CM-2 is an SIMD machine.
19. The CM-2 really has 65,536 one-bit processors. However, 32 processors can access a Weitek floating-point processor. Therefore, one way to view the machine is as 2048 floating-point processors.
20. MasPar MP-1 is an SIMD machine.
21. Dongarra, Jack J., "LINPACK Benchmark: Performance of Various Computers Using Standard Linear Equations Software," Supercomputing Review, Vol. 5, No. 3, March 1992.

With these data, we can compare massively parallel computers. We desire a high r_max and a large n_max. Since the value of n_max is limited by the available memory size, it is an indicator of the size of the memory and the effectiveness of any memory hierarchy. We can use Dongarra's n1/2, i.e., the size of the problem where half of the n_max execution rate is achieved, much like Hockney's n1/2. The Cray Y-MP C90's n1/2 of 650 means that many more problems can be solved efficiently on it than, for example, on the CM-2 with its much larger n1/2.

2.7 Limits to Performance

Many of us are amazed at the performance of today's supercomputers. Today's fastest computers can compute 20 GFLOPS, or 20 billion floating-point operations per second. The supercomputing research community is planning the designs of TeraFLOPS computers, or ones that can compute at a trillion FLOPS. We wonder if there are any limits to achieving higher and higher performance. In this section, we will first discuss several physical limitations, then later, algorithmic limitations.

Physical Limits

One limitation on the design of computers is the speed of light. Light travels 30 centimeters, or 11.8 inches, in one nanosecond. Since the clock speed of current supercomputers is only a couple of nanoseconds, e.g., the NEC SX3/44's clock is 2.9 nanoseconds, all the wires must be short and the components must be physically close together, i.e., within a matter of inches, which implies a dense packing of the circuitry. As logic gates are forced to switch faster, they require more energy to switch. Dense packing of circuits and more switching energy imply a lot of energy dissipated in the form of heat in a concentrated volume. Therefore, how to cool supercomputers is a major engineering design problem. Some machines use water for cooling while others, such as the Cray-1, used Freon, which is commonly found in refrigerators. Observe that the cooling problem exists for both fast scalar processors and parallel machines.

Another limitation is the clock speed of integrated circuit chips. This especially limits how fast one can build a scalar processor. For higher performance, designers are forced to use parallel processors. In 1990, Harold Stone stated the following:


More information

Algebra II. In this technological age, mathematics is more important than ever. When students

Algebra II. In this technological age, mathematics is more important than ever. When students In this technological age, mathematics is more important than ever. When students leave school, they are more and more likely to use mathematics in their work and everyday lives operating computer equipment,

More information

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Outline A few words on MD applications and the GROMACS package The main work in an MD simulation Parallelization Stream computing

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Unit 3 NOTES Honors Math 2 21

Unit 3 NOTES Honors Math 2 21 Unit 3 NOTES Honors Math 2 21 Warm Up: Exponential Regression Day 8: Point Ratio Form When handed to you at the drive-thru window, a cup of coffee was 200 o F. Some data has been collected about how the

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Formal verification of IA-64 division algorithms

Formal verification of IA-64 division algorithms Formal verification of IA-64 division algorithms 1 Formal verification of IA-64 division algorithms John Harrison Intel Corporation IA-64 overview HOL Light overview IEEE correctness Division on IA-64

More information

Econ A Math Survival Guide

Econ A Math Survival Guide Econ 101 - A Math Survival Guide T.J. Rakitan Teaching Assistant, Summer Session II, 013 1 Introduction The introductory study of economics relies on two major ways of thinking about the world: qualitative

More information

2 The Way Science Works

2 The Way Science Works CHAPTER 1 Introduction to Science 2 The Way Science Works SECTION KEY IDEAS As you read this section, keep these questions in mind: How can you use critical thinking to solve problems? What are scientific

More information

Performance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Performance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Performance Metrics & Architectural Adaptivity ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So What are the Options? Power Consumption Activity factor (amount of circuit switching) Load Capacitance (size

More information

CSE 241 Class 1. Jeremy Buhler. August 24,

CSE 241 Class 1. Jeremy Buhler. August 24, CSE 41 Class 1 Jeremy Buhler August 4, 015 Before class, write URL on board: http://classes.engineering.wustl.edu/cse41/. Also: Jeremy Buhler, Office: Jolley 506, 314-935-6180 1 Welcome and Introduction

More information

CS 100: Parallel Computing

CS 100: Parallel Computing CS 100: Parallel Computing Chris Kauffman Week 12 Logistics Upcoming HW 5: Due Friday by 11:59pm HW 6: Up by Early Next Week Moore s Law: CPUs get faster Smaller transistors closer together Smaller transistors

More information

Models: Amdahl s Law, PRAM, α-β Tal Ben-Nun

Models: Amdahl s Law, PRAM, α-β Tal Ben-Nun spcl.inf.ethz.ch @spcl_eth Models: Amdahl s Law, PRAM, α-β Tal Ben-Nun Design of Parallel and High-Performance Computing Fall 2017 DPHPC Overview cache coherency memory models 2 Speedup An application

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

COPYRIGHT NOTICE: L. Ridgway Scott, Terry Clark, and Babak Bagheri: Scientific Parallel Computing

COPYRIGHT NOTICE: L. Ridgway Scott, Terry Clark, and Babak Bagheri: Scientific Parallel Computing COPYRIGHT NOTICE: L. Ridgway Scott, Terry Clark, and Babak Bagheri: Scientific Parallel Computing is published by Princeton University Press and copyrighted, 2005, by Princeton University Press. All rights

More information

EECS150 - Digital Design Lecture 27 - misc2

EECS150 - Digital Design Lecture 27 - misc2 EECS150 - Digital Design Lecture 27 - misc2 May 1, 2002 John Wawrzynek Spring 2002 EECS150 - Lec27-misc2 Page 1 Outline Linear Feedback Shift Registers Theory and practice Simple hardware division algorithms

More information

Introduction. An Introduction to Algorithms and Data Structures

Introduction. An Introduction to Algorithms and Data Structures Introduction An Introduction to Algorithms and Data Structures Overview Aims This course is an introduction to the design, analysis and wide variety of algorithms (a topic often called Algorithmics ).

More information

Practical Atmospheric Analysis

Practical Atmospheric Analysis Chapter 12 Practical Atmospheric Analysis With the ready availability of computer forecast models and statistical forecast data, it is very easy to prepare a forecast without ever looking at actual observations,

More information

Unit 2 Modeling with Exponential and Logarithmic Functions

Unit 2 Modeling with Exponential and Logarithmic Functions Name: Period: Unit 2 Modeling with Exponential and Logarithmic Functions 1 2 Investigation : Exponential Growth & Decay Materials Needed: Graphing Calculator (to serve as a random number generator) To

More information

COMPUTER SCIENCE TRIPOS

COMPUTER SCIENCE TRIPOS CST0.2017.2.1 COMPUTER SCIENCE TRIPOS Part IA Thursday 8 June 2017 1.30 to 4.30 COMPUTER SCIENCE Paper 2 Answer one question from each of Sections A, B and C, and two questions from Section D. Submit the

More information

Chapter 1 :: From Zero to One

Chapter 1 :: From Zero to One Chapter 1 :: From Zero to One Digital Design and Computer Architecture David Money Harris and Sarah L. Harris Copyright 2007 Elsevier 1- Chapter 1 :: Topics Background The Game Plan The Art of Managing

More information

Intro To Digital Logic

Intro To Digital Logic Intro To Digital Logic 1 Announcements... Project 2.2 out But delayed till after the midterm Midterm in a week Covers up to last lecture + next week's homework & lab Nick goes "H-Bomb of Justice" About

More information

Deutscher Wetterdienst

Deutscher Wetterdienst Deutscher Wetterdienst The Enhanced DWD-RAPS Suite Testing Computers, Compilers and More? Ulrich Schättler, Florian Prill, Harald Anlauf Deutscher Wetterdienst Research and Development Deutscher Wetterdienst

More information

Adders, subtractors comparators, multipliers and other ALU elements

Adders, subtractors comparators, multipliers and other ALU elements CSE4: Components and Design Techniques for Digital Systems Adders, subtractors comparators, multipliers and other ALU elements Instructor: Mohsen Imani UC San Diego Slides from: Prof.Tajana Simunic Rosing

More information

Table of Content. Chapter 11 Dedicated Microprocessors Page 1 of 25

Table of Content. Chapter 11 Dedicated Microprocessors Page 1 of 25 Chapter 11 Dedicated Microprocessors Page 1 of 25 Table of Content Table of Content... 1 11 Dedicated Microprocessors... 2 11.1 Manual Construction of a Dedicated Microprocessor... 3 11.2 FSM + D Model

More information

a. If the vehicle loses 12% of its value annually, it keeps 100% - 12% =? % of its value. Because each year s value is a constant multiple of

a. If the vehicle loses 12% of its value annually, it keeps 100% - 12% =? % of its value. Because each year s value is a constant multiple of Lesson 9-2 Lesson 9-2 Exponential Decay Vocabulary exponential decay depreciation half-life BIG IDEA When the constant growth factor in a situation is between 0 and 1, exponential decay occurs. In each

More information

The Performance Evolution of the Parallel Ocean Program on the Cray X1

The Performance Evolution of the Parallel Ocean Program on the Cray X1 The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott

More information

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1> Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building

More information

Lesson 8: Using If-Then Moves in Solving Equations

Lesson 8: Using If-Then Moves in Solving Equations Student Outcomes Students understand and use the addition, subtraction, multiplication, division, and substitution properties of equality to solve word problems leading to equations of the form and where,,

More information

Chapter 13 - Inverse Functions

Chapter 13 - Inverse Functions Chapter 13 - Inverse Functions In the second part of this book on Calculus, we shall be devoting our study to another type of function, the exponential function and its close relative the Sine function.

More information

Parallel Performance Theory

Parallel Performance Theory AMS 250: An Introduction to High Performance Computing Parallel Performance Theory Shawfeng Dong shaw@ucsc.edu (831) 502-7743 Applied Mathematics & Statistics University of California, Santa Cruz Outline

More information

Efficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation

Efficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation Efficiency of Dynamic Load Balancing Based on Permanent Cells for Parallel Molecular Dynamics Simulation Ryoko Hayashi and Susumu Horiguchi School of Information Science, Japan Advanced Institute of Science

More information

Notation. Bounds on Speedup. Parallel Processing. CS575 Parallel Processing

Notation. Bounds on Speedup. Parallel Processing. CS575 Parallel Processing Parallel Processing CS575 Parallel Processing Lecture five: Efficiency Wim Bohm, Colorado State University Some material from Speedup vs Efficiency in Parallel Systems - Eager, Zahorjan and Lazowska IEEE

More information

Mathematics Level D: Lesson 2 Representations of a Line

Mathematics Level D: Lesson 2 Representations of a Line Mathematics Level D: Lesson 2 Representations of a Line Targeted Student Outcomes Students graph a line specified by a linear function. Students graph a line specified by an initial value and rate of change

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Digital Logic Our goal for the next few weeks is to paint a a reasonably complete picture of how we can go from transistor

More information

TDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II

TDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II TDDB68 Concurrent programming and operating systems Lecture: CPU Scheduling II Mikael Asplund, Senior Lecturer Real-time Systems Laboratory Department of Computer and Information Science Copyright Notice:

More information

Algebra II. Slide 1 / 261. Slide 2 / 261. Slide 3 / 261. Linear, Exponential and Logarithmic Functions. Table of Contents

Algebra II. Slide 1 / 261. Slide 2 / 261. Slide 3 / 261. Linear, Exponential and Logarithmic Functions. Table of Contents Slide 1 / 261 Algebra II Slide 2 / 261 Linear, Exponential and 2015-04-21 www.njctl.org Table of Contents click on the topic to go to that section Slide 3 / 261 Linear Functions Exponential Functions Properties

More information

Unit 8: Exponential & Logarithmic Functions

Unit 8: Exponential & Logarithmic Functions Date Period Unit 8: Eponential & Logarithmic Functions DAY TOPIC ASSIGNMENT 1 8.1 Eponential Growth Pg 47 48 #1 15 odd; 6, 54, 55 8.1 Eponential Decay Pg 47 48 #16 all; 5 1 odd; 5, 7 4 all; 45 5 all 4

More information

Analysis of Algorithms

Analysis of Algorithms Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and M. H. Goldwasser, Wiley, 2014 Analysis of Algorithms Input Algorithm Analysis

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

Update on Cray Earth Sciences Segment Activities and Roadmap

Update on Cray Earth Sciences Segment Activities and Roadmap Update on Cray Earth Sciences Segment Activities and Roadmap 31 Oct 2006 12 th ECMWF Workshop on Use of HPC in Meteorology Per Nyberg Director, Marketing and Business Development Earth Sciences Segment

More information

Design of Sequential Circuits

Design of Sequential Circuits Design of Sequential Circuits Seven Steps: Construct a state diagram (showing contents of flip flop and inputs with next state) Assign letter variables to each flip flop and each input and output variable

More information

ALU A functional unit

ALU A functional unit ALU A functional unit that performs arithmetic operations such as ADD, SUB, MPY logical operations such as AND, OR, XOR, NOT on given data types: 8-,16-,32-, or 64-bit values A n-1 A n-2... A 1 A 0 B n-1

More information

Swedish Meteorological and Hydrological Institute

Swedish Meteorological and Hydrological Institute Swedish Meteorological and Hydrological Institute Norrköping, Sweden 1. Summary of highlights HIRLAM at SMHI is run on a CRAY T3E with 272 PEs at the National Supercomputer Centre (NSC) organised together

More information

CMP N 301 Computer Architecture. Appendix C

CMP N 301 Computer Architecture. Appendix C CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)

More information

Knott, M. May Future t r e n d s

Knott, M. May Future t r e n d s 0.S'T 1 Knott, M. May 1 9 9 0 M. K r a h e r, and F. Lenkszus A P S CONTROL SYSTEM OPERATING SYSTEM CHOICE Contents: Introduction What i s t h e o p e r a t i n g system? A P S c o n t r o l system a r

More information

Prime Clocks. Michael Stephen Fiske. 10th GI Conference on Autonomous Systems October 23, AEMEA Institute. San Francisco, California

Prime Clocks. Michael Stephen Fiske. 10th GI Conference on Autonomous Systems October 23, AEMEA Institute. San Francisco, California Prime Clocks Michael Stephen Fiske 10th GI Conference on Autonomous Systems October 23, 2017 AEMEA Institute San Francisco, California Motivation for Prime Clocks The mindset of mainstream computer science

More information

CMOS Ising Computer to Help Optimize Social Infrastructure Systems

CMOS Ising Computer to Help Optimize Social Infrastructure Systems FEATURED ARTICLES Taking on Future Social Issues through Open Innovation Information Science for Greater Industrial Efficiency CMOS Ising Computer to Help Optimize Social Infrastructure Systems As the

More information

Massachusetts Tests for Educator Licensure (MTEL )

Massachusetts Tests for Educator Licensure (MTEL ) Massachusetts Tests for Educator Licensure (MTEL ) BOOKLET 2 Mathematics Subtest Copyright 2010 Pearson Education, Inc. or its affiliate(s). All rights reserved. Evaluation Systems, Pearson, P.O. Box 226,

More information

Unit 3: Linear and Exponential Functions

Unit 3: Linear and Exponential Functions Unit 3: Linear and Exponential Functions In Unit 3, students will learn function notation and develop the concepts of domain and range. They will discover that functions can be combined in ways similar

More information

Introduction to Computer Science and Programming for Astronomers

Introduction to Computer Science and Programming for Astronomers Introduction to Computer Science and Programming for Astronomers Lecture 8. István Szapudi Institute for Astronomy University of Hawaii March 7, 2018 Outline Reminder 1 Reminder 2 3 4 Reminder We have

More information

Performance and Scalability. Lars Karlsson

Performance and Scalability. Lars Karlsson Performance and Scalability Lars Karlsson Outline Complexity analysis Runtime, speedup, efficiency Amdahl s Law and scalability Cost and overhead Cost optimality Iso-efficiency function Case study: matrix

More information

Chap 4. Software Reliability

Chap 4. Software Reliability Chap 4. Software Reliability 4.2 Reliability Growth 1. Introduction 2. Reliability Growth Models 3. The Basic Execution Model 4. Calendar Time Computation 5. Reliability Demonstration Testing 1. Introduction

More information

Appendix A. Linear Relationships in the Real World Unit

Appendix A. Linear Relationships in the Real World Unit Appendix A The Earth is like a giant greenhouse. The sun s energy passes through the atmosphere and heats up the land. Some of the heat escapes back into space while some of it is reflected back towards

More information

ECE321 Electronics I

ECE321 Electronics I ECE321 Electronics I Lecture 1: Introduction to Digital Electronics Payman Zarkesh-Ha Office: ECE Bldg. 230B Office hours: Tuesday 2:00-3:00PM or by appointment E-mail: payman@ece.unm.edu Slide: 1 Textbook

More information

Chapter 2 - Lessons 1 & 2 Studying Geography, Economics

Chapter 2 - Lessons 1 & 2 Studying Geography, Economics Chapter 2 - Lessons 1 & 2 Studying Geography, Economics How does geography influence the way people live? Why do people trade? Why do people form governments? Lesson 1 - How Does Geography Influence the

More information