Computer Architecture. ESE 345 Computer Architecture. Performance and Energy Consumption. CA: Performance and Energy

Gary Taylor
5 years ago

1 Computer Architecture ESE 345 Computer Architecture Performance and Energy Consumption 1

2 Two Notions of Performance
Plane / DC to Paris / Top Speed / Passengers / Throughput (pmph)
Boeing 747: 6.5 hours, 610 mph, throughput ...,700 pmph
BAD/Sud Concorde: ... hours, ...,200 mph
Which has higher performance?
Time to deliver 1 passenger? Time to deliver 400 passengers?
In a computer, the time for 1 job is called Response Time or Execution Time.
In a computer, the number of jobs per day is called Throughput or Bandwidth.

3 Definition: Performance
Performance is in units of things per second; bigger is better.
If we are primarily concerned with response time:
$\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
"X is n times faster than Y" means
$n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)} = \dfrac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$

4 What is Time? Straightforward definition of time: Total time to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead,... real time, response time or elapsed time Alternative: just time processor (CPU) is working only on your program (since multiple processes running at same time) CPU execution time or CPU time Often divided into system CPU time (in OS) and user CPU time (in user program) 4

5 Analyze the Right Measurement! 5

6 How to Measure Time?
User Time: measured in seconds.
CPU Time: computers are constructed using a clock that runs at a constant rate.
These discrete time intervals are called clock cycles (or informally clocks or cycles).
Length of clock period: clock cycle time (e.g., 250 picoseconds or 250 ps) and clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse of the clock period; use these!

7 Measuring Time using Clock Cycles (1/2)
CPU execution time for program = Clock Cycles for a program x Clock Cycle Time
or = Clock Cycles for a program / Clock Rate

8 Measuring Time using Clock Cycles (2/2)
One way to define clock cycles:
Clock Cycles for program = Instructions for a program (called "Instruction Count") x Average Clock cycles Per Instruction (abbreviated "CPI")
CPI is one way to compare two machines with the same instruction set, since Instruction Count would be the same.

9 Performance Calculation (1/2) CPU execution time for program = Clock Cycles for program x Clock Cycle Time Substituting for clock cycles: CPU execution time for program = (Instruction Count x CPI) x Clock Cycle Time = Instruction Count x CPI x Clock Cycle Time 9
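The equation on this slide can be sketched in a few lines of Python. The function name and the sample numbers (2 billion instructions, CPI of 1.5, a 4 GHz clock) are hypothetical and chosen only to illustrate the arithmetic.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction Count x CPI x Clock Cycle Time."""
    clock_cycle_time = 1.0 / clock_rate_hz  # seconds per cycle
    clock_cycles = instruction_count * cpi  # total cycles for the program
    return clock_cycles * clock_cycle_time


# Hypothetical program: 2e9 instructions, average CPI 1.5, 4 GHz clock.
example_time = cpu_time_seconds(2_000_000_000, 1.5, 4e9)
print(f"CPU time: {example_time:.3f} s")
```

With these assumed inputs the program needs 3 billion cycles at 250 ps each, so the result is about 0.75 seconds.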

10 CPU Performance Law The Processor Performance Equation Principles 10

11 How Calculate the 3 Components?
Clock Cycle Time: in specification of computer (Clock Rate in advertisements)
Instruction Count: count instructions in loop of small program; use simulator to count instructions; hardware counter in special register (most CPUs)
CPI: calculate as (Execution Time / Clock Cycle Time) / Instruction Count; hardware counter in special register (most CPUs)

12 Calculating CPI Another Way First calculate CPI for each individual instruction (add, sub, and, etc.) Next calculate frequency of each individual instruction Finally multiply these two for each instruction and add them up to get final CPI 12

13 Principles of Computer Design Different instruction types having different CPIs Principles 13

14 Example
Op       Freq_i   CPI_i   Prod   (% Time)
ALU      50%      1       .5     (33%)
Load     20%      2       .4     (27%)
Store    10%      2       .2     (13%)
Branch   20%      2       .4     (27%)
Total                     1.5
The Freq column is the instruction mix; the % Time column shows where time is spent.
What if Branch instructions were twice as fast?
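A minimal sketch of the weighted-CPI calculation for this example. It assumes the table entries read as CPI 1 for ALU and CPI 2 for Load, Store, and Branch (which reproduces the products .5, .4, .2, .4 and the total of 1.5), and it interprets "twice as fast" as halving the Branch CPI from 2 to 1. The names `mix`, `average_cpi`, and `time_share` are illustrative.

```python
# Instruction mix: op -> (frequency, CPI), as read from the example table.
mix = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 2.0),
    "Store": (0.10, 2.0),
    "Branch": (0.20, 2.0),
}


def average_cpi(instruction_mix):
    """Sum of frequency x CPI over all instruction types."""
    return sum(freq * cpi for freq, cpi in instruction_mix.values())


def time_share(instruction_mix):
    """Fraction of execution time spent in each instruction type."""
    total = average_cpi(instruction_mix)
    return {op: freq * cpi / total for op, (freq, cpi) in instruction_mix.items()}


base_cpi = average_cpi(mix)

# Assumption: "twice as fast" means the Branch CPI drops from 2 to 1.
faster = dict(mix, Branch=(0.20, 1.0))
new_cpi = average_cpi(faster)

print(f"base CPI {base_cpi:.2f}, new CPI {new_cpi:.2f}, "
      f"speedup {base_cpi / new_cpi:.3f}")
```

Under these assumptions the average CPI falls from 1.5 to 1.3, a speedup of roughly 1.15 with instruction count and clock rate unchanged.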

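15 Processor Performance Equation
$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$
The three factors are instruction count, CPI, and cycle time.
What affects each factor (columns: Inst Count, CPI, Clock Rate):
Program: X
Compiler: X (X)
Inst. Set.: X X X
Organization: X X
Technology: X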

16 What Programs Measure for Comparison?
Ideally, run typical programs with typical input before purchase, or before even building the machine.
This is called a "workload". For example:
Engineer uses compiler, spreadsheet
Author uses word processor, drawing program, compression software
In some situations it's hard to do:
Don't have access to machine to "benchmark" before purchase
Don't know workload in future

17 Benchmarks Obviously, apparent speed of processor depends on code used to test it Need industry standards so that different processors can be fairly compared Companies exist that create these benchmarks: typical code used to evaluate systems Need to be changed every 2 or 3 years since designers could target these standard benchmarks 17

18 Example Standardized Workload Benchmarks
Workstations: Standard Performance Evaluation Corporation (SPEC)
SPEC95: 8 integer (gcc, compress, li, ijpeg, perl, ...) & 10 floating-point (FP) programs (hydro2d, mgrid, applu, turbo3d, ...)
SPEC2000: 12 integer (gcc, bzip2, ...), 14 FP (mgrid, swim, fma3d, ...)
Separate average for integer and FP
Benchmarks distributed in source code
Company representatives select workload
Compiler and machine designers target benchmarks, so try to change every 3 years

19 SPEC CPU Benchmark Generations Measuring Performance 19

20 How Summarize Suite Performance (1/4)
Arithmetic average of execution time of all programs?
But they vary by 4X in speed, so some would be more important than others in an arithmetic average.
Could add a weight per program, but how to pick the weights?
Different companies want different weights for their products.
SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance:
SPECRatio = time on reference computer / time on computer being rated

21 How Summarize Suite Performance (2/4)
If a program's SPECRatio on Computer A is 1.25 times bigger than on Computer B, then
$1.25 = \dfrac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \dfrac{\text{ExecutionTime}_{reference} / \text{ExecutionTime}_A}{\text{ExecutionTime}_{reference} / \text{ExecutionTime}_B} = \dfrac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \dfrac{\text{Performance}_A}{\text{Performance}_B}$
Note that when comparing 2 computers as a ratio, execution times on the reference computer drop out, so choice of reference computer is irrelevant.

22 How Summarize Suite Performance (3/4)
Since these are ratios, the proper mean is the geometric mean (SPECRatio is unitless, so arithmetic mean is meaningless):
$\text{GeometricMean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio}_i}$
1. Geometric mean of the ratios is the same as the ratio of the geometric means
2. Ratio of geometric means = geometric mean of performance ratios, so choice of reference computer is irrelevant!
These two points make geometric mean of ratios attractive to summarize performance.
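The two properties of the geometric mean can be checked numerically. This is a sketch with made-up execution times for three programs on a reference machine and on two hypothetical machines A and B; none of the numbers come from SPEC.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(reference_times, machine_times):
    """SPECRatio per program = reference time / time on rated machine."""
    return [ref / t for ref, t in zip(reference_times, machine_times)]


# Hypothetical execution times in seconds for three programs.
reference = [100.0, 200.0, 400.0]
machine_a = [10.0, 40.0, 50.0]
machine_b = [20.0, 25.0, 100.0]

gm_a = geometric_mean(spec_ratios(reference, machine_a))
gm_b = geometric_mean(spec_ratios(reference, machine_b))
ratio_of_gms = gm_a / gm_b

# Per-program performance ratio of A over B (reference times cancel).
gm_of_ratios = geometric_mean([tb / ta for ta, tb in zip(machine_a, machine_b)])

# A different (also hypothetical) reference machine gives the same comparison.
other_reference = [7.0, 900.0, 33.0]
ratio_other_ref = (geometric_mean(spec_ratios(other_reference, machine_a))
                   / geometric_mean(spec_ratios(other_reference, machine_b)))

print(ratio_of_gms, gm_of_ratios, ratio_other_ref)
```

All three printed values agree up to floating-point rounding, which is the point of the slide: the comparison between A and B does not depend on the reference machine.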

23 How Summarize Suite Performance (4/4)
Does a single mean summarize the performance of the programs in a benchmark suite well?
Can decide if the mean is a good predictor by characterizing variability of the distribution using standard deviation.
Like geometric mean, geometric standard deviation is multiplicative rather than arithmetic.
Can simply take the logarithm of SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back.
The geometric standard deviation, denoted by $\sigma_g$, is calculated as follows:
$\log \sigma_g = \left[ \dfrac{1}{n} \sum_{i=1}^{n} (\log x_i - \log G)^2 \right]^{1/2}$
where $G = \sqrt[n]{x_1 x_2 \cdots x_n}$ is the geometric mean of SPECRatios $(x_1, \ldots, x_n)$.

24 How Summarize Suite Performance (5/5)
Standard deviation is more informative if we know the distribution has a standard form:
bell-shaped normal distribution, whose data are symmetric around the mean
lognormal distribution, where logarithms of data (not the data itself) are normally distributed (symmetric) on a logarithmic scale
For a lognormal distribution, we expect that
68% of samples fall in range [mean / gstdev, mean x gstdev]
95% of samples fall in range [mean / gstdev², mean x gstdev²]
Note: Excel provides functions EXP(), LN(), and STDEV() that make calculating geometric mean and multiplicative standard deviation easy.
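The log, mean and standard deviation, exponent recipe can be sketched as follows. The sample ratios are invented for illustration. The sketch uses the population standard deviation (dividing by n), matching the 1/n in the formula for the geometric standard deviation; a spreadsheet STDEV() typically divides by n - 1 and would give a slightly larger value.

```python
import math
import statistics


def geometric_stats(ratios):
    """Return (geometric mean, geometric standard deviation)."""
    logs = [math.log(x) for x in ratios]
    gm = math.exp(statistics.mean(logs))
    # pstdev divides by n, as in the slide's formula for log(sigma_g).
    gsd = math.exp(statistics.pstdev(logs))
    return gm, gsd


def one_stdev_range(ratios):
    """Multiplicative one-standard-deviation range [GM / GSD, GM * GSD]."""
    gm, gsd = geometric_stats(ratios)
    return gm / gsd, gm * gsd


# Hypothetical SPECRatios, not taken from any published result.
sample = [1.0, 4.0, 16.0]
gm, gsd = geometric_stats(sample)
low, high = one_stdev_range(sample)
inside = [x for x in sample if low <= x <= high]
print(f"GM={gm:.3f} GSD={gsd:.3f} range=[{low:.3f}, {high:.3f}] inside={inside}")
```

Counting how many ratios fall inside the range, as the last lines do, is the same check the following slides apply to the Itanium 2 and Athlon results.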

25 Example Standard Deviation (1/3)
GM and multiplicative StDev of SPECfp2000 for Itanium 2
[Figure: SPECfpRatio per benchmark (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi), marking those outside 1 StDev; GM = 2712, GStDev = 1.98]
Itanium 2 is 2712/100 times as fast as Sun Ultra 5 (GM), and the range within 1 Std. Deviation is [13.72, 53.62]

26 Example Standard Deviation (2/3)
GM and multiplicative StDev of SPECfp2000 for AMD Athlon
[Figure: SPECfpRatio per benchmark (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi), marking those outside 1 StDev; GM = 2086, GStDev = 1.40]
Athlon is 2086/100 times as fast as Sun Ultra 5 (GM), and the range within 1 Std. Deviation is [14.94, 29.11]

27 Example Standard Deviation (3/3)
GM and StDev, Itanium 2 v. Athlon
[Figure: ratio Itanium 2 v. Athlon for SPECfp2000 per benchmark (wupwise, swim, mgrid, applu, mesa, galgel, art, equake, facerec, ammp, lucas, fma3d, sixtrack, apsi), marking those outside 1 StDev; GM = 1.30, GStDev = 1.74]
Ratio of execution times (At/It) = ratio of SPECratios (It/At)
Itanium 2 is 1.30X Athlon (GM); 1 St.Dev. range is [0.75, 2.27]

28 Comments on Itanium 2 and Athlon
Standard deviation of 1.98 for Itanium 2 is much higher (vs. 1.40 for Athlon), so results will differ more widely from the mean and are therefore likely less predictable.
Falling within one standard deviation: 10 of 14 benchmarks (71%) for Itanium 2; 11 of 14 benchmarks (79%) for Athlon.
Thus, the results are quite compatible with a lognormal distribution (expect 68%).
Itanium 2 vs. Athlon St.Dev is 1.74, which is high, so there is less confidence in the claim that Itanium 2 is 1.30 times as fast as Athlon.
Indeed, Athlon is faster on 6 of 14 programs.
Range is [0.75, 2.27] with 11/14 inside 1 StDev (79%).

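29 Amdahl's Law
$\text{ExTime}_{\text{w/ Enh.}} = \text{ExTime}_{\text{w/o Enh.}} \times \left[ (1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$
$\text{Speedup}_{overall} = \dfrac{\text{ExTime}_{\text{w/o Enh.}}}{\text{ExTime}_{\text{w/ Enh.}}} = \dfrac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$
Best you could ever hope to do:
$\text{Speedup}_{maximum} = \dfrac{1}{1 - \text{Fraction}_{enhanced}}$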

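30 Amdahl's Law Example
New CPU is 10X faster.
I/O bound server, so 60% of time is spent waiting for I/O; only the remaining 40% is enhanced.
$\text{Speedup}_{overall} = \dfrac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}} = \dfrac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \dfrac{1}{0.64} \approx 1.56$
Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective that it's just 1.6X faster.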
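A small sketch of Amdahl's Law applied to this example. The function names are illustrative; the inputs (40% of time enhanced, 10X enhancement) come from the slide.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced):
    """Upper bound on speedup as the enhancement becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


# Server spends 60% of its time waiting for I/O, so 40% benefits from the CPU.
overall = amdahl_speedup(0.4, 10)
ceiling = amdahl_limit(0.4)
print(f"overall speedup {overall:.2f}, upper bound {ceiling:.2f}")
```

Even an infinitely fast CPU could not push this server past about 1.67X, because the I/O wait is untouched.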

31 Question
$\text{Speedup} = \dfrac{1}{(1 - F) + \dfrac{F}{S}}$
Question: Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?
(A) (B) (C) (D) None of the above
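One way to approach the question is to substitute F = 0.8 and the target speedup of 5 into the formula and solve for S. The derivation below is a worked sketch, not the slide's own solution.

```latex
5 = \frac{1}{(1 - 0.8) + \dfrac{0.8}{S}}
\;\Longrightarrow\;
0.2 + \frac{0.8}{S} = \frac{1}{5} = 0.2
\;\Longrightarrow\;
\frac{0.8}{S} = 0
```

No finite S satisfies the last equation: a speedup of 5 equals the limit 1 / (1 - 0.8), which is approached only as S grows without bound.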

32 Consequence of Amdahl's Law
The amount of speedup that can be achieved through parallelism is limited by the non-parallel portion of your program!
[Figure: Time (parallel portion, serial portion) and Speedup, each plotted against Number of Processors]

33 Parallel Speed-up Examples (1/3)
Speedup w/ E = 1 / [ (1-F) + F/S ]
Consider an enhancement which runs 20 times faster but which is only usable 15% of the time:
Speedup = 1/(.85 + .15/20) = 1.166
What if it's usable 25% of the time?
Speedup = 1/(.75 + .25/20) = 1.311
Nowhere near 20x speedup!
Amdahl's Law tells us that to achieve linear speedup with more processors, none of the original computation can be scalar (non-parallelizable).
To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less:
Speedup = 1/(.001 + .999/100) = 90.99
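The "0.1% or less" figure can be obtained by solving Amdahl's Law for the serial fraction. The sketch below does that algebra in code; the helper names are hypothetical, and S is taken to be the number of processors.

```python
def parallel_speedup(parallel_fraction, processors):
    """Amdahl's Law with the enhancement being `processors`-way parallelism."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def max_serial_fraction(target_speedup, processors):
    """Largest serial fraction s such that 1 / (s + (1 - s) / N) >= target.

    Solving s + (1 - s) / N = 1 / T for s gives (N / T - 1) / (N - 1).
    """
    return (processors / target_speedup - 1.0) / (processors - 1.0)


serial_limit = max_serial_fraction(90, 100)
achieved = parallel_speedup(1.0 - serial_limit, 100)
print(f"serial fraction must be at most {serial_limit:.4%}; "
      f"speedup at that fraction is {achieved:.2f}")
```

The exact bound comes out at about 0.11%, which the slide rounds to 0.1%.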

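34 Parallel Speed-up Examples (2/3)
$Z_1 + Z_2 + \cdots + Z_{10} + \begin{bmatrix} X_{1,1} & \cdots & X_{1,10} \\ \vdots & \ddots & \vdots \\ X_{10,1} & \cdots & X_{10,10} \end{bmatrix} + \begin{bmatrix} Y_{1,1} & \cdots & Y_{1,10} \\ \vdots & \ddots & \vdots \\ Y_{10,1} & \cdots & Y_{10,10} \end{bmatrix}$
Non-parallel part: 10 "scalar" operations (non-parallelizable)
Parallel part: 100 parallelizable operations (say, element-wise addition of two 10x10 matrices); partition 10 ways and perform on 10 parallel processing units
110 operations in total: 100/110 = .909 parallelizable, 10/110 = .091 scalar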

35 Parallel Speed-up Examples (3/3)
Speedup w/ E = 1 / [ (1-F) + F/S ]
Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors:
Speedup = 1/(.091 + .909/10) = 1/.1819 = 5.5
What if there are 100 processors?
Speedup = 1/(.091 + .909/100) = 1/.10009 = 10.0
What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
Speedup = 1/(.001 + .999/10) = 1/.1009 = 9.9
What if there are 100 processors?
Speedup = 1/(.001 + .999/100) = 1/.01099 = 91
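The same four cases can be reproduced by counting operations directly instead of using fractions. This sketch assumes every addition takes one time unit and that the parallel additions divide evenly among the processors; the function name is illustrative.

```python
def op_count_speedup(scalar_ops, parallel_ops, processors):
    """Speedup = time on one processor / time on `processors` processors."""
    time_single = scalar_ops + parallel_ops
    time_parallel = scalar_ops + parallel_ops / processors
    return time_single / time_parallel


# (scalar adds, parallelizable adds, processors) for the four slide cases.
cases = [
    (10, 100, 10),       # 10x10 matrices, 10 processors
    (10, 100, 100),      # 10x10 matrices, 100 processors
    (10, 10_000, 10),    # 100x100 matrices, 10 processors
    (10, 10_000, 100),   # 100x100 matrices, 100 processors
]
results = [op_count_speedup(s, p, n) for s, p, n in cases]
for (s, p, n), speedup in zip(cases, results):
    print(f"{s + p} adds on {n} processors: speedup {speedup:.1f}")
```

The larger problem makes far better use of 100 processors because the fixed 10 scalar additions become a much smaller share of the work.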

36 Strong and Weak Scaling To get good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem Strong scaling: When speedup is achieved on a parallel processor without increasing the size of the problem Weak scaling: When speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors Load balancing is another important factor: every processor doing same amount of work Just 1 unit with twice the load of others cuts speedup almost in half (bottleneck!) 36
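The load-balancing remark can be illustrated with a simple model, assuming the parallel run finishes only when the most heavily loaded processor finishes and ignoring any serial portion. The loads are hypothetical work units.

```python
def speedup_with_loads(loads):
    """Total work divided by the work of the busiest processor."""
    return sum(loads) / max(loads)


# Ten processors with equal work versus one processor carrying double.
balanced = [1.0] * 10
imbalanced = [2.0] + [1.0] * 9

balanced_speedup = speedup_with_loads(balanced)
imbalanced_speedup = speedup_with_loads(imbalanced)
print(f"balanced: {balanced_speedup:.1f}, "
      f"one double-loaded unit: {imbalanced_speedup:.1f}")
```

In this model, p processors with one double-loaded unit give a speedup of (p + 1) / 2, which is why a single bottleneck cuts the speedup almost in half.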

37 Other Performance Metrics
MIPS: Million Instructions Per Second
$\text{MIPS} = \dfrac{\text{Instruction\_count}}{\text{Time(s)} \times 10^{6}}$
MFLOPS: Million Floating-point Operations Per Second
$\text{MFLOPS} = \dfrac{\text{Floating\_point\_ops / program}}{\text{Time(s)} \times 10^{6}}$
PetaFLOPS: 10^15 Floating-point Operations Per Second
$\text{PFLOPS} = \dfrac{\text{Floating\_point\_ops / program}}{\text{Time(s)} \times 10^{15}}$
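A brief sketch of these rate metrics, using invented instruction and operation counts. The function names are illustrative only.

```python
def mips(instruction_count, time_seconds):
    """Million instructions per second."""
    return instruction_count / (time_seconds * 1e6)


def flops_rate(floating_point_ops, time_seconds, scale=1e6):
    """Floating-point rate; scale=1e6 gives MFLOPS, scale=1e15 gives PFLOPS."""
    return floating_point_ops / (time_seconds * scale)


# Hypothetical run: 3e9 instructions, 1.2e9 of them floating-point, in 2 s.
run_mips = mips(3e9, 2.0)
run_mflops = flops_rate(1.2e9, 2.0)
run_pflops = flops_rate(1.2e9, 2.0, scale=1e15)
print(f"{run_mips:.0f} MIPS, {run_mflops:.0f} MFLOPS, {run_pflops:.2e} PFLOPS")
```

Because these are rates rather than times, they are only comparable between machines running the same program with the same instruction or operation counts.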

38 Top 5 supercomputers (TOP500, June 2017)
Columns: Rank, System, Cores, Rmax (TFlop/s), Power (kW)
1 Sunway TaihuLight - Sunway MPP, Sunway SW C 1.45GHz, Sunway, NRCPC; National Supercomputing Center in Wuxi, China. Cores 10,649,600; Rmax 93,...; Power ...,371
2 Tianhe-2 (Milky Way-2) - TH-IVB-FEP Cluster, Intel Xeon E C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P, NUDT; National Super Computer Center in Guangzhou, China. Cores 3,120,000; Rmax 33,...; Power ...,808
3 Piz Daint - Cray XC50, Xeon E5-2690v3 12C ... GHz, Aries interconnect, NVIDIA Tesla P100, Cray Inc.; Swiss National Supercomputing Centre (CSCS), Switzerland. Cores 361,760; Rmax 19,...; Power ...
4 Titan - Cray XK7, Opteron C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x, Cray Inc.; DOE/SC/Oak Ridge National Laboratory, United States. Cores 560,640; Rmax 17,...; Power ...,209
5 Sequoia - BlueGene/Q, Power BQC 16C 1.60 GHz, Custom, IBM; DOE/NNSA/LLNL, United States. Cores 1,572,864; Rmax 17,...; Power ...,890

39 Top 5 supercomputers (TOP500, June 2018)
Columns: Rank, System, Cores, Rmax (TFlop/s), Power (kW)
1 Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM; DOE/SC/Oak Ridge National Laboratory, United States. Cores 2,282,...; Rmax ...; Power ...,806
2 Sunway TaihuLight - Sunway MPP, Sunway SW C 1.45GHz, Sunway, NRCPC; National Supercomputing Center in Wuxi, China. Cores 10,649,600; Rmax 93,...; Power ...,371
3 Sierra - IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM; DOE/NNSA/LLNL, United States. Cores 1,572,480; Rmax 71,610.0; Power ?
4 Tianhe-2A (Milky Way-2A) - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000, NUDT; National Super Computer Center in Guangzhou, China. Cores 4,981,760; Rmax 61,...; Power ...
5 AI Bridging Cloud Infrastructure (ABCI) - PRIMERGY CX2550 M4, Xeon Gold C 2.4GHz, NVIDIA Tesla V100 SXM2, Infiniband EDR, Fujitsu; National Institute of Advanced Industrial Science and Technology (AIST), Japan. Cores ...,680; Rmax 19,...; Power ...,649

40 Top 5 supercomputers (HPCG, June 2018)
The High-Performance Conjugate Gradient (HPCG) results, an alternative (sparse-matrix) computing benchmark
Columns: Rank, System, Cores, Performance (petaflops); the Cores and Performance values are missing
1 Summit, Oak Ridge National Laboratory, U.S.A.
2 Sierra, Lawrence Livermore National Laboratory, U.S.A.
3 K computer, Riken Advanced Institute for Computational Science, Japan
4 Trinity, Los Alamos National Laboratory, U.S.A.
5 Piz Daint, Swiss National Supercomputing Centre, Switzerland

41 Energy and Power
Dynamic energy: transistor switch from 0 -> 1 or 1 -> 0
½ x Capacitive load x Voltage²
Dynamic power:
½ x Capacitive load x Voltage² x Frequency switched
Reducing clock rate reduces power, not energy
Static power consumption:
Current_static x Voltage
Scales with number of transistors
To reduce: power gating

42 Switching Energy: Fundamental Physics
Every logic transition dissipates energy.
$E_{0 \to 1} = \tfrac{1}{2} C V_{dd}^{2}$
$E_{1 \to 0} = \tfrac{1}{2} C V_{dd}^{2}$
Strong result: independent of technology.
How can we limit switching energy?
(1) Reduce # of clock transitions. But we have work to do...
(2) Reduce Vdd. But lowering Vdd limits the clock speed...
(3) Fewer circuits. But more transistors can do more work.
(4) Reduce C per node. One reason why we scale processes.

43 Second Factor: Leakage Currents
Even when a logic gate isn't switching, it burns power.
Isub: Even when this nfet is off, it passes an Ioff leakage current. We can engineer any Ioff we like, but a lower Ioff also results in a lower Ion, and thus a lower maximum clock speed.
Igate: Ideal capacitors have zero DC current. But modern transistor gates are a few atoms thick, and are not ideal.
[Figure: Intel's 2006 processor designs, leakage vs switching power. Bill Holt, Intel, Hot Chips 17.]
A lot of work was done to get a ratio this good... 50/50 is common.

44 Device Engineers Trade Speed and Power
We can reduce CV² (Pactive) by lowering Vdd.
We can increase speed by raising Vdd and lowering Vt.
We can reduce leakage (Pstandby) by raising Vt.
From: Silicon Device Scaling to the Sub-10-nm Regime, Meikei Ieong, Bruce Doris, Jakub Kedzierski, Ken Rim, Min Yang

45 Customize Processes for Product Types... From: Facing the Hot Chips Challenge Again, Bill Holt, Intel, presented at Hot Chips 17,

46 Power
Intel ... consumed ~ 2 W
3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air

47 Example of Quantifying Power
Suppose a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic power?
$\text{Power}_{dynamic} = \tfrac{1}{2} \times \text{CapacitiveLoad} \times \text{Voltage}^{2} \times \text{FrequencySwitched}$
$= \tfrac{1}{2} \times \text{CapacitiveLoad} \times (.85 \times \text{Voltage})^{2} \times .85 \times \text{FrequencySwitched}$
$= (.85)^{3} \times \text{OldPower}_{dynamic} \approx 0.6 \times \text{OldPower}_{dynamic}$
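The scaling argument can be checked numerically. This sketch uses arbitrary baseline values for capacitive load, voltage, and frequency (they cancel in the ratios), and the function names are illustrative.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy per transition: 1/2 x C x V^2."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency_switched):
    """Dynamic power: 1/2 x C x V^2 x f."""
    return dynamic_energy(capacitive_load, voltage) * frequency_switched


# Arbitrary baseline: the absolute values do not matter for the ratios.
C, V, F = 1e-9, 1.0, 3e9

power_ratio = dynamic_power(C, 0.85 * V, 0.85 * F) / dynamic_power(C, V, F)
energy_ratio = dynamic_energy(C, 0.85 * V) / dynamic_energy(C, V)
print(f"power scales by {power_ratio:.3f}, energy per switch by {energy_ratio:.3f}")
```

Power drops to about 61% of its old value because frequency and voltage squared all shrink, while energy per transition depends only on voltage and drops to about 72%.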

48 Acknowledgements
These slides contain material developed and copyrighted by:
Morgan Kaufmann (Elsevier, Inc.)
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)
Justin Hsia (UCB)
