Performance, Power & Energy
ELE8106/ELE6102, Spring 2010
Hayden Kwok-Hay So
(H. So, Sp10, Lecture 3 - ELE8106/6102)

Recall: the goals of this class are Performance, Reconfiguration, and Power/Energy.

What is good performance?
- Latency: the time needed to finish a certain task (or tasks)
- Throughput: the number of tasks finished per unit time

PERFORMANCE EVALUATION

Latency vs Throughput (1)
- Does low latency imply high throughput?
- Does high throughput imply low latency?
- Does high latency imply low throughput?
- Does low throughput imply high latency?

Latency vs Throughput (2)
Computer 1 and Computer 2 must each finish tasks A, B, and C, one after another.

Computer 1: task A takes 15 s, B takes 20 s, C takes 50 s
  Latency = 15 s + 20 s + 50 s = 85 s
  Throughput = 3 / 85 s ≈ 0.035 tasks/s

Computer 2: task A takes 20 s, B takes 25 s, C takes 45 s
  Latency = 20 s + 25 s + 45 s = 90 s
  Throughput = 3 / 90 s ≈ 0.033 tasks/s

Is Computer 1 faster than Computer 2?
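The sequential latency/throughput arithmetic above is easy to reproduce in a few lines of Python (a sketch; the function name `sequential_stats` is just for illustration):

```python
# Sequential execution: latency is the sum of the task times,
# throughput is the number of tasks completed per unit of total time.
def sequential_stats(task_times):
    latency = sum(task_times)               # time until all tasks finish
    throughput = len(task_times) / latency  # tasks per second
    return latency, throughput

lat1, tp1 = sequential_stats([15, 20, 50])  # Computer 1
lat2, tp2 = sequential_stats([20, 25, 45])  # Computer 2
print(lat1, round(tp1, 3))  # 85 0.035
print(lat2, round(tp2, 3))  # 90 0.033
```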
Latency vs Throughput (3)
What if Computer 2 can perform 3 tasks at the same time?

Computer 1 (sequential): A takes 15 s, B takes 20 s, C takes 50 s
  Latency = 15 s + 20 s + 50 s = 85 s
  Throughput = 3 / 85 s ≈ 0.035 tasks/s

Computer 2 (3 tasks in parallel): A takes 20 s, B takes 25 s, C takes 45 s
  Latency = 45 s (the longest task)
  Throughput = 3 / 45 s ≈ 0.067 tasks/s

Is Computer 2 faster than Computer 1?

Latency vs Throughput (4)
What if both Computer 1 and Computer 2 can perform 2 tasks at the same time?

Computer 1: A:15 s, B:20 s, C:50 s
  Latency = 50 s
  Throughput = 3 / 50 s = 0.06 tasks/s

Computer 2: A:20 s, B:25 s, C:45 s
  Latency = 45 s
  Throughput = 3 / 45 s ≈ 0.067 tasks/s

Which computer is faster?

Latency vs Throughput (5)
Both Computer 1 and Computer 2 can perform 2 tasks at the same time.
Define latency as the time to get the first result.

Computer 1: A:15 s, B:20 s, C:50 s
  First result = 15 s; last result = 50 s
  Throughput = 3 / 50 s = 0.06 tasks/s

Computer 2: A:20 s, B:25 s, C:45 s
  First result = 20 s; last result = 45 s
  Throughput = 3 / 45 s ≈ 0.067 tasks/s

Latency vs Throughput (6)
Both Computer 1 and Computer 2 can perform 2 tasks at the same time.
Tasks = A, B, C, A, B, C

Computer 1: A:15 s, B:20 s, C:50 s
  First result = 15 s; last result = 85 s
  Throughput = 6 / 85 s ≈ 0.07 tasks/s

Computer 2: A:20 s, B:25 s, C:45 s
  First result = 20 s; last result = 90 s
  Throughput = 6 / 90 s ≈ 0.067 tasks/s

Latency vs Throughput Summary
Latency
- Time until the first data/response arrives; time for a task to finish
- Indicates the responsiveness of a system
Throughput
- Sustained rate of task completion
- Matters most when there is a lot of continuous input, especially streaming input
- A long-term efficiency measurement

Latency vs Throughput Summary
- Latency and throughput are each important, in different scenarios
- The two are closely tied to each other, but have no simple fixed relationship
- Many factors affect latency/throughput: data input/workload, scheduling, etc.
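The two-unit schedules above can be sketched with a simple longest-task-first greedy scheduler. This is a minimal illustration (the function name `makespan` is invented here), and the heuristic happens to reproduce the schedules the slides assume; a real scheduler need not behave this way.

```python
import heapq

# Schedule tasks on n identical units, longest task first.
# Each unit is represented by the time at which it becomes free.
def makespan(task_times, n_units):
    units = [0.0] * n_units              # finish time of each unit
    heapq.heapify(units)
    for t in sorted(task_times, reverse=True):
        earliest = heapq.heappop(units)  # unit that frees up first
        heapq.heappush(units, earliest + t)
    return max(units)

print(makespan([15, 20, 50], 2))  # 50.0 -> Computer 1, latency 50 s
print(makespan([20, 25, 45], 2))  # 45.0 -> Computer 2, latency 45 s
```

With the six-task stream of slide (6), `makespan([15, 20, 50] * 2, 2)` gives 85.0 and `makespan([20, 25, 45] * 2, 2)` gives 90.0, matching the last-result times above.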
Performance: task completion
Time to complete 1 task is a good way to measure general-purpose computers.

Time to complete 1 task (latency):

  T_exec = (no. of instrs) × CPI / f_clk

How to improve speed?
- Decrease the number of instructions
- Increase the clock frequency (f_clk)
- Decrease the cycles per instruction (CPI)

Increase clock frequency
- Linear increase in performance
- But heat dissipation has prohibited simple clock-frequency boosts
[Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith]

Improving speed
- The compiler affects the no. of instrs and the CPI
- The (micro)computer architecture affects the CPI and f_clk
NOTE: the number of instructions of a program is closely related to its CPI; CPI changes depending on the application.

Review: CPI vs # of instructions
A program executes the following instruction profile:

  Instruction Type   Number   Clock Cycles
  Add                2000      1
  Multiply           1000      5
  Division            500     20
  Load               1000      8
  Store               500      2

With a clock cycle time of 1 ns, how long does the program take to finish? What is the average CPI of the processor?

  (2000×1 + 1000×5 + 500×20 + 1000×8 + 500×2) × 1 ns = 26,000 cycles × 1 ns = 26 µs
  Avg. CPI = 26,000 / 5,000 = 5.2

Amdahl's Law
The overall speedup due to improving a fraction P of the run time by a speedup of S is:

  Speedup = 1 / ((1 − P) + P/S)

E.g. if P = 0.2 and S = 5, then the overall speedup is 1 / ((1 − 0.2) + 0.2/5) = 1.19.
If the same improvement can be applied to a larger portion, P = 0.9, then the speedup is 1 / ((1 − 0.9) + 0.9/5) = 3.57.

Always optimize for the common case.
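The CPI example and Amdahl's Law can both be checked with a short script (a sketch; `run_time`, `avg_cpi` and `amdahl` are illustrative names, and the profile is the table above):

```python
# Weighted-CPI run-time and Amdahl's-law calculations.
def run_time(profile, cycle_time):
    """profile maps instr type -> (count, cycles per instruction)."""
    cycles = sum(n * cpi for n, cpi in profile.values())
    return cycles * cycle_time

def avg_cpi(profile):
    total_instrs = sum(n for n, _ in profile.values())
    total_cycles = sum(n * cpi for n, cpi in profile.values())
    return total_cycles / total_instrs

profile = {"add": (2000, 1), "mul": (1000, 5), "div": (500, 20),
           "load": (1000, 8), "store": (500, 2)}
print(run_time(profile, 1e-9))  # ~2.6e-05 s = 26 us
print(avg_cpi(profile))         # 5.2

def amdahl(p, s):
    """Overall speedup when a fraction p of the run time is sped up by s."""
    return 1 / ((1 - p) + p / s)

print(round(amdahl(0.2, 5), 2))  # 1.19
print(round(amdahl(0.9, 5), 2))  # 3.57
```

Note how `amdahl(0.9, 5)` dwarfs `amdahl(0.2, 5)`: the size of the optimized fraction matters far more than the local speedup, which is the point of "optimize for the common case".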
Instruction example revisited

  Instruction Type   Number   Clock Cycles
  Add                2000      1
  Multiply           1000      5
  Division            500     20
  Load               1000      8
  Store               500      2

If we can reduce the cycle count of any one instruction type to one tenth, which instruction should we optimize?

Case 1: Optimize Add
  (2000×0.1 + 1000×5 + 500×20 + 1000×8 + 500×2) cycles × 1 ns = 24.2 µs
  (Speedup = 26/24.2 = 1.07)

Case 2: Optimize Load
  (2000×1 + 1000×5 + 500×20 + 1000×0.8 + 500×2) cycles × 1 ns = 18.8 µs
  (Speedup = 26/18.8 = 1.38)

Compiler optimizations
- Decrease the # of instructions
  - E.g. common subexpression elimination
  - E.g. constant propagation
  - (?) Use a function call instead of a macro
- Use less expensive instructions
  - E.g. shift right instead of divide by 2
  - E.g. register reuse to avoid loads/stores
- Many more

Ex: Predicated instructions

  Pseudo-code:
    if cond { true_part } else { false_part }
    more_instr

  Assembly code (with branches):
    branch cond goto LF
    true_part
    goto LD
    LF: false_part
    LD: more_instr

  Predicated code:
    (cond)  true_part
    (!cond) false_part
    more_instr

- Reduces the number of instructions executed (↓ #instr)
- Reduces branch mispredictions (↓ CPI)
- Improves the instruction-cache hit rate (↓ CPI)

Decreasing CPI
Traditional high-performance CPU architectures focus on decreasing CPI:
- Reduce data/branch hazards → CPI close to 1
- Increase IPC (instructions per cycle) through parallel processing → CPI < 1, IPC > 1
  - Implicit (hidden below the ISA): superscalar
  - Explicit (exposed through the ISA): VLIW, vector processors, SIMD

Superscalar Processors (1)
Key idea: issue more than 1 instruction per cycle to make maximum use of the computing resources.
- Relatively simple, in-order instruction dispatch + execution: dispatch N consecutive upcoming instructions each cycle until a data hazard arises
- Sophisticated, out-of-order dispatch + execution: execute N not-necessarily-consecutive instructions per cycle as long as there are available execution units
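The two optimization cases above can be compared programmatically (a sketch reusing the instruction profile from the table; `speedup_optimizing` is an invented helper name):

```python
# Compare speedups from cutting the cycle count of one instruction type.
base = {"add": (2000, 1), "mul": (1000, 5), "div": (500, 20),
        "load": (1000, 8), "store": (500, 2)}

def total_cycles(profile):
    return sum(n * cpi for n, cpi in profile.values())

def speedup_optimizing(profile, instr, new_cpi):
    """Speedup of the whole program when `instr` takes new_cpi cycles."""
    opt = dict(profile)
    n, _ = opt[instr]
    opt[instr] = (n, new_cpi)
    return total_cycles(profile) / total_cycles(opt)

print(round(speedup_optimizing(base, "add", 0.1), 2))   # 1.07
print(round(speedup_optimizing(base, "load", 0.8), 2))  # 1.38
```

Even though Add is the most frequent instruction, optimizing Load wins because Load contributes far more total cycles (8000 vs 2000), which is Amdahl's Law again.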
Tomasulo Architecture
[Figure: an FP op queue and load buffers feed reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers); results return to the FP registers and store buffers over the Common Data Bus (CDB).]
Adapted from EECS252, U.C. Berkeley
VLIW
Very Long Instruction Word (VLIW) machines:
- Each instruction is in fact composed of multiple smaller, standard instructions: 4 to 8 standard instructions per cycle
- The compiler looks for instructions from the original program that can be issued in the same cycle and packs them into one mega-instruction
- No dynamic instruction analysis in hardware
[Figure: a simplistic VLIW pipeline, with a shared IF stage and register file feeding parallel EX/$ lanes.]

Vector Processors
A processor that operates on vectors as a basic data type (compared with a scalar processor).
Vector instructions, e.g. adding 2 vectors:
  set_vector_len 64
  add vectorR, vectorA, vectorB
A form of data parallelism; reduces the no. of instructions.

SIMD
Single Instruction, Multiple Data: a class of computation architecture.
- Only one instruction stream is presented, which operates on multiple data streams
- Vector processing is a special form of SIMD in which all data are indeed vectors
- E.g. Intel's MMX, SSE, SSE2 extensions
To implement r1=a1+b1, r2=a2+b2, r3=a3+b3 and r4=a4+b4 in one instruction:
  add r1,a1,b1, r2,a2,b2, r3,a3,b3, r4,a4,b4
- Saves instructions
- May pack four 8-bit adds into a single 32-bit add, reusing the 32-bit hardware adder (with small modifications)

Explicit vs Implicit (1)
The Instruction Set Architecture (ISA) is the contract between the software and the hardware.
- The hardware guarantees certain behavior to the software according to the ISA. E.g. if an instruction i1 comes before instruction i2, then the effect of i1 will definitely be reflected when i2 is executed.
- Without changing the ISA, the hardware must extract all the instruction-level parallelism (ILP) behind the scenes while keeping the promised behavior to software: a very complicated hardware design.
- Keeping the ISA maintains binary compatibility: applications compiled to run on an Intel 8086 can still be run on a modern Intel Core i7!
- Good division of labor: easy development, and a change in HW won't affect SW.
- SW cannot foresee the data-dependent run-time behavior of the program.
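The "pack four 8-bit adds into a single 32-bit add" trick can be sketched in Python. This is only an illustration of the lane-packing idea, under the assumption that no 8-bit lane overflows into its neighbour; real SIMD hardware adds small modifications to the adder precisely so that carries do not cross lane boundaries.

```python
# Pack four 8-bit values into one 32-bit word, lane i at bits [8i, 8i+8).
def pack(lanes):
    return sum(b << (8 * i) for i, b in enumerate(lanes))

def unpack(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])

# One ordinary 32-bit add performs four 8-bit adds at once,
# as long as each lane's sum stays below 256 (no cross-lane carry).
print(unpack(a + b))  # [11, 22, 33, 44]
```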
Explicit vs Implicit (2)
Exposing the underlying parallel architecture to software allows software to bear the burden of extracting parallelism from the application:
- Simple hardware
- Software can take a long time to do the best job, because it is a one-off effort
- Any change to the hardware requires major changes to the software tools
- No division of labor
- Data-dependent behavior cannot be anticipated at compile time, so SW cannot fully exploit all possible parallelization opportunities

Performance Summary
- Key to computer performance: T_exec = (no. of instrs) × CPI / f_clk
- The clock frequency is determined by the circuit implementation
- The number of instructions and the CPI both depend on the tight interaction between the compiler and the computer micro-architecture
- Implicit parallelism, hidden behind the ISA, puts the burden on the low-level hardware implementation to extract ILP
- Explicit parallelism exposes the underlying architecture to the compiler and leaves the burden of extracting ILP to software
POWER AND ENERGY

Power and Energy
Power consumption of a circuit is the energy consumed per unit time.
- Power measures how much energy is being used/dissipated at any one time
- Affects heat dissipation
- Affects the input power supply
- Slightly affects battery lifetime
Energy consumption is the measure of the absolute amount of energy used to perform a certain operation.
- Affects battery capacity
- Concerns embedded-system designers
Both metrics are important for RC designs. Some techniques lower power but not energy.

Power, Energy and Performance
Power consumption depends on:
- Activity factor α (amount of circuit switching)
- Load capacitance C_L (size of the circuit)
- Supply voltage V_dd / swing voltage
- Clock frequency f

Dynamic Power Dissipation (a CMOS inverter driving a load C_L):
- Energy drawn from the supply during a 0→1 output transition: E_{0→1} = C_L·V_dd²
- Energy dissipated in the pull-up resistance: E_R = ½·C_L·V_dd²
- Energy stored on C_L: E_C = ½·C_L·V_dd²
- The energy stored on C_L during the 0→1 transition is drained from C_L to ground during the 1→0 transition
- In the absence of static/leakage power consumption, the capacitance keeps the energy stored until discharged

Total power (dynamic + short-circuit + static):
  P_total = α·C_L·V_dd²·f + I_sc·V_dd + I_leakage·V_dd

Energy per operation:
  E_op = P_dyn / f = α·C_L·V_dd²
Total energy consumption:
  E_total = E_op × no. of operations
Total run time:
  T_total = (no. of operations) × CPI / f

Dynamic Power Consumption
  P_dynamic = (energy/transition) × (transition rate)
            = C_L·V_dd² × P(transition)·f
            = α·C_L·V_dd²·f
            = C_eff·V_dd²·f
Power dissipation depends on the input data statistics: the more data transitions, the more power is consumed.

Switching activities
For a 2-input AND gate, Q = A & B:

  A B | Q
  0 0 | 0
  0 1 | 0
  1 0 | 0
  1 1 | 1

If both inputs switch randomly, the probability that Q makes a 0→1 transition is

  P(Q: 0→1) = P(Q=0) · P(Q=1) = (3/4) · (1/4) = 3/16
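The dynamic-power formula and the AND-gate switching probability can both be checked numerically. The circuit parameters below (α = 0.1, C_L = 1 nF as a whole-chip equivalent, V_dd = 1.2 V, f = 1 GHz) are made-up illustrative values, not from the slides:

```python
from itertools import product

# Dynamic power and energy per operation, following the formulas above.
def dynamic_power(alpha, c_load, v_dd, f_clk):
    return alpha * c_load * v_dd ** 2 * f_clk   # P_dyn = a*CL*Vdd^2*f

def energy_per_op(alpha, c_load, v_dd):
    return alpha * c_load * v_dd ** 2           # E_op = P_dyn / f

print(round(dynamic_power(0.1, 1e-9, 1.2, 1e9), 3))  # 0.144 (watts)

# Switching activity of the 2-input AND gate: enumerate all input pairs.
inputs = list(product([0, 1], repeat=2))
p_q_is_1 = sum(a & b for a, b in inputs) / len(inputs)  # 1/4
p_q_is_0 = 1 - p_q_is_1                                 # 3/4
print(p_q_is_0 * p_q_is_1)  # 0.1875 = 3/16
```

Because V_dd enters the dynamic-power term squared, halving V_dd cuts dynamic power (and E_op) to a quarter, which is why voltage scaling is the most powerful knob in the list above.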
Transistor Leakage
Transistors are not completely turned off even when they should be.
- The main contribution is from the sub-threshold current, a function of V_th
[Figure: an inverter with V_in = V_dd (V_out = 0) and with V_in = 0 (V_out = V_dd); in each case the transistor that should be OFF still conducts a leakage current I_leak through the load C_L.]

What are the Options?
The knobs that affect power consumption:
- Activity factor α (amount of circuit switching)
- Load capacitance C_L (size of the circuit)
- Supply voltage V_dd / swing voltage
- Clock frequency f

  P_total = α·C_L·V_dd²·f + I_sc·V_dd + I_leakage·V_dd   (dynamic + static)
  E_op = P_dyn / f = α·C_L·V_dd²
  E_total = E_op × no. of operations
  T_total = (no. of operations) × CPI / f