BeiHang Short Course, Part 7: HW Acceleration: It's about Performance, Energy and Power


James C. Hoe
Department of ECE
Carnegie Mellon University

Eric S. Chung, et al., "Single-chip Heterogeneous Computing: Does the future include Custom Logic, FPGAs, and GPUs?" Proc. ACM/IEEE International Symposium on Microarchitecture (MICRO), pp. 53-64, December 2010.

J. C. Hoe, CMU/ECE/CALCM, 2014, BHSC L7 s1

Moore's Law
The number of transistors that can be economically integrated shall double every 24 months


The Real Moore's Law
[Figure: from "Cramming More Components Onto Integrated Circuits," G. E. Moore, 1965]

Transistor Scaling
[Figure: transistor gate length]
Commercial products are at 20~14nm today
Distance between silicon atoms ~ 500 pm

Basic Scaling Theory
Planned scaling occurs in nodes, each ~0.7x of the previous, e.g., 90nm, 65nm, 45nm, 32nm, 20nm, 14nm, The End
Take the same design; reducing the linear dimensions by 0.7x (aka "gate shrink") leads to (ideally)
  die area = 0.5x
  delay = 0.7x, frequency = 1.43x
  capacitance = 0.7x
  Vdd = 0.7x (if constant field) or Vdd = 1x (if constant voltage)
  power = C x V^2 x f = 0.5x (if constant field)
  BUT power = 1x (if constant voltage)
Take the same area, then
  transistor count = 2x
  power = 1x (constant field), power = 2x (constant voltage)

Moore's Law Performance
According to scaling theory, we should get
  at constant complexity: 1x transistors at 1.43x frequency, giving 1.43x performance at 0.5x power
  at 2x complexity: 2x transistors at 1.43x frequency, giving 2.8x performance at constant power
Instead, we have been getting (for high-perf CPUs)
  ~2x transistors
  ~2x frequency (note: faster than scaling would give us)
  all together, about ~2x performance at ~2x power
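The scaling arithmetic above can be checked with a short script. This is a minimal sketch, not part of the original slides: the function names `shrink` and `same_area` are hypothetical, and the model is only the ideal first-order one stated on the slide (dynamic power = C x V^2 x f, no leakage).

```python
def shrink(s=0.7, constant_field=True):
    """Ideal scaling factors when the same design is shrunk by linear factor s.

    Hypothetical helper for illustration; first-order model only.
    """
    area = s * s                        # both linear dimensions shrink by s
    delay = s                           # gate delay scales with linear dimension
    frequency = 1.0 / delay             # so frequency scales by 1/s
    capacitance = s
    vdd = s if constant_field else 1.0  # constant field scales Vdd, constant voltage does not
    power = capacitance * vdd ** 2 * frequency  # dynamic power = C * V^2 * f
    return {
        "area": area,
        "delay": delay,
        "frequency": frequency,
        "capacitance": capacitance,
        "vdd": vdd,
        "power": power,
    }


def same_area(s=0.7, constant_field=True):
    """Factors when the freed area is refilled with more of the same transistors."""
    one_copy = shrink(s, constant_field)
    transistors = 1.0 / one_copy["area"]            # about 2x for s = 0.7
    return {
        "transistors": transistors,
        "power": transistors * one_copy["power"],   # total power of the full die
    }


if __name__ == "__main__":
    for mode in (True, False):
        label = "constant field" if mode else "constant voltage"
        print(label, shrink(constant_field=mode), same_area(constant_field=mode))
```

With s = 0.7 the exact factors are 0.49x area and about 2.04x transistors, which the slide rounds to 0.5x and 2x.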


Performance (In)efficiency
To hit the expected performance target on single-thread microprocessors
  we had been pushing frequency harder by deepening pipelines
  we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don't stall (i.e., caches, BP, superscalar, out-of-order)
The consequence of performance inefficiency is
[Figure: processor power against the limit of economical PC cooling [ITRS]; Intel P4 Tejas at 150W (guess what happened next); from Shekhar Borkar, IEEE Micro, July 1999]

How to run faster on less power?
[Figure: technology-normalized power (Watt) vs. technology-normalized performance (op/sec), with the Pentium 4 marked and the power vs. perf trend annotated; from "Energy per Instruction Trends in Intel Microprocessors," Grochowski et al., 2006]
Better to replace 1 of this by 2 of these; or N of these

Multicores
[Figure: Big Core, 1970~; multiple cores, ~??]

So what is the problem?
We (actually Intel, et al.) know how to pack more cores on a die to stay on Moore's law in terms of aggregate or throughput performance
How to use them?
  life is good if your N units of work are N independent programs: just run them
  what if your N units of work are N operations of the same program? Rewrite it as a parallel program
  what if your N units of work are N sequentially dependent operations of the same program? ???
How many cores can you use up meaningfully?


Future is about Perf/Watt and Op/Joule
By 2022 (11nm node), a large die (~500mm^2) will have over 10 billion transistors
What will you choose to put on it?
[Figure: candidate die contents: Big Core, Custom Logic, GPGPU, FPGA]

Amdahl's Law on Speedup
A program is rarely completely parallelizable. Let's say a fraction f can be sped up by a factor of p and the rest cannot

$$\text{time}_{\text{baseline}} = (1 - f) + f$$

$$\text{time}_{\text{accelerated}} = (1 - f) + \frac{f}{p}$$

If f is small, p doesn't matter
An architect also cannot ignore the sequential performance of the (1 - f) portion
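Amdahl's Law as stated above is the ratio of baseline time to accelerated time, 1 / ((1 - f) + f/p). The following is a minimal sketch for trying out values of f and p; the function name `amdahl_speedup` is chosen here for illustration and does not come from the slides.

```python
def amdahl_speedup(f, p):
    """Overall speedup when a fraction f of the work is sped up by a factor p.

    Baseline time is (1 - f) + f = 1; accelerated time is (1 - f) + f / p.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if p <= 0:
        raise ValueError("p must be positive")
    return 1.0 / ((1.0 - f) + f / p)


if __name__ == "__main__":
    # Even a very large p cannot push speedup past 1 / (1 - f).
    for f in (0.5, 0.9, 0.99):
        for p in (10, 100, 1_000_000):
            print(f"f={f:<5} p={p:<8} speedup={amdahl_speedup(f, p):.2f}")
```

Sweeping p for a fixed f shows the point of the slide: the speedup is capped at 1 / (1 - f), so when f is small a larger p buys almost nothing.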

Asymmetric Multicores
[Figure: one Fast Core plus many Base Core Equivalents (BCE), as in [Hill and Marty, 2008]]
Combine a power-hungry fast core with lots of simple, power-efficient slow cores
  fast core for sequential code
  slow cores for parallel sections

$$\text{Speedup} = \frac{1}{\dfrac{1-f}{\text{perf}_{\text{seq}}(r)} + \dfrac{f}{\text{perf}_{\text{seq}}(r) + (n - r)}}$$ [Hill and Marty, 2008]

Solve for optimal die area allocation depending on parallelizability f
But, up to 95% of energy of von Neumann processors is overhead [Hameed, et al., ISCA 2010]

Modeling Heterogeneous Multicores
Asymmetric (Fast Core plus BCEs), [Hill and Marty, 2008] simplified:

$$\text{Speedup} = \frac{1}{\dfrac{1-f}{\text{perf}_{\text{seq}}(r)} + \dfrac{f}{n - r}}$$

Heterogeneous (Fast Core plus U-cores):

$$\text{Speedup} = \frac{1}{\dfrac{1-f}{\text{perf}_{\text{seq}}(r)} + \dfrac{f}{\mu\,(n - r)}}$$

  f is fraction parallelizable
  n is total die area in BCE units
  r is fast core area in BCE units
  perf_seq(r) is fast core perf. relative to BCE
For the sake of analysis, break the area for GPU/FPGA/etc. into units of U-cores that are the same size as BCEs. Each U-core type is characterized by a relative performance μ and relative power Φ compared to a BCE.
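The area-allocation question ("solve for optimal die area allocation depending on f") can be explored numerically. The sketch below makes assumptions that are not in the slides: perf_seq(r) = sqrt(r), which is the assumption used in Hill and Marty's paper, and a brute-force search over integer r. The three functions implement the asymmetric model, its simplified form in which only the n - r BCEs run parallel code, and the heterogeneous form scaled by μ; all function names are hypothetical.

```python
import math


def perf_seq(r):
    """Sequential performance of a fast core built from r BCEs.

    Assumption: sqrt(r), as in Hill and Marty (2008); the slides do not fix this.
    """
    return math.sqrt(r)


def speedup_asymmetric(f, n, r):
    """Fast core helps in the parallel phase alongside the n - r BCEs."""
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (perf_seq(r) + (n - r)))


def speedup_asymmetric_simplified(f, n, r):
    """Simplified form: only the n - r BCEs run the parallel phase."""
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (n - r))


def speedup_heterogeneous(f, n, r, mu):
    """Parallel phase runs on n - r U-cores, each mu times a BCE."""
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (mu * (n - r)))


def best_r(speedup, f, n, **kwargs):
    """Brute-force the integer fast-core size r (1 <= r < n) with highest speedup."""
    return max(range(1, n), key=lambda r: speedup(f, n, r, **kwargs))


if __name__ == "__main__":
    n = 64  # hypothetical die budget in BCE units
    for f in (0.5, 0.9, 0.99):
        r = best_r(speedup_asymmetric, f, n)
        print(f"f={f}: best r={r}, speedup={speedup_asymmetric(f, n, r):.1f}")
```

Lower f pushes the best r toward a larger fast core, which is the slide's point that the sequential (1 - f) portion cannot be ignored.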

Case Study [Chung, MICRO 2010]

 | CPU | GPU | GPU | GPU | FPGA | ASIC
Device | Intel Core i7 960 | Nvidia GTX285 | Nvidia GTX480 | ATI R5870 | Xilinx V6 LX760 | Std. Cell
Year | | | | | |
Node | 45nm | 55nm | 40nm | 40nm | 40nm | 65nm
Die area | 263mm^2 | 470mm^2 | 529mm^2 | 334mm^2 | |
Clock rate | 3.2GHz | 1.5GHz | 1.4GHz | 1.5GHz | 0.3GHz |

Single-prec. floating-point apps: M-M-Mult, FFT, Black-Scholes
Implementations used: MKL (multithreaded), Spiral.net (multithreaded), CUBLAS 2.3, CUFFT /3.1, PARSEC (multithreaded), CUDA 2.3, CUBLAS 3.1, CAL++, hand coded, CUFFT 3.0, Spiral.net, hand coded

Best Case Performance and Energy: MMM
[Table: GFLOP/s (actual), (GFLOP/s)/mm^2 (normalized to 40nm), and GFLOP/J (normalized to 40nm) for Intel Core i7 (45nm), Nvidia GTX285 (55nm), Nvidia GTX480 (40nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), and same RTL std cell (65nm)]
CPU and GPU benchmarking is compute bound; FPGA and Std Cell effectively compute bound (no off-chip I/O)
Power (switching+leakage) measurements isolated the core from the system
For detail see [Chung, et al. MICRO 2010]

Less Regular Applications
[Table: FFT 2^10 in GFLOP/s, (GFLOP/s)/mm^2, and GFLOP/J for Intel Core i7 (45nm), Nvidia GTX285 (55nm), Nvidia GTX480 (40nm), ATI R5870 (40nm), Xilinx V6 LX760 (40nm), and same RTL std cell (65nm)]
[Table: Black-Scholes in Mopts/s, (Mopts/s)/mm^2, and Mopts/J for the same six devices]

Φ and μ example values
[Table: Φ and μ for MMM, Black-Scholes, and FFT 2^10, for Nvidia GTX285, Nvidia GTX480, ATI R5870, Xilinx LX760, and Custom Logic]
Surviving entries: ATI R5870, Φ 1.27 and μ 8.47; Xilinx LX760, Φ 0.31, 0.26, 0.29
Example reading: on equal area basis, 3.41x performance at 0.74x power relative to a BCE
Nominal BCE based on an Intel Atom in-order processor, 26mm^2 in a 45nm process

Modeling Power and Bandwidth Budgets
Heterogeneous (Fast Core plus U-cores):

$$\text{Speedup} = \frac{1}{\dfrac{1-f}{\text{perf}_{\text{seq}}(r)} + \dfrac{f}{\mu\,(n - r)}}$$

The above is based on area alone
Power or bandwidth budget limits the usable die area
  if P is total power budget expressed as a multiple of a BCE's power, then usable U-core area $n - r \le P / \Phi$
  if B is total memory bandwidth expressed also as a multiple of BCEs, then usable U-core area $n - r \le B / \mu$

Combine Model with ITRS Trends
[Table rows, by technology node]
  Year:
  Technology: 40nm, 32nm, 22nm, 16nm, 11nm
  Core die budget (mm^2):
  Normalized area (BCE): (16x)
  Core power (W):
  Bandwidth (GB/s): (1.4x)
  Rel pwr per device: 1X, 0.75X, 0.5X, 0.36X, 0.25X
2011 parameters reflect high-end systems today; future parameters extrapolated from ITRS
The die area is populated by an optimally sized Fast Core and U-cores of choice
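The two budget constraints can be folded into the heterogeneous speedup model by capping the usable U-core area at the smallest of the area, power, and bandwidth limits. This is a minimal sketch under stated assumptions: perf_seq(r) = sqrt(r) as in Hill and Marty's paper, hypothetical function names, and made-up budget values for n, r, P, and B. Only μ = 3.41 and Φ = 0.74 are taken from the example reading on the slides.

```python
import math


def usable_ucore_area(n, r, P, B, phi, mu):
    """Usable U-core area in BCE units and the name of the binding limit.

    Area limit: n - r. Power limit: P / phi. Bandwidth limit: B / mu.
    """
    if not 0 < r < n:
        raise ValueError("need 0 < r < n")
    if min(P, B, phi, mu) <= 0:
        raise ValueError("budgets and U-core parameters must be positive")
    limits = {"area": n - r, "power": P / phi, "bandwidth": B / mu}
    binding = min(limits, key=limits.get)
    return limits[binding], binding


def bounded_speedup(f, n, r, P, B, phi, mu):
    """Heterogeneous speedup with the U-core area capped by power and bandwidth."""
    area, _ = usable_ucore_area(n, r, P, B, phi, mu)
    perf_seq = math.sqrt(r)  # assumption from Hill and Marty (2008)
    return 1.0 / ((1.0 - f) / perf_seq + f / (mu * area))


if __name__ == "__main__":
    # n, r, P, B are hypothetical; mu and phi are the slide's example values.
    n, r, P, B = 64, 4, 20.0, 500.0
    mu, phi = 3.41, 0.74
    area, binding = usable_ucore_area(n, r, P, B, phi, mu)
    print(f"usable area = {area:.1f} BCE units, bound by {binding}")
    print(f"speedup at f=0.99: {bounded_speedup(0.99, n, r, P, B, phi, mu):.1f}")
```

Raising μ tightens the bandwidth limit while a lower Φ relaxes the power limit, so a faster U-core type can become memory bound sooner than a slower one.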

Single Prec. MMMult (f=90%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for (0) symmetric multicore (SymCMP), (1) asymmetric multicore (AsymCMP), (2) ASIC, (3) FPGA LX760, and (5) GPU R5870; points are marked as power bound or memory bound]

Single Prec. MMMult (f=99%)
[Figure: same axes and series as above, for f=0.990; the symmetric and asymmetric multicore curves share one label]

Single Prec. MMMult (f=50%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for (0) SymCMP, (1) AsymCMP, (2) ASIC, (3) FPGA LX760, and (5) GPU R5870; curves are labeled ASIC/GPU/FPGA, Asymmetric, and Symmetric; points are marked as power bound or memory bound]

Single Prec. FFT 1024 (f=99%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for (0) SymCMP, (1) AsymCMP, (2) ASIC, (3) FPGA LX760, (4) GPU GTX480, and R5870; curves are labeled ASIC/GPU/FPGA, Asymmetric, and Symmetric; points are marked as power bound or memory bound]

FFT 1024 (f=99%) if hypothetical 1TB/sec bandwidth
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for (0) SymCMP, (1) AsymCMP, (2) ASIC, (3) FPGA LX760, (4) GPU GTX480, and R5870, for f=0.990; the symmetric and asymmetric multicore curves share one label; points are marked as power bound or memory bound]

Recap the Last Hour
Perf per Watt and op per Joule matter
  need to look beyond programmable processors
  GPUs and FPGAs are viable programmable candidates
GPU/FPGA/custom logic help performance only if
  significant fraction amenable to acceleration, and
  adequate bandwidth to sustain acceleration
  3D stacked memory could be very helpful!
Without adequate bandwidth, GPU/FPGA catches up with custom logic in achievable performance but remains programmable and flexible
Custom logic best for maximizing performance under tight energy/power constraints

Computer Architecture Lab (CALCM)
Carnegie Mellon University
