BeiHang Short Course, Part 7: HW Acceleration: It's about Performance, Energy and Power
James C. Hoe, Department of ECE, Carnegie Mellon University
Eric S. Chung, et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs?" Proc. ACM/IEEE International Symposium on Microarchitecture (MICRO), pp. 53-64, December 2010.
J. C. Hoe, CMU/ECE/CALCM, 2014, BHSC L7

Moore's Law
The number of transistors that can be economically integrated shall double every 24 months.
[http://www.intel.com/research/silicon/mooreslaw.htm]
The Real Moore's Law
[Cramming More Components Onto Integrated Circuits, G. E. Moore, 1965]

Transistor Scaling (gate length)
[http://www.intel.com/museum/online/circuits.htm]
Commercial products are at 20~14nm today; the distance between silicon atoms is ~500 pm.
Basic Scaling Theory
Planned scaling occurs in nodes ~0.7x of the previous, e.g., 90nm, 65nm, 45nm, 32nm, 20nm, 14nm, ... The End
Take the same design; reducing the linear dimensions by 0.7x (aka "gate shrink") leads to (ideally):
- die area = 0.5x
- delay = 0.7x, frequency = 1.43x
- capacitance = 0.7x
- Vdd = 0.7x (if constant field) or Vdd = 1x (if constant voltage)
- power = C V^2 f = 0.5x (if constant field) BUT power = 1x (if constant voltage)
Take the same area; then:
- transistor count = 2x
- power = 1x (constant field), power = 2x (constant voltage)

Moore's Law Performance
According to scaling theory, we should get:
- @constant complexity: 1x transistors at 1.43x frequency, i.e., 1.43x performance at 0.5x power
- @max complexity: 2x transistors at 1.43x frequency, i.e., 2.8x performance at constant power
Instead, we have been getting (for high-perf CPUs):
- ~2x transistors
- ~2x frequency (note: faster than scaling would give us)
All together we get about ~2x performance at ~2x power.
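The ideal scaling arithmetic above can be checked mechanically. This is a minimal sketch, assuming only the slide's own rules (each quantity normalized to 1.0 before the shrink); the `shrink` function and its field names are illustrative, not from any real tool:

```python
# One ideal 0.7x "gate shrink": track area, frequency, capacitance, Vdd,
# and dynamic power = C * V^2 * f, all normalized to 1.0 before the shrink.

def shrink(design, constant_field=True):
    """Apply one 0.7x linear shrink to a dict of normalized quantities."""
    s = 0.7
    v = s if constant_field else 1.0   # constant-field scales Vdd; constant-voltage does not
    return {
        "area":        design["area"] * s * s,        # linear dims 0.7x -> area 0.5x
        "frequency":   design["frequency"] / s,       # delay 0.7x -> frequency ~1.43x
        "capacitance": design["capacitance"] * s,     # 0.7x
        "vdd":         design["vdd"] * v,
        "power": (design["capacitance"] * s)          # C
                 * (design["vdd"] * v) ** 2           # V^2
                 * (design["frequency"] / s),         # f
    }

base = {"area": 1.0, "frequency": 1.0, "capacitance": 1.0, "vdd": 1.0, "power": 1.0}
cf = shrink(base, constant_field=True)    # power = 0.7 * 0.7^2 * 1.43 ~ 0.5x
cv = shrink(base, constant_field=False)   # power = 0.7 * 1 * 1.43 = 1x
print(round(cf["power"], 2), round(cv["power"], 2))
```

Running it reproduces the slide's two cases: ~0.5x power under constant-field scaling, 1x under constant-voltage scaling, both at 0.5x area.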
Performance (In)efficiency
To hit the expected performance target on single-thread microprocessors:
- we had been pushing frequency harder by deepening pipelines
- we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don't stall (i.e., caches, BP, superscalar, out-of-order)
The consequence of performance inefficiency is the limit of economical PC cooling [ITRS]: Intel P4 "Tejas" at 150W (guess what happened next)
[from Shekhar Borkar, IEEE Micro, July 1999]

How to run faster on less power?
[Chart: technology-normalized power (Watt) vs. technology-normalized performance (op/sec), from 486 to Pentium 4; the trend fits Power ∝ Perf^1.75. Better to replace 1 of this (a big, fast core) by 2 of these, or N of these (smaller, slower cores).]
[Energy per Instruction Trends in Intel Microprocessors, Grochowski et al., 2006]
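The Power ∝ Perf^1.75 fit explains the "replace 1 of this by 2 of these" arrow. A quick sketch with illustrative numbers (the 0.7x per-core performance is an assumption for the example, not measured data): two slower cores on perfectly parallel work beat one fast core on throughput at roughly the same power.

```python
# Power ~ Perf^1.75 (the Pentium 4 trend line): halve the single-thread
# push and win on throughput per watt, assuming the work parallelizes.

def power(perf, alpha=1.75):
    """Power needed for a core of the given single-thread performance."""
    return perf ** alpha

one_fast   = {"perf": 1.0,     "power": power(1.0)}        # 1.0x perf, 1.0x power
two_slower = {"perf": 2 * 0.7, "power": 2 * power(0.7)}    # perfectly parallel work

print(two_slower["perf"], round(two_slower["power"], 2))
```

The two slower cores deliver 1.4x the throughput at about 1.07x the power, i.e., better perf/Watt; the catch, taken up two slides later, is whether the workload actually supplies parallel work.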
Multicores
[Figure: one Big Core per die, 1970~2005; many smaller cores per die, 2005~??]

So what is the problem?
We (actually Intel, et al.) know how to pack more cores on a die to stay on Moore's Law in terms of aggregate or throughput performance.
How to use them?
- life is good if your N units of work is N independent programs: just run them
- what if your N units of work is N operations of the same program? rewrite a parallel program
- what if your N units of work is N sequentially dependent operations of the same program???
How many cores can you use up meaningfully?
Future is about Perf/Watt and Op/Joule
By 2022 (11nm node), a large die (~500mm^2) will have over 10 billion transistors. What will you choose to put on it? Big Core? Custom Logic? GPGPU? FPGA?

Amdahl's Law on Speedup
A program is rarely completely parallelizable. Let's say a fraction f can be sped up by a factor of p and the rest cannot:
time_baseline = (1 - f) + f
time_accelerated = (1 - f) + f/p
Speedup = 1 / ((1 - f) + f/p)
If f is small, p doesn't matter. An architect also cannot ignore the sequential performance of the (1 - f) portion.
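The formula above is a one-liner, and coding it makes the "if f is small, p doesn't matter" point concrete:

```python
# Amdahl's law as stated on the slide: a fraction f of the runtime is
# sped up by a factor p; the remaining (1 - f) runs at the old speed.

def amdahl(f, p):
    return 1.0 / ((1.0 - f) + f / p)

# With f = 0.5, even a 1000x accelerator cannot reach 2x overall:
print(amdahl(0.5, 1000))   # just under 2x
# With f = 0.99, p matters: a modest 10x accelerator gives ~9.2x overall.
print(amdahl(0.99, 10))
```

The limit as p grows without bound is 1/(1 - f), which is why the rest of the lecture keeps asking what fraction of a workload is actually amenable to acceleration.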
Asymmetric Multicores
Fast Core plus Base Core Equivalents (BCEs) [Hill and Marty, 2008]
Combine a power-hungry fast core with lots of simple, power-efficient slow cores:
- fast core for sequential code
- slow cores for parallel sections

Speedup = 1 / ( (1 - f)/perf_seq(r) + f/(perf_seq(r) + (n - r)) )   [Hill and Marty, 2008]

Solve for the optimal die area allocation depending on the parallelizability f. But up to 95% of the energy of von Neumann processors is overhead [Hameed, et al., ISCA 2010].

Modeling Heterogeneous Multicores
Asymmetric: Fast Core + Base Core Equivalents. Heterogeneous: Fast Core + U-cores.
[Hill and Marty, 2008] simplified:
- f is the fraction parallelizable
- n is the total die area in BCE units
- r is the fast core's area in BCE units
- perf_seq(r) is the fast core's performance relative to a BCE

Asymmetric:    Speedup = 1 / ( (1 - f)/perf_seq(r) + f/(perf_seq(r) + (n - r)) )
Heterogeneous: Speedup = 1 / ( (1 - f)/perf_seq(r) + f/(μ (n - r)) )

For the sake of analysis, break the area for GPU/FPGA/etc. into units of U-cores that are the same size as BCEs. Each U-core type is characterized by a relative performance μ and a relative power Φ compared to a BCE.
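The two speedup formulas above drop straight into code. One assumption in this sketch that is not on the slide: `perf_seq(r) = sqrt(r)` (Pollack's rule), the form Hill and Marty themselves use; any monotone model would do.

```python
import math

def perf_seq(r):
    """Fast-core performance vs. a BCE; assumed Pollack's rule, perf ~ sqrt(area)."""
    return math.sqrt(r)

def speedup_asymmetric(f, n, r):
    # Sequential phase runs on the fast core; the parallel phase runs on
    # the fast core plus the remaining n - r BCEs.
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (perf_seq(r) + (n - r)))

def speedup_heterogeneous(f, n, r, mu):
    # Parallel phase instead runs on n - r BCE-sized U-cores, each with
    # relative performance mu.
    return 1.0 / ((1.0 - f) / perf_seq(r) + f / (mu * (n - r)))

# 256 BCEs of die, 16 spent on the fast core, f = 0.99:
a = speedup_asymmetric(0.99, 256, 16)
h = speedup_heterogeneous(0.99, 256, 16, mu=2.0)
print(round(a, 1), round(h, 1))   # U-cores with mu > 1 pull ahead
```

Sweeping r against f reproduces Hill and Marty's observation that the best split between fast-core area and parallel area depends entirely on f.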
Case Study [Chung, MICRO 2010]

Device                 | Type | Year | Node | Die area | Clock rate
Intel Core i7 960      | CPU  | 2009 | 45nm | 263mm^2  | 3.2GHz
Nvidia GTX285          | GPU  | 2008 | 55nm | 470mm^2  | 1.5GHz
Nvidia GTX480          | GPU  | 2010 | 40nm | 529mm^2  | 1.4GHz
ATI R5870              | GPU  | 2009 | 40nm | 334mm^2  | 1.5GHz
Xilinx V6 LX760        | FPGA | 2009 | 40nm |          | 0.3GHz
Std. Cell              | ASIC | 2007 | 65nm |          |

Single-precision floating-point apps: MMM, FFT, Black-Scholes.
Implementations include: MKL 10.2.3, Spiral.net, and PARSEC (multithreaded) on the CPU; CUDA 2.3, CUBLAS 2.3/3.1, and CUFFT 2.3/3.0 on the Nvidia GPUs; hand-coded CAL++ on the ATI GPU; hand-coded Spiral.net RTL on the FPGA/ASIC.

Best-Case Performance and Energy: MMM

Device                    | GFLOP/s (actual) | (GFLOP/s)/mm^2 (norm. to 40nm) | GFLOP/J (norm. to 40nm)
Intel Core i7 (45nm)      | 96   | 0.50  | 1.14
Nvidia GTX285 (55nm)      | 425  | 2.40  | 6.78
Nvidia GTX480 (40nm)      | 541  | 1.28  | 3.52
ATI R5870 (40nm)          | 1491 | 5.95  | 9.87
Xilinx V6 LX760 (40nm)    | 204  | 0.53  | 3.62
Same RTL, std cell (65nm) |      | 19.28 | 50.73

CPU and GPU benchmarking is compute bound; the FPGA and std cell are effectively compute bound (no off-chip I/O). Power (switching + leakage) measurements isolated the device from the system. For detail see [Chung, et al., MICRO 2010].
Less Regular Applications

FFT 2^10
Device                    | GFLOP/s | (GFLOP/s)/mm^2 | GFLOP/J
Intel Core i7 (45nm)      | 67   | 0.35 | 0.71
Nvidia GTX285 (55nm)      | 250  | 1.41 | 4.2
Nvidia GTX480 (40nm)      | 453  | 1.08 | 4.3
ATI R5870 (40nm)          |      |      |
Xilinx V6 LX760 (40nm)    | 380  | 0.99 | 6.5
Same RTL, std cell (65nm) | 952  | 239  | 90

Black-Scholes
Device                    | Mopts/s | (Mopts/s)/mm^2 | Mopts/J
Intel Core i7 (45nm)      | 487   | 2.52  | 4.88
Nvidia GTX285 (55nm)      | 10756 | 60.72 | 189
Nvidia GTX480 (40nm)      |       |       |
ATI R5870 (40nm)          |       |       |
Xilinx V6 LX760 (40nm)    | 7800  | 20.26 | 138
Same RTL, std cell (65nm) | 25532 | 1719  | 642.5

Φ and μ Example Values
Device        |   | MMM  | Black-Scholes | FFT 2^10
Nvidia GTX285 | Φ | 0.74 | 0.57          | 0.63
              | μ | 3.41 | 17.0          | 2.88
Nvidia GTX480 | Φ | 0.77 |               | 0.47
              | μ | 1.83 |               | 2.20
ATI R5870     | Φ | 1.27 |               |
              | μ | 8.47 |               |
Xilinx LX760  | Φ | 0.31 | 0.26          | 0.29
              | μ | 0.75 | 5.68          | 2.02
Custom Logic  | Φ | 0.79 | 4.75          | 4.96
              | μ | 27.4 | 482           | 489

For example, on an equal-area basis the GTX285 on MMM delivers 3.41x performance at 0.74x power relative to a BCE.
The nominal BCE is based on an Intel Atom in-order processor, 26mm^2 in a 45nm process.
Modeling Power and Bandwidth Budgets
Heterogeneous: Fast Core + U-cores.
Speedup = 1 / ( (1 - f)/perf_seq(r) + f/(μ (n - r)) )
The above is based on area alone; a power or bandwidth budget limits the usable die area:
- if P is the total power budget expressed as a multiple of a BCE's power, then usable U-core area ≤ min(n - r, P/Φ)
- if B is the total memory bandwidth expressed also as a multiple of a BCE's, then usable U-core area ≤ min(n - r, B/μ)

Combine Model with ITRS Trends

Year                    | 2011 | 2013  | 2016 | 2019  | 2022
Technology              | 40nm | 32nm  | 22nm | 16nm  | 11nm
Core die budget (mm^2)  | 432  | 432   | 432  | 432   | 432
Normalized area (BCEs)  | 19   | 37    | 75   | 149   | 298 (16x)
Core power (W)          | 100  | 100   | 100  | 100   | 100
Bandwidth (GB/s)        | 180  | 198   | 234  | 234   | 252 (1.4x)
Rel. power per device   | 1X   | 0.75X | 0.5X | 0.36X | 0.25X

The 2011 parameters reflect high-end systems today; future parameters are extrapolated from ITRS 2009. The 432mm^2 is populated by an optimally sized Fast Core and U-cores of choice.
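The budget-limited model above can be sketched in a few lines. Assumptions beyond the slide: `perf_seq(r) = sqrt(r)` (Pollack's rule, as before), and that U-core bandwidth demand scales with delivered performance, which is what makes B/μ the bandwidth-limited area; the example μ and Φ are the GTX285 MMM values from the earlier table.

```python
import math

def budget_limited_speedup(f, n, r, mu, phi, P, B):
    """Heterogeneous speedup when the U-core area is capped by area,
    power (P, in multiples of a BCE's power), or memory bandwidth
    (B, in multiples of a BCE's bandwidth)."""
    perf_seq = math.sqrt(r)                 # assumed Pollack's-rule fast core
    usable = min(n - r,                     # area bound
                 P / phi,                   # power bound: each U-core burns phi
                 B / mu)                    # bandwidth bound: each delivers mu
    return 1.0 / ((1.0 - f) / perf_seq + f / (mu * usable))

# 2022-ish die of 298 BCEs, a 16-BCE fast core, GTX285-class MMM U-cores:
s = budget_limited_speedup(f=0.99, n=298, r=16, mu=3.41, phi=0.74,
                           P=100, B=1000)
print(round(s, 1))   # power bound here: P/phi ~ 135 BCEs < n - r = 282
```

Shrinking B in the call is a quick way to see the memory-bound regime from the charts that follow: once B/μ is the binding term, adding die area or power stops helping.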
Single Prec. MMMult (f=90%)
[Chart: speedup (up to ~40) vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for the symmetric multicore, asymmetric multicore, ASIC, GPU R5870, and FPGA LX760; markers flag where each design becomes power bound or memory bound.]

Single Prec. MMMult (f=99%)
[Chart: speedup (up to ~300) vs. technology node for the same designs; the ASIC leads, followed by the GPU R5870 and FPGA LX760, with the symmetric and asymmetric multicores well below. Power-bound and memory-bound points are marked.]
Single Prec. MMMult (f=50%)
[Chart: speedup (up to ~8) vs. technology node; the ASIC, GPU, and FPGA curves converge, above the asymmetric multicore, which in turn beats the symmetric multicore. Power-bound and memory-bound points are marked.]

Single Prec. FFT 1024 (f=99%)
[Chart: speedup (up to ~60) vs. technology node for the symmetric multicore, asymmetric multicore, ASIC, GPU GTX480/R5870, and FPGA LX760; the ASIC/GPU/FPGA curves converge above both multicores. Power-bound and memory-bound points are marked.]
FFT 1024 (f=99%), if hypothetical 1TB/sec bandwidth
[Chart: speedup (up to ~200) vs. technology node; with the memory-bandwidth bound relaxed, the ASIC pulls well ahead of the FPGA LX760 and GPU GTX480/R5870, all far above the symmetric and asymmetric multicores.]

Recap the Last Hour
- Perf per Watt and op per Joule matter; we need to look beyond programmable processors
- GPUs and FPGAs are viable programmable candidates
- GPU/FPGA/custom logic help performance only if a significant fraction of the workload is amenable to acceleration, and there is adequate bandwidth to sustain the acceleration (3D-stacked memory could be very helpful!)
- Without adequate bandwidth, the GPU/FPGA catch up with custom logic in achievable performance but remain programmable and flexible
- Custom logic is best for maximizing performance under tight energy/power constraints
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/calcm