Lecture 27: Hardware Acceleration. James C. Hoe Department of ECE Carnegie Mellon University

Size: px

Start display at page:

Download "Lecture 27: Hardware Acceleration. James C. Hoe Department of ECE Carnegie Mellon University"

Ilene Riley
5 years ago
Views:

1 Lecture 27: Hardware Acceleration James C. Hoe Department of ECE Carnegie Mellon niversity S18 L27 S1, James C. Hoe, CM/ECE/CALCM, 2018

2 S18 L27 S2, James C. Hoe, CM/ECE/CALCM, 2018 Housekeeping Your goal today see why you should care about accelerators know the basics to think about the topic Notices Lab4, due this week HW5, past due practice final solutions (hardcopies) Readings Amdahl's Law in the Multi Era, 2008 (optional) Single Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPs? 2010 (optional)

3 HW Acceleration is nothing new! What needed to be faster/smaller/cheaper/ lower energy than SW has always been done in HW we go to HW when SW isn t good enough because good HW can be more efficient we don t go to HW when SW is good enough because good HW takes more work When we say HW acceleration, we always mean efficient and not just correct S18 L27 S3, James C. Hoe, CM/ECE/CALCM, 2018

4 Computing s Brave New World Microsoft Catapult [MICRO 2016, Caulfield, et al.] Google TP [Hotchips, 2017, Jeff Dean] S18 L27 S4, James C. Hoe, CM/ECE/CALCM, 2018

5 How we got here.... Big Core 1970~ ~?? S18 L27 S5, James C. Hoe, CM/ECE/CALCM, 2018

6 Moore s Law without Dennard Scaling Intl. Technology Roadmap for Semiconductors logic density VDD >16x 1 25% 0 node label ?? S18 L27 S6, James C. Hoe, CM/ECE/CALCM, 2018 nder fixed power ceiling, more ops/second only achievable if less Joules/op?

7 Future is about Performance/Watt and Ops/Joule Big Core What will (or can) you do with all those transistors? GPGP FPGA Custom Logic S18 L27 S7, James C. Hoe, CM/ECE/CALCM, 2018 This is a sign of desperation....

8 Why is Computing Directly in Hardware Efficient? S18 L27 S8, James C. Hoe, CM/ECE/CALCM, 2018

9 S18 L27 S9, James C. Hoe, CM/ECE/CALCM, 2018 Why is HW/FPGA better? no overhead A processor spends a lot of transistors & energy to present von Neumann ISA abstraction to support a broad application base (e.g., caches, superscalar out of order, prefetching,...) In fact, processor is mostly overhead ~90% energy [Hameed, ISCA 2010, Tensilica ] ~95% energy [Balfour, CAL 2007, embedded RISC ] even worse on a high perf superscalar OoO proc Computing directly in application specific hardware can be 10x to 100x more energy efficient

10 Why is HW/FPGA better? For a given functionality, non linear tradeoff between power and performance slower design is simpler lower frequency needs lower voltage For the same throughput, replacing 1 module by 2 half as fast reduces total power and energy S18 L27 S10, James C. Hoe, CM/ECE/CALCM, 2018 efficiency of parallelism Power (Joule/sec) Better to replace 1 of this by 2 of these; or N of these Power Perf k>1 Perf (op/sec) Good hardware designs derive performance from parallelism

11 Software to Hardware Spectrum Software CP: highest level abstraction / most general purpose support GP: explicitly parallel programs / best for SIMD, regular FPGA: ASIC like abstraction / overhead for reprogrammability ASIC: lowest level abstraction / fixed application and tuning Hardware tradeoff between efficiency and effort S18 L27 S11, James C. Hoe, CM/ECE/CALCM, 2018

12 Case Study [Chung, MICRO 2010] CP GPs FPGA ASIC Intel Core i7 960 Nvidia GTX285 ATI R5870 Xilinx V6 LX760 Std. Cell Year Node 45nm 55nm 40nm 40nm 65nm Die area 263mm 2 470mm 2 334mm 2 Clock rate 3.2GHz 1.5GHz 1.5GHz 0.3GHz Single prec. floating point apps M M Mult FFT Black Scholes MKL Multithreaded Spiral.net Multithreaded PARSEC multithreaded CBLAS 2.3 CAL++ hand coded CFFT /3.1 CDA 2.3 Spiral.net hand coded S18 L27 S12, James C. Hoe, CM/ECE/CALCM, 2018

13 Best Case Performance and Energy MMM Device GFLOP/s actual (GFLOP/s)/mm 2 normalized to 40nm GFLOP/J normalized to 40nm Intel Core i7 (45nm) Nvidia GTX285 (55nm) ATI R5870 (40nm) XilinxV6 LX760 (40nm) same RTL std cell (65nm) CP and GP benchmarking is compute bound; FPGA and Std Cell effectively compute bound (no off chip I/O) Power (switching+leakage) measurements isolated the from the system For detail see [Chung, et al. MICRO 2010] S18 L27 S13, James C. Hoe, CM/ECE/CALCM, 2018

14 Less Regular Applications GFLOP/s (GFLOP/s)/mm 2 GFLOP/J FFT 2 10 Intel Core i7 (45nm) Nvidia GTX285 (55nm) ATI R5870 (40nm) XilinxV6 LX760 (40nm) same RTL std cell (65nm) Mopt/s (Mopts/s)/mm 2 Mopts/J Black Scholes Intel Core i7 (45nm) Nvidia GTX285 (55nm) ATI R5870 (40nm) XilinxV6 LX760 (40nm) same RTL std cell (65nm) S18 L27 S14, James C. Hoe, CM/ECE/CALCM, 2018

15 ASIC isn t always ultimate in performance Amdahl s Law: S overall = 1 / ( (1 f)+ f/s f ) S f ASIC > S f FPGA but f ASIC f FPGA f FPGA > f ASIC (when not perfectly app specific) more flexible design to cover a greater fraction reprogram FPGA to cover different applications S f ASIC S overall S f FPGA [based on Joel Emer s original comment about programmable accelerators in general] S18 L27 S15, James C. Hoe, CM/ECE/CALCM, 2018 f ASIC f FPGA (at break even)

16 S18 L27 S16, James C. Hoe, CM/ECE/CALCM, 2018 Tradeoff in Heterogeneity?

17 Amdahl s Law on Multi Base Core Equivalent () in [Hill and Marty, 2008] A program is rarely completely parallelizable; let s say a fraction f is perfectly parallelizable Speedup of n s over sequential Speedup for small f, die area underutilized 1 (1 f ) f n S18 L27 S17, James C. Hoe, CM/ECE/CALCM, 2018

18 F=0.999 F=0.99 F=0.9 F=0.5 more smaller s S18 L27 S18, James C. Hoe, CM/ECE/CALCM, 2018 size of s in fewer larger s

19 Asymmetric Multis Fast Core Base Core Equivalent () in [Hill and Marty, 2008] S18 L27 S19, James C. Hoe, CM/ECE/CALCM, 2018 Pwr/area efficient slow s vs pwr/area hungry fast fast for sequential code slow s for parallel sections [Hill and Marty, 2008] Speedup 1 f perf 1 f ( n r) seq perf seq r = cost of fast in perf seq = speedup of fast over solve for optimal die allocation

20 F=0.999 F=0.99 assume perf of fast grows with sqrt of area F=0.9 F= S18 L27 S20, James C. Hoe, CM/ECE/CALCM, 2018 size of fast in

21 Asymmetric Fast Core Fast Core S18 L27 S21, James C. Hoe, CM/ECE/CALCM, 2018 Heterogeneous Multis [Chung, et al. MICRO 2010] Base Core Equivalent Heterogeneous Speedup Speedup 1 f perf 1- f perf seq perf seq seq 1 f ( n r) [Hill and Marty, 2008] simplified f is fraction parallelizable n is total die area in units r is fast area in units perf seq (r) is fast perf. relative to 1 f ( n - r) For the sake of analysis, break the area for GP/FPGA/etc. into units of s that are the same size as s. Each type is characterized by a relative performance µ and relative power compared to a

22 and μ example values MMM Black Scholes FFT 2 10 Nvidia GTX285 Xilinx LX760 Custom Logic Φ μ On equal area basis, 3.41x performance at 0.74x power relative a Φ μ Φ μ S18 L27 S22, James C. Hoe, CM/ECE/CALCM, 2018 Nominal based on an Intel Atom in order processor, 26mm 2 in a 45nm process

23 Modeling Power and Bandwidth Budgets Heterogeneous Fast Core Speedup 1- f perf seq 1 f ( n - r) The above is based on area alone Power or bandwidth budget limits the usable die area if P is total power budget expressed as a multiple of a s power, usable area n r P if B is total memory bandwidth expressed as a multiple of s, usable area n r B μ S18 L27 S23, James C. Hoe, CM/ECE/CALCM, 2018

24 Combine Model with ITRS Trends Year Technology 40nm 32nm 22nm 16nm 11nm Core die budget (mm 2 ) Normalized area () (16x) Core power (W) Bandwidth (GB/s) (1.4x) Rel pwr per device 1X 0.75X 0.5X 0.36X 0.25X Fast Core 2011 parameters reflect high end systems of the day; future parameters extrapolated from ITRS mm 2 populated by an optimally sized Fast Core and s of choice S18 L27 S24, James C. Hoe, CM/ECE/CALCM, 2018

25 Single Prec. MMMult (f=99%) 300 ASIC 250 Speedup nm 32nm 22nm 16nm 11nm SymMC AsymMC (0) SymCMP (1) AsymCMP (2) ASIC FPGA (3) LX760 GP R5870 (5) R5870 GP R5870 FPGA LX760 Sym f=0.990 & Asym multi Power Bound Mem Bound S18 L27 S25, James C. Hoe, CM/ECE/CALCM, 2018

26 Single Prec. MMMult (f=90%) Speedup ASIC GP R5870 FPGA LX760 Asymmetric multi Symmetric multi f=0.900 x 0 40nm 32nm 22nm 16nm 11nm SymMC AsymMC (0) SymCMP (1) AsymCMP (2) ASIC FPGA (3) LX760 GP R5870 (5) R5870 Power Bound Mem Bound S18 L27 S26, James C. Hoe, CM/ECE/CALCM, 2018

27 Single Prec. MMMult (f=50%) ASIC/GP/FPGA Asymmetric Symmetric Speedup f= nm 32nm 22nm 16nm 11nm SymMC AsymMC (0) SymCMP (1) AsymCMP (2) ASIC ASIC FPGA (3) LX760 GP R5870 (5) R5870 Power Bound Mem Bound S18 L27 S27, James C. Hoe, CM/ECE/CALCM, 2018

28 Single Prec. FFT 1024 (f=99%) 60 ASIC/GP/FPGA 50 Speedup Asymmetric Symmetric 10 f= nm 32nm 22nm 16nm 11nm SymMC AsymMC (0) SymCMP (1) AsymCMP (2) ASIC FPGA (3) LX760 GP R5870 (4) GTX480 Power Bound Mem Bound S18 L27 S28, James C. Hoe, CM/ECE/CALCM, 2018

29 FFT 1024 (f=99%) if hypothetical 1TB/sec bandwidth 200 ASIC Speedup FPGA LX760 GP GTX nm 32nm 22nm 16nm 11nm SymMC AsymMC (0) SymCMP (1) AsymCMP (2) ASIC ASIC FPGA (3) LX760 GP R5870 (4) GTX480 f=0.990 Sym & Asym multi Power Bound Mem Bound S18 L27 S29, James C. Hoe, CM/ECE/CALCM, 2018

30 You will be seeing more of this Performance scaling requires improved efficiency in Op/Joules and Perf/Watt Hardware acceleration is the most direct way to improve energy/power efficiency Need better hardware design methodology to enable application developers (without losing hardware s advantages) Software is easy; hardware is hard? Hardware isn t hard; perf and efficiency is!!! S18 L27 S30, James C. Hoe, CM/ECE/CALCM, 2018

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing: