Lecture 27: Hardware Acceleration. James C. Hoe Department of ECE Carnegie Mellon University

18-447 Lecture 27: Hardware Acceleration
James C. Hoe, Department of ECE, Carnegie Mellon University
18-447-S18-L27-S1, James C. Hoe, CMU/ECE/CALCM, 2018

Housekeeping
Your goal today: see why you should care about accelerators, and know the basics to think about the topic.
Notices: Lab 4 due this week; HW5 past due; practice final solutions (hardcopies).
Readings (optional): "Amdahl's Law in the Multicore Era," 2008; "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?" 2010.
18-447-S18-L27-S2, James C. Hoe, CMU/ECE/CALCM, 2018

HW Acceleration is nothing new!
What needed to be faster/smaller/cheaper/lower-energy than SW has always been done in HW.
We go to HW when SW isn't good enough, because good HW can be more efficient.
We don't go to HW when SW is good enough, because good HW takes more work.
When we say HW acceleration, we always mean efficient and not just correct.
18-447-S18-L27-S3, James C. Hoe, CMU/ECE/CALCM, 2018

Computing's Brave New World
Microsoft Catapult [MICRO 2016, Caulfield, et al.]; Google TPU [Hot Chips 2017, Jeff Dean].
18-447-S18-L27-S4, James C. Hoe, CMU/ECE/CALCM, 2018

How we got here...
[Figure: the Big Core era, 1970~2005, followed by 2005~??]
18-447-S18-L27-S5, James C. Hoe, CMU/ECE/CALCM, 2018

Moore's Law without Dennard Scaling
[Figure: projection from the 2013 International Technology Roadmap for Semiconductors across node labels 14, 10, 7, 5, 3.5, 2.5, 1.8, ??: logic density grows >16x while VDD drops only ~25%.]
Under a fixed power ceiling, more ops/second is only achievable with fewer Joules/op.
18-447-S18-L27-S6, James C. Hoe, CMU/ECE/CALCM, 2018
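To make the fixed-power-ceiling point concrete, here is a minimal sketch; the 100 W budget and the pJ/op values are my own illustrative assumptions, not from the slides. Sustained throughput is just the power budget divided by the energy per op, so ops/second only scales if Joules/op shrink.

```python
# Minimal sketch (illustrative numbers, not the slide's): under a fixed
# power ceiling, sustained throughput is bounded by budget / energy-per-op.

POWER_BUDGET_W = 100.0   # hypothetical fixed power ceiling

def ops_per_second(joules_per_op: float, budget_w: float = POWER_BUDGET_W) -> float:
    return budget_w / joules_per_op

for pj_per_op in (1000.0, 100.0, 10.0):
    gops = ops_per_second(pj_per_op * 1e-12) / 1e9
    print(f"{pj_per_op:6.0f} pJ/op -> {gops:7.0f} Gop/s at {POWER_BUDGET_W:.0f} W")
# A 10x cut in Joules/op buys a 10x gain in ops/s at the same power.
```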

Future is about Performance/Watt and Ops/Joule
What will (or can) you do with all those transistors? [Figure: a die split among a Big Core, GPGPU, FPGA, and Custom Logic.]
This is a sign of desperation...
18-447-S18-L27-S7, James C. Hoe, CMU/ECE/CALCM, 2018

Why is Computing Directly in Hardware Efficient?
18-447-S18-L27-S8, James C. Hoe, CMU/ECE/CALCM, 2018

Why is HW/FPGA better? No overhead.
A processor spends a lot of transistors and energy to present the von Neumann ISA abstraction and to support a broad application base (e.g., caches, superscalar out-of-order execution, prefetching, ...).
In fact, a processor is mostly overhead: ~90% of energy [Hameed, ISCA 2010, Tensilica]; ~95% of energy [Balfour, CAL 2007, embedded RISC]; even worse on a high-performance superscalar OoO processor.
Computing directly in application-specific hardware can be 10x to 100x more energy efficient.
18-447-S18-L27-S9, James C. Hoe, CMU/ECE/CALCM, 2018
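A rough sanity check on those percentages (my own arithmetic, not the lecture's): if a fraction of the energy is pure overhead, eliminating it improves energy per op by about 1/(1 - overhead).

```python
# Back-of-the-envelope check (my arithmetic, not the lecture's): if a
# fraction 'overhead' of a processor's energy goes to instruction delivery,
# caches, OoO bookkeeping, etc., then keeping only the useful datapath work
# improves energy per op by about 1 / (1 - overhead).
for overhead in (0.90, 0.95):
    print(f"{overhead:.0%} overhead -> up to ~{1.0 / (1.0 - overhead):.0f}x less energy per op")
# ~10x and ~20x; gains toward 100x presumably come from also specializing
# the datapath, not just stripping instruction-processing overhead.
```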

Why is HW/FPGA better? Efficiency of parallelism.
For a given functionality, there is a non-linear tradeoff between power and performance: a slower design is simpler, and a lower frequency needs a lower voltage.
For the same throughput, replacing 1 module by 2 modules each half as fast reduces total power and energy.
[Figure: power (Joules/sec) vs. performance (ops/sec), with Power ∝ Perf^k, k > 1; better to replace 1 fast module by 2 half-speed modules, or N slower ones.]
Good hardware designs derive performance from parallelism.
18-447-S18-L27-S10, James C. Hoe, CMU/ECE/CALCM, 2018
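A minimal sketch of why two half-speed modules beat one fast one, assuming the textbook dynamic-power model and that supply voltage can be scaled roughly in proportion to frequency near the operating point (both are simplifications I am supplying; the slide only asserts the non-linear Power ∝ Perf^k relationship):

```python
# Sketch of the parallelism/voltage-scaling argument, under the simplifying
# assumptions (mine, not the lecture's exact numbers) that dynamic power is
# P ~ a*C*V^2*f and that V can be scaled roughly linearly with f.

def dynamic_power(freq_hz: float, volts: float,
                  cap_farads: float = 1e-9, activity: float = 0.1) -> float:
    """Classic switching-power model: P = activity * C * V^2 * f."""
    return activity * cap_farads * volts**2 * freq_hz

BASE_F, BASE_V = 1e9, 1.0            # baseline: one module at 1 GHz, 1.0 V

one_fast = dynamic_power(BASE_F, BASE_V)
# Two modules at half frequency and (assumed) proportionally lower voltage
# deliver the same aggregate throughput.
two_slow = 2 * dynamic_power(BASE_F / 2, BASE_V / 2)

print(f"1 module  @ f   : {one_fast*1e3:.1f} mW")
print(f"2 modules @ f/2 : {two_slow*1e3:.1f} mW  (same throughput)")
# With V ~ f, power scales ~f^3 per module, so 2*(f/2)^3 = f^3/4: about
# 4x less power AND 4x less energy per op at equal throughput.
```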

Software-to-Hardware Spectrum
CPU: highest-level abstraction / most general-purpose support.
GPU: explicitly parallel programs / best for SIMD, regular computation.
FPGA: ASIC-like abstraction / overhead for reprogrammability.
ASIC: lowest-level abstraction / fixed application and tuning.
From software to hardware, the tradeoff is between efficiency and effort.
18-447-S18-L27-S11, James C. Hoe, CMU/ECE/CALCM, 2018

Case Study [Chung, MICRO 2010]
             CPU                GPU             GPU            FPGA              ASIC
Device       Intel Core i7-960  Nvidia GTX285   ATI R5870      Xilinx V6 LX760   Std. Cell
Year         2009               2008            2009           2009              2007
Node         45nm               55nm            40nm           40nm              65nm
Die area     263mm2             470mm2          334mm2         --                --
Clock rate   3.2GHz             1.5GHz          1.5GHz         0.3GHz            --
Single-precision floating-point apps: MMM (matrix-matrix multiply), FFT, Black-Scholes.
Implementations: MKL 10.2.3, Spiral.net, and PARSEC (all multithreaded) on the CPU; CUBLAS 2.3, CUFFT 2.3, and CUDA 2.3 on the GTX285; hand-coded CAL++ 3.0/3.1 on the R5870; Spiral.net-generated and hand-coded RTL on the FPGA and std cell.
18-447-S18-L27-S12, James C. Hoe, CMU/ECE/CALCM, 2018

Best-Case Performance and Energy: MMM
Device                     GFLOP/s (actual)   (GFLOP/s)/mm2 (norm. to 40nm)   GFLOP/J (norm. to 40nm)
Intel Core i7 (45nm)       96                 0.50                            1.14
Nvidia GTX285 (55nm)       425                2.40                            6.78
ATI R5870 (40nm)           1491               5.95                            9.87
Xilinx V6 LX760 (40nm)     204                0.53                            3.62
same RTL, std cell (65nm)  --                 19.28                           50.73
CPU and GPU benchmarking is compute bound; FPGA and std cell are effectively compute bound (no off-chip I/O).
Power (switching + leakage) measurements isolated the device from the system.
For details see [Chung, et al., MICRO 2010].
18-447-S18-L27-S13, James C. Hoe, CMU/ECE/CALCM, 2018

Less Regular Applications
FFT 2^10                   GFLOP/s   (GFLOP/s)/mm2   GFLOP/J
Intel Core i7 (45nm)       67        0.35            0.71
Nvidia GTX285 (55nm)       250       1.41            4.2
ATI R5870 (40nm)           --        --              --
Xilinx V6 LX760 (40nm)     380       0.99            6.5
same RTL, std cell (65nm)  952       239             90

Black-Scholes              Mopts/s   (Mopts/s)/mm2   Mopts/J
Intel Core i7 (45nm)       487       2.52            4.88
Nvidia GTX285 (55nm)       10756     60.72           189
ATI R5870 (40nm)           --        --              --
Xilinx V6 LX760 (40nm)     7800      20.26           138
same RTL, std cell (65nm)  25532     1719            642.5
18-447-S18-L27-S14, James C. Hoe, CMU/ECE/CALCM, 2018

ASIC isn't always the ultimate in performance
Amdahl's Law: S_overall = 1 / ((1 - f) + f/S_f)
S_f(ASIC) > S_f(FPGA), but f_ASIC and f_FPGA differ: f_FPGA > f_ASIC (when the workload is not perfectly application-specific), because a more flexible design can cover a greater fraction of the application, and an FPGA can be reprogrammed to cover different applications.
[Figure: S_overall vs. covered fraction f for S_f(ASIC) and S_f(FPGA), marking f_ASIC and f_FPGA at break-even.]
[Based on Joel Emer's original comment about programmable accelerators in general.]
18-447-S18-L27-S15, James C. Hoe, CMU/ECE/CALCM, 2018
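A small sketch of the break-even behavior this slide describes, using hypothetical speedup and coverage numbers of my own choosing (the slide gives none): the FPGA wins overall once its larger coverage fraction outweighs the ASIC's larger per-kernel speedup.

```python
# Amdahl's-law comparison of an ASIC and an FPGA accelerator.
# The speedups (s_asic, s_fpga) and coverage fractions below are
# hypothetical, chosen only to illustrate the break-even behavior.

def overall_speedup(f: float, s: float) -> float:
    """S_overall = 1 / ((1 - f) + f/s)."""
    return 1.0 / ((1.0 - f) + f / s)

s_asic, s_fpga = 500.0, 50.0      # ASIC accelerates its kernel much more...
f_asic, f_fpga = 0.60, 0.85       # ...but the FPGA covers more of the app

print("ASIC overall:", round(overall_speedup(f_asic, s_asic), 2))   # ~2.49x
print("FPGA overall:", round(overall_speedup(f_fpga, s_fpga), 2))   # ~5.99x

# Break-even coverage for the FPGA: smallest f whose overall speedup
# matches the ASIC's, found by a coarse sweep (~0.61 here).
target = overall_speedup(f_asic, s_asic)
f_even = next(f / 1000 for f in range(1001)
              if overall_speedup(f / 1000, s_fpga) >= target)
print("FPGA breaks even at f =", f_even)
```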

Tradeoff in Heterogeneity?
18-447-S18-L27-S16, James C. Hoe, CMU/ECE/CALCM, 2018

Amdahl's Law on Multicore
Base Core Equivalent (BCE) in [Hill and Marty, 2008].
A program is rarely completely parallelizable; let's say a fraction f is perfectly parallelizable.
Speedup of n BCEs over one sequential BCE:
    Speedup = 1 / ((1 - f) + f/n)
For small f, the die area is underutilized.
18-447-S18-L27-S17, James C. Hoe, CMU/ECE/CALCM, 2018
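A quick numeric sketch of the slide's formula (the f values mirror the curves on the next slide; the BCE counts are my own illustrative choices):

```python
# Amdahl's law for a symmetric multicore of n unit cores (BCEs), as on
# the slide: Speedup = 1 / ((1 - f) + f/n).

def speedup_symmetric(f: float, n: int) -> float:
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.5, 0.9, 0.99, 0.999):
    s16, s256 = speedup_symmetric(f, 16), speedup_symmetric(f, 256)
    print(f"f={f:5.3f}: 16 BCEs -> {s16:6.2f}x, 256 BCEs -> {s256:7.2f}x")
# For small f (e.g., 0.5), going from 16 to 256 BCEs barely helps
# (~1.88x -> ~1.99x): most of the die area is underutilized.
```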

[Plot from http://research.cs.wisc.edu/multifacet/amdahl/ : symmetric-multicore speedup vs. size of the cores in BCEs, for F = 0.999, 0.99, 0.9, 0.5; one end of the axis is more, smaller cores, the other is fewer, larger cores.]
18-447-S18-L27-S18, James C. Hoe, CMU/ECE/CALCM, 2018

Asymmetric Multicores
[Figure: one Fast Core plus many Base Core Equivalents (BCEs), as in [Hill and Marty, 2008].]
Power/area-efficient slow BCEs vs. a power/area-hungry fast core: the fast core runs the sequential code, the slow BCEs run the parallel sections [Hill and Marty, 2008].
    Speedup = 1 / ((1 - f)/perf_seq + f/(perf_seq + n - r))
where r = cost of the fast core in BCEs and perf_seq = speedup of the fast core over one BCE. Solve for the optimal die allocation.
18-447-S18-L27-S19, James C. Hoe, CMU/ECE/CALCM, 2018
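A sketch of this model in code, carrying over the next slide's assumption that the fast core's sequential performance grows with the square root of its area (perf_seq(r) = sqrt(r)); the n and f values are illustrative:

```python
import math

# Hill-and-Marty-style asymmetric multicore model from the slide:
#   Speedup = 1 / ((1-f)/perf_seq + f/(perf_seq + n - r))
# Assumption carried from the next slide: perf_seq(r) = sqrt(r).

def speedup_asymmetric(f: float, n: int, r: int) -> float:
    perf_seq = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf_seq + f / (perf_seq + n - r))

def best_allocation(f, n):
    """Sweep the fast core's size r (in BCEs) for the best speedup."""
    return max(((r, speedup_asymmetric(f, n, r)) for r in range(1, n + 1)),
               key=lambda pair: pair[1])

n = 256
for f in (0.5, 0.9, 0.99):
    r, s = best_allocation(f, n)
    print(f"f={f}: best fast-core size r={r} BCEs -> {s:.1f}x")
# Less parallel workloads (smaller f) favor spending more of the die on
# the single fast core; highly parallel ones favor more small BCEs.
```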

[Plot from http://research.cs.wisc.edu/multifacet/amdahl/ : asymmetric-multicore speedup vs. size of the fast core in BCEs, for F = 0.999, 0.99, 0.9, 0.5; assumes the fast core's performance grows with the square root of its area.]
18-447-S18-L27-S20, James C. Hoe, CMU/ECE/CALCM, 2018

Heterogeneous Multicores [Chung, et al., MICRO 2010]
[Figure: an asymmetric die (Fast Core + Base Core Equivalents) vs. a heterogeneous die (Fast Core + U-cores).]
Asymmetric ([Hill and Marty, 2008], simplified):
    Speedup = 1 / ((1 - f)/perf_seq(r) + f/(n - r))
Heterogeneous:
    Speedup = 1 / ((1 - f)/perf_seq(r) + f/(μ(n - r)))
f is the fraction parallelizable; n is the total die area in BCE units; r is the fast core's area in BCE units; perf_seq(r) is the fast core's performance relative to one BCE.
For the sake of analysis, break the area for GPU/FPGA/etc. into units of U-cores that are the same size as BCEs. Each U-core type is characterized by a relative performance μ and a relative power Φ compared to a BCE.
18-447-S18-L27-S21, James C. Hoe, CMU/ECE/CALCM, 2018
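Here is a minimal coding of the two formulas side by side. The parameters are my own illustrative choices; perf_seq(r) = sqrt(r) is the same assumption used on the earlier asymmetric slides, and μ = 3.4 loosely echoes the GTX285 MMM value on the next slide.

```python
import math

# Area-only models from the slide. perf_seq(r) = sqrt(r) is an assumption
# carried over from the asymmetric-multicore slides; mu is the U-core's
# performance per BCE-sized unit of area, relative to a BCE.

def speedup_asym(f: float, n: float, r: float) -> float:
    perf_seq = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf_seq + f / (n - r))

def speedup_het(f: float, n: float, r: float, mu: float) -> float:
    perf_seq = math.sqrt(r)
    return 1.0 / ((1.0 - f) / perf_seq + f / (mu * (n - r)))

# Illustrative comparison (numbers hypothetical): with mu = 3.4, U-cores
# run the parallel section ~3.4x faster per unit area than BCEs would.
print(speedup_asym(0.99, n=64, r=4))         # baseline asymmetric multicore
print(speedup_het(0.99, n=64, r=4, mu=3.4))  # heterogeneous with U-cores
```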

Φ and μ example values
                  MMM        Black-Scholes   FFT 2^10
Nvidia GTX285     Φ 0.74     Φ 0.57          Φ 0.63
                  μ 3.41     μ 17.0          μ 2.88
Xilinx LX760      Φ 0.31     Φ 0.26          Φ 0.29
                  μ 0.75     μ 5.68          μ 2.02
Custom Logic      Φ 0.79     Φ 4.75          Φ 4.96
                  μ 27.4     μ 482           μ 489
E.g., on an equal-area basis, the GTX285 delivers 3.41x the MMM performance of a BCE at 0.74x the power.
Nominal BCE based on an Intel Atom in-order processor, 26mm2 in a 45nm process.
18-447-S18-L27-S22, James C. Hoe, CMU/ECE/CALCM, 2018

Modeling Power and Bandwidth Budgets
[Figure: heterogeneous die with a Fast Core and U-cores.]
    Speedup = 1 / ((1 - f)/perf_seq(r) + f/(μ(n - r)))
The above is based on area alone; a power or bandwidth budget further limits the usable die area (see the sketch below):
if P is the total power budget expressed as a multiple of a BCE's power, usable U-core area <= min(n - r, P/Φ)
if B is the total memory bandwidth expressed as a multiple of a BCE's, usable U-core area <= min(n - r, B/μ)
18-447-S18-L27-S23, James C. Hoe, CMU/ECE/CALCM, 2018
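A sketch extending the previous function with these two caps. As before, perf_seq(r) = sqrt(r) is assumed, the μ, Φ, P, B values are illustrative, and the min() reading of the slide's constraints is my reconstruction.

```python
import math

# Heterogeneous speedup with power and bandwidth caps on usable U-core area.
# P is the power budget and B the memory-bandwidth budget, both expressed
# as multiples of one BCE's power / bandwidth, as on the slide.

def speedup_het_budgeted(f: float, n: float, r: float,
                         mu: float, phi: float,
                         P: float, B: float) -> float:
    perf_seq = math.sqrt(r)                      # assumed, as before
    usable = min(n - r,                          # area available to U-cores
                 P / phi,                        # power-limited area
                 B / mu)                         # bandwidth-limited area
    return 1.0 / ((1.0 - f) / perf_seq + f / (mu * usable))

# Illustrative: generous area but a tight bandwidth budget caps the benefit.
print(speedup_het_budgeted(0.99, n=75, r=4, mu=3.4, phi=0.74, P=100, B=60))
```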

Combine Model with ITRS Trends
Year                     2011    2013    2016    2019    2022
Technology               40nm    32nm    22nm    16nm    11nm
Core die budget (mm2)    432     432     432     432     432
Normalized area (BCEs)   19      37      75      149     298    (16x)
Core power (W)           100     100     100     100     100
Bandwidth (GB/s)         180     198     234     234     252    (1.4x)
Rel. pwr per device      1X      0.75X   0.5X    0.36X   0.25X
2011 parameters reflect high-end systems of the day; future parameters are extrapolated from ITRS 2009.
The 432mm2 die is populated by an optimally sized Fast Core and U-cores of choice.
18-447-S18-L27-S24, James C. Hoe, CMU/ECE/CALCM, 2018

Single-Precision MMM (f = 99%)
[Chart: speedup (0 to 300) vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for symmetric multicore, asymmetric multicore, ASIC, FPGA (LX760), and GPU (R5870); markers distinguish power-bound from memory-bound design points. The ASIC curve reaches the top of the chart; the symmetric and asymmetric multicores are grouped at the bottom.]
18-447-S18-L27-S25, James C. Hoe, CMU/ECE/CALCM, 2018

Single-Precision MMM (f = 90%)
[Chart: speedup (0 to 40) vs. technology node for the same design points; the ASIC, GPU (R5870), and FPGA (LX760) curves sit above the asymmetric and symmetric multicores; markers distinguish power-bound from memory-bound points.]
18-447-S18-L27-S26, James C. Hoe, CMU/ECE/CALCM, 2018

Single-Precision MMM (f = 50%)
[Chart: speedup (0 to 8) vs. technology node; with only half the work parallelizable, the ASIC, GPU, and FPGA curves are grouped together (labeled ASIC/GPU/FPGA), just above the asymmetric and symmetric multicores; markers distinguish power-bound from memory-bound points.]
18-447-S18-L27-S27, James C. Hoe, CMU/ECE/CALCM, 2018

Single-Precision FFT 1024 (f = 99%)
[Chart: speedup (0 to 60) vs. technology node; the ASIC, GPU, and FPGA curves are grouped together (labeled ASIC/GPU/FPGA) above the asymmetric and symmetric multicores; the legend lists the FPGA LX760 and GPU GTX480; markers distinguish power-bound from memory-bound points.]
18-447-S18-L27-S28, James C. Hoe, CMU/ECE/CALCM, 2018

FFT 1024 (f = 99%) with a hypothetical 1 TB/sec bandwidth
[Chart: speedup (0 to 200) vs. technology node; with the memory-bandwidth limit lifted, the ASIC pulls ahead, followed by the FPGA (LX760) and GPU (GTX480), with the symmetric and asymmetric multicores at the bottom; markers distinguish power-bound from memory-bound points.]
18-447-S18-L27-S29, James C. Hoe, CMU/ECE/CALCM, 2018

You will be seeing more of this
Performance scaling requires improved efficiency in ops/Joule and perf/Watt.
Hardware acceleration is the most direct way to improve energy/power efficiency.
We need better hardware design methodologies to enable application developers (without losing hardware's advantages).
Software is easy; hardware is hard? Hardware isn't hard; performance and efficiency are!!!
18-447-S18-L27-S30, James C. Hoe, CMU/ECE/CALCM, 2018