BeiHang Short Course, Part 7: HW Acceleration: It's about Performance, Energy and Power

James C. Hoe, Department of ECE, Carnegie Mellon University

Reference: Eric S. Chung, et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?" Proc. ACM/IEEE International Symposium on Microarchitecture (MICRO), pp. 53-64, December 2010.

Moore's Law
- "The number of transistors that can be economically integrated shall double every 24 months"
[http://www.intel.com/research/silicon/mooreslaw.htm]

The Real Moore's Law
[Cramming More Components Onto Integrated Circuits, G. E. Moore, 1965]

Transistor Scaling (gate length)
[http://www.intel.com/museum/online/circuits.htm]
- commercial products are at 20~14nm today
- distance between silicon atoms is ~500 pm

Basic Scaling Theory
- Planned scaling occurs in nodes ~0.7x of the previous, e.g., 90nm, 65nm, 45nm, 32nm, 20nm, 14nm, ... The End
- Take the same design and reduce the linear dimensions by 0.7x (aka "gate shrink"); ideally this leads to:
  - die area = 0.5x
  - delay = 0.7x, frequency = 1.43x
  - capacitance = 0.7x
  - Vdd = 0.7x (if constant field) or Vdd = 1x (if constant voltage)
  - power = C V^2 f = 0.5x (if constant field) BUT power = 1x (if constant voltage)
- Take the same area, then:
  - transistor count = 2x
  - power = 1x (constant field), power = 2x (constant voltage)

Moore's Law Performance
- According to scaling theory, we should get:
  - @constant complexity: 1x transistors at 1.43x frequency, i.e., 1.43x performance at 0.5x power
  - @max complexity: 2x transistors at 1.43x frequency, i.e., 2.8x performance at constant power
- Instead, we have been getting (for high-performance CPUs):
  - ~2x transistors
  - ~2x frequency (note: faster than scaling alone would give us)
  - all together, about ~2x performance at ~2x power
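The scaling arithmetic above follows directly from power = C V^2 f. As a quick check (my sketch, not from the slides), the Python snippet below reproduces the 0.5x/1x/2x power outcomes under the constant-field vs. constant-voltage assumptions, using only the 0.7x linear-shrink factor stated above.

```python
# Minimal sketch of the ideal scaling arithmetic described above.
# area ~ s^2, delay ~ s, C ~ s, and power = C * V^2 * f.

S = 0.7  # linear shrink per node (from the slide)

def scale_same_design(constant_field=True):
    """Scale the same design (same transistor count) by one node."""
    area = S * S                 # ~0.5x
    freq = 1.0 / S               # ~1.43x
    cap = S                      # 0.7x
    vdd = S if constant_field else 1.0
    power = cap * vdd**2 * freq  # C * V^2 * f
    return {"area": area, "freq": freq, "power": power}

def scale_same_area(constant_field=True):
    """Fill the same die area with 2x transistors at the new node."""
    per_design = scale_same_design(constant_field)
    return {"transistors": 1.0 / (S * S),            # ~2x
            "freq": per_design["freq"],
            "power": per_design["power"] / (S * S)}  # ~1x or ~2x

print(scale_same_design(True))    # constant field: power ~0.5x
print(scale_same_area(False))     # constant voltage: power ~2x
```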

Performance (In)efficiency
- To hit the expected performance target on single-thread microprocessors:
  - we had been pushing frequency harder by deepening pipelines
  - we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don't stall (i.e., caches, branch prediction, superscalar, out-of-order)
- The consequence of performance inefficiency: the limit of economical PC cooling [ITRS]. Intel P4 Tejas at 150W (guess what happened next)
[from Shekhar Borkar, IEEE Micro, July 1999]

How to run faster on less power?
[Figure: technology-normalized power (Watt) vs. technology-normalized performance (op/sec) for Intel microprocessors from the 486 to the Pentium 4; power grows roughly as performance^1.75. From "Energy per Instruction Trends in Intel Microprocessors," Grochowski et al., 2006]
- Better to replace 1 of this by 2 of these, or N of these (i.e., trade one big, hot core for several smaller, more efficient ones)

Multis
- 1970~2005: one Big Core per die
- 2005~??: many cores per die

So what is the problem?
- We (actually Intel, et al.) know how to pack more cores on a die to stay on Moore's law in terms of aggregate or throughput performance
- How to use them?
  - life is good if your N units of work is N independent programs: just run them
  - what if your N units of work is N operations of the same program? rewrite it as a parallel program
  - what if your N units of work is N sequentially dependent operations of the same program???
- How many cores can you use up meaningfully?

Future is about Perf/Watt and Op/Joule
- By 2022 (11nm node), a large die (~500mm²) will have over 10 billion transistors
- What will you choose to put on it? Big Core, Custom Logic, GPGPU, FPGA?

Amdahl's Law on Speedup
- A program is rarely completely parallelizable. Let's say a fraction f can be sped up by a factor of p and the rest cannot:
  - time_baseline = (1-f) + f
  - time_accelerated = (1-f) + f/p
  - Speedup = 1 / ((1-f) + f/p)
- If f is small, p doesn't matter. An architect also cannot ignore the sequential performance of the (1-f) portion.
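As a quick, concrete illustration (my sketch, not from the slides), the snippet below evaluates the speedup expression above for a few values of f and p; it makes the point that when f is small, even an enormous p buys little.

```python
# Minimal sketch of Amdahl's Law as stated above: a fraction f of the work
# is sped up by a factor p, the remaining (1 - f) runs at baseline speed.

def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup when fraction f is accelerated by factor p."""
    return 1.0 / ((1.0 - f) + f / p)

# If f is small, p doesn't matter: speedup is capped at 1/(1-f) even for huge p.
for f in (0.5, 0.9, 0.99):
    for p in (10, 100, 1e9):
        print(f"f={f:4.2f} p={p:>10.0f} -> speedup={amdahl_speedup(f, p):6.2f}")
```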

Asymmetric Multis
- Combine a power-hungry fast core with lots of simple, power-efficient slow cores [Hill and Marty, 2008]:
  - the fast core for sequential code
  - the slow cores for parallel sections
- Following [Hill and Marty, 2008], die area is counted in Base Core Equivalents (BCEs): n is the total die area in BCEs, r is the fast core's area in BCEs, and perf_seq(r) is the fast core's sequential performance relative to one BCE

  Speedup_asym = 1 / [ (1-f)/perf_seq(r) + f/(perf_seq(r) + (n - r)) ]

- Solve for the optimal die area allocation depending on the parallelizability f
- But, up to 95% of the energy of von Neumann processors is overhead [Hameed, et al., ISCA 2010]

Modeling Heterogeneous Multis
- For the sake of analysis, break the area devoted to GPU/FPGA/etc. into units that are the same size as a BCE. Each type is characterized by a relative performance μ and a relative power Φ compared to a BCE.
- The [Hill and Marty, 2008] model, simplified, then becomes

  Speedup_hetero = 1 / [ (1-f)/perf_seq(r) + f/(μ·(n - r)) ]

  where f is the fraction parallelizable, n is the total die area in BCEs, r is the fast core area in BCEs, and perf_seq(r) is the fast core's performance relative to one BCE.
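Here is a minimal Python sketch of the two expressions above (mine, not from the slides). The perf_seq(r) = sqrt(r) model for the fast core is a common assumption used with the Hill and Marty framework, not something the slides specify; the example μ is the GTX285 MMM value quoted later.

```python
# Sketch of the area-only speedup models above.
import math

def perf_seq(r: float) -> float:
    """Sequential perf. of an r-BCE fast core relative to one BCE (assumed sqrt(r))."""
    return math.sqrt(r)

def speedup_asym(f: float, n: float, r: float) -> float:
    """Asymmetric multicore: fast core plus (n - r) base cores [Hill and Marty, 2008]."""
    return 1.0 / ((1 - f) / perf_seq(r) + f / (perf_seq(r) + (n - r)))

def speedup_hetero(f: float, n: float, r: float, mu: float) -> float:
    """Heterogeneous: fast core plus (n - r) BCEs of a U-core with relative performance mu."""
    return 1.0 / ((1 - f) / perf_seq(r) + f / (mu * (n - r)))

# Example: n = 64 BCEs of die area, sweep the fast-core size r.
for r in (1, 4, 16, 32):
    print(r,
          round(speedup_asym(0.9, 64, r), 1),
          round(speedup_hetero(0.9, 64, r, mu=3.41), 1))
```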

Case Study [Chung, MICRO 2010]

  Device              Type   Year   Node   Die area   Clock rate
  Intel Core i7 960   CPU    2009   45nm   263mm²     3.2GHz
  Nvidia GTX285       GPU    2008   55nm   470mm²     1.5GHz
  Nvidia GTX480       GPU    2010   40nm   529mm²     1.4GHz
  ATI R5870           GPU    2009   40nm   334mm²     1.5GHz
  Xilinx V6 LX760     FPGA   2009   40nm   (n/a)      0.3GHz
  Std. Cell           ASIC   2007   65nm   (n/a)      (n/a)

- Single-precision floating-point apps: MMMult (dense matrix-matrix multiply), FFT, and Black-Scholes
- Implementations across the platforms: MKL 10.2.3 (multithreaded), Spiral.net (multithreaded), PARSEC (multithreaded), CUDA 2.3, 3.0/3.1, CUBLAS 2.3 and 3.1, CUFFT 2.3 and 3.0, CAL++ (hand-coded), and Spiral.net (hand-coded)

Best Case Performance and Energy: MMM

  Device                     GFLOP/s    (GFLOP/s)/mm²     GFLOP/J
                             (actual)   (norm. to 40nm)   (norm. to 40nm)
  Intel Core i7 (45nm)       96         0.50              1.14
  Nvidia GTX285 (55nm)       425        2.40              6.78
  Nvidia GTX480 (40nm)       541        1.28              3.52
  ATI R5870 (40nm)           1491       5.95              9.87
  Xilinx V6 LX760 (40nm)     204        0.53              3.62
  same RTL std cell (65nm)   (n/a)      19.28             50.73

- CPU and GPU benchmarking is compute-bound; FPGA and Std Cell are effectively compute-bound (no off-chip I/O)
- Power (switching + leakage) measurements isolated the device under test from the rest of the system
- For detail see [Chung, et al., MICRO 2010]

Less Regular Applications

FFT 2^10 (1024-point):
  Device                     GFLOP/s   (GFLOP/s)/mm²   GFLOP/J
  Intel Core i7 (45nm)       67        0.35            0.71
  Nvidia GTX285 (55nm)       250       1.41            4.2
  Nvidia GTX480 (40nm)       453       1.08            4.3
  ATI R5870 (40nm)           (n/a)     (n/a)           (n/a)
  Xilinx V6 LX760 (40nm)     380       0.99            6.5
  same RTL std cell (65nm)   952       239             90

Black-Scholes:
  Device                     Mopts/s   (Mopts/s)/mm²   Mopts/J
  Intel Core i7 (45nm)       487       2.52            4.88
  Nvidia GTX285 (55nm)       10756     60.72           189
  Nvidia GTX480 (40nm)       (n/a)     (n/a)           (n/a)
  ATI R5870 (40nm)           (n/a)     (n/a)           (n/a)
  Xilinx V6 LX760 (40nm)     7800      20.26           138
  same RTL std cell (65nm)   25532     1719            642.5

Φ and μ example values (Φ = relative power, μ = relative performance, per BCE of area)

  Device          MMM              Black-Scholes    FFT 2^10
  Nvidia GTX285   Φ 0.74, μ 3.41   Φ 0.57, μ 17.0   Φ 0.63, μ 2.88
  Nvidia GTX480   Φ 0.77, μ 1.83   (n/a)            Φ 0.47, μ 2.20
  ATI R5870       Φ 1.27, μ 8.47   (n/a)            (n/a)
  Xilinx LX760    Φ 0.31, μ 0.75   Φ 0.26, μ 5.68   Φ 0.29, μ 2.02
  Custom Logic    Φ 0.79, μ 27.4   Φ 4.75, μ 482    Φ 4.96, μ 489

- Example: on an equal-area basis, the GTX285 running MMM delivers 3.41x the performance at 0.74x the power relative to a BCE
- The nominal BCE is based on an Intel Atom in-order processor, 26mm² in a 45nm process

Modeling Power and Bandwidth Budgets

  Speedup_hetero = 1 / [ (1-f)/perf_seq(r) + f/(μ·(n - r)) ]

- The above is based on area alone; a power or bandwidth budget further limits the usable die area:
  - if P is the total power budget, expressed as a multiple of a BCE's power, then usable area (n - r) <= P / Φ
  - if B is the total memory bandwidth, expressed also as a multiple of a BCE's, then usable area (n - r) <= B / μ

Combine Model with ITRS Trends

  Year                      2011    2013    2016    2019    2022
  Technology                40nm    32nm    22nm    16nm    11nm
  Core die budget (mm²)     432     432     432     432     432
  Normalized area (BCEs)    19      37      75      149     298 (16x)
  Core power (W)            100     100     100     100     100
  Bandwidth (GB/s)          180     198     234     234     252 (1.4x)
  Rel. power per device     1X      0.75X   0.5X    0.36X   0.25X

- 2011 parameters reflect high-end systems today; future parameters are extrapolated from ITRS 2009
- The 432mm² die is populated by an optimally sized Fast Core plus the BCEs of one's choice (cores, GPU, FPGA, or custom logic)
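To make the constrained model concrete, here is a small Python sketch (mine, not from the slides): the usable accelerator area is capped by the smallest of the area, power, and bandwidth budgets, with P and B expressed in BCE-equivalents as stated above. The perf_seq(r) = sqrt(r) assumption and the specific P and B values are illustrative, not from the slides.

```python
# Sketch of the heterogeneous speedup model with power and bandwidth caps,
# following the constraints above: usable area is limited to P/Phi and B/mu.
import math

def perf_seq(r):
    return math.sqrt(r)  # common Hill-Marty assumption, not from the slides

def speedup(f, n, r, mu, phi, P, B):
    """Speedup when accelerator area is capped by area, power, and bandwidth budgets."""
    usable_area = min(n - r, P / phi, B / mu)   # usable (n - r) in BCEs
    return 1.0 / ((1 - f) / perf_seq(r) + f / (mu * usable_area))

# Illustrative sweep: a 2022-style die of 298 BCEs with a GTX285-like MMM
# accelerator (mu = 3.41, phi = 0.74); P and B are assumed BCE-equivalent budgets.
n = 298
for r in (4, 16, 64):
    print(r, round(speedup(f=0.99, n=n, r=r, mu=3.41, phi=0.74, P=120, B=60), 1))
```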

Single Prec. MMMult (f=90%)
[Figure: speedup vs. technology node (40nm, 32nm, 22nm, 16nm, 11nm) for single-precision MMMult with f=0.900, y-axis up to 40x; curves for symmetric multicore, asymmetric multicore, GPU (R5870), FPGA (LX760), and ASIC, with the ASIC on top; power-bound and memory-bound points are marked.]

Single Prec. MMMult (f=99%)
[Figure: same comparison with f=0.990, y-axis up to 300x; the ASIC leads, followed by the GPU (R5870) and FPGA (LX760), with the symmetric and asymmetric multicores far behind; power-bound and memory-bound points are marked.]

Single Prec. MMMult (f=50%)
[Figure: same comparison with f=0.500, y-axis up to 8x; the ASIC/GPU/FPGA curves bunch together just above the asymmetric multicore, with the symmetric multicore lowest; with so little parallel work, the choice of accelerator matters little. Power-bound and memory-bound points are marked.]

Single Prec. FFT 1024 (f=99%)
[Figure: speedup vs. technology node for single-precision 1024-point FFT with f=0.990, y-axis up to 60x; the ASIC/GPU (GTX480)/FPGA (LX760) curves converge above the asymmetric and symmetric multicores; power-bound and memory-bound points are marked.]

FFT 1024 (f=99%) with a hypothetical 1TB/sec bandwidth
[Figure: same FFT comparison with f=0.990 assuming 1TB/s of memory bandwidth, y-axis up to 200x; with the bandwidth bottleneck removed, the ASIC pulls ahead of the FPGA (LX760) and GPU (GTX480), which in turn stay well above the symmetric and asymmetric multicores; power-bound and memory-bound points are marked.]

Recap the Last Hour
- Perf per Watt and op per Joule matter: we need to look beyond programmable processors
- GPUs and FPGAs are viable programmable candidates
- GPU/FPGA/custom logic help performance only if
  - a significant fraction of the work is amenable to acceleration, and
  - there is adequate bandwidth to sustain the acceleration (3D-stacked memory could be very helpful!)
- Without adequate bandwidth, GPU/FPGA catch up with custom logic in achievable performance but remain programmable and flexible
- Custom logic is best for maximizing performance under tight energy/power constraints

Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/calcm
J. C. Hoe, CMU/ECE/CALCM, 2014