BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power
|
|
- Lorin Welch
- 5 years ago
- Views:
Transcription
1 BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing: Does the future include Custom Logic, FPGAs, and GPs? Proc. ACM/IEEE International Symposium on Microarchitecture (MICRO), pp 53 64, December J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s1 Moore s Law The number of transistors that can be economically integrated shall double every 24 months J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s2 [ 1
2 The Real Moore s Law J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s3 [Cramming More Components Onto Integrated Circuits, G. E. Moore, 1965] Transistor Scaling gate length J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s4 [ commercial products are at 20~14nm today distance between silicon atoms ~ 500 pm 2
3 Basic Scaling Theory Planned scaling occurs in nodes ~0.7x of the previous e.g., 90nm, 65nm, 45nm, 32nm, 20nm, 14nm, The End Take the same design, reducing the linear dimensions by 0.7x (aka gate shrink ) leads to (ideally) die area = 0.5x delay = 0.7x, frequency=1.43x capacitance = 0.7x Vdd=0.7x (if constant field) or Vdd=1x (if constant voltage) power = C V 2 f = 0.5x (if constant field) BT power = 1x (if constant voltage) Take the same area, then transistor count = 2x power = 1x (constant field), power = 2x (constant voltage) J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s5 Moore s Law Performance According to scaling theory, we should complexity: 1x transistors at 1.43x frequency 1.43x performance at 0.5x complexity: 2x transistors at 1.43x frequency 2.8x performance at constant power Instead, we have been getting (for high perf CPs) ~2x transistors ~2x frequency (note: faster than scaling would give us) all together we get about ~2x performance at ~2x power J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s6 3
4 Performance (In)efficiency To hit the expected performance target on single thread microprocessors we had been pushing frequency harder by deepening pipelines we used the 2x transistors to build more complicated microarchitectures so the fast/deep pipelines don t stall (i.e., caches, BP, superscalar, out of order) The consequence of performance inefficiency is limit of economical PC cooling [ITRS] IntelP4 Tehas 150W (Guess what happened next) J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s7 [from Shekhar Borkar, IEEE Micro, July 1999] How to run faster on less power? technology normalized power (Watt) Better to replace 1 of this by 2 of these; Or N of these Pentium 4 Power Perf Energy per Instruction Trends in Intel Microprocessors, Grochowski et al., 2006 J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s8 technology normalized performance (op/sec) 4
5 Multis Big Core 1970~ ~?? J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s9 So what is the problem? We (actually Intel, et al.) know how to pack more s on a die to stay on Moore s law in terms of aggregate or throughput performance How to use them? life is good if your N units of work is N independent programs just run them what if your N units of work is N operations of the same program? rewrite a parallel program what if your N units of work is N sequentially dependent operations of the same program??? How many s can you use up meaningfully? J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s10 5
6 Future is about Perf/Watt and Op/Joule By 2022 (11nm node), a large die (~500mm 2 ) will have over 10 billion transistors Big Core What will you Custom choose Logic to put on it? GPGP FPGA J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s11 Amdahl s Law on Speedup A program is rarely completely parallelizable. Let s say a fraction f can be sped up by a factor of p and the rest can not time baseline (1 f) f time accellerated (1 f) f/p If f is small p doesn t matter. An architect also cannot ignore the sequential performance of the (1 f) portion J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s12 6
7 Asymmetric Multis Fast Core Base Core Equivalent () in [Hill and Marty, 2008] J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s13 Combine a power hungry fast with lots of simple, power efficient slow s fast for sequential code slow s for parallel sections [Hill and Marty, 2008] 1 Speedup 1 f f perf seq ( n r) perf seq solve for optimal ldie area allocation depending on parallelizability f But, up to 95% of energy of von Neuman processors is overhead [Hameed, et al., ISCA 2010] Modeling Heterogeneous Multis Asymmetric Fast Core Base Core Equivalent Heterogeneous Speedup 1 f perf 1 f ( n r) seq perf seq [Hill and Marty, 2008] simplified f is fraction parallelizable n is total die area in units r is fast area in units perf seq (r) is fast perf. relative to Fast Core J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s14 Speedup 1- f perf seq 1 f ( n - r) For the sake of analysis, break the area for GP/FPGA/etc. into units of s that are the same size as s. Each type is characterized by a relative performance µ and relative power compared to a 7
8 Case Study [Chung, MICRO 2010] CP GPs FPGA ASIC Intel Core i7 960 Nvidia GTX285 Nvidia GTX480 ATI R5870 Xilinx V6 LX760 Std. Cell Year Node 45nm 55nm 40nm 40nm 40nm 65nm Die area 263mm 2 470mm 2 529mm 2 334mm 2 Clock rate 3.2GHz 1.5GHz 1.4GHz 1.5GHz 0.3GHz Single prec. floating point apps M M Mult FFT Black Scholes MKL multithreaded Spiral.net multithreaded J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s15 CBLAS 2.3 CFFT /3.1 PARSEC multithreaded CDA 2.3 CBLAS 3.1 CAL++ hand coded CFFT 3.0 Spiral.net hand coded Best Case Performance and Energy MMM Device GFLOP/s actual (GFLOP/s)/mm 2 normalized to 40nm GFLOP/J normalized to 40nm Intel Core i7 (45nm) Nvidia GTX285 (55nm) Nvidia GTX480 (40nm) ATI R5870 (40nm) Xilinx V6 LX760 (40nm) same RTL std cell (65nm) CP and GP benchmarking is compute bound; FPGA and Std Cell effectively compute bound (no off chip I/O) Power (switching+leakage) measurements isolated the from the system For detail see [Chung, et al. MICRO 2010] J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s16 8
9 Less Regular Applications GFLOP/s (GFLOP/s)/mm 2 GFLOP/J FFT 2 10 Intel Core i7 (45nm) Nvidia GTX285 (55nm) Nvidia GTX480 (40nm) ATI R5870 (40nm) Xilinx V6 LX760 (40nm) same RTL std cell (65nm) Black Scholes Mopt/s (Mopts/s)/mm 2 Mopts/J Intel lcore i7 (45nm) Nvidia GTX285 (55nm) Nvidia GTX480 (40nm) ATI R5870 (40nm) Xilinx V6 LX760 (40nm) same RTL std cell (65nm) J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s17 and μ example values MMM Black Scholes FFT 2 10 Nvidia Φ GTX285 μ Nvidia GTX480 Φ μ On equal area basis, 3.41x performance ATI Φ 1.27 R5870 μ 8.47 at 0.74x power Xilinx Φ 0.31 relative 0.26 a 0.29 LX760 μ Custom Logic Φ μ Nominal based on an Intel Atom in order processor, 26mm 2 in a 45nm process J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s18 9
10 Modeling Power and Bandwidth Budgets Heterogeneous Fast Core Speedup 1 1- f f perf seq ( n - r) The above is based on area alone Power or bandwidth budget limits the usable die area if P is total power budget expressed as a multiple of a s power, then usable area n r P if B is total memory bandwidth expressed also as a multiple of s, then usable area n r B μ J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s19 Combine Model with ITRS Trends Year Technology 40nm 32nm 22nm 16nm 11nm Core die budget (mm 2 ) Normalized area () (16x) Core power (W) Bandwidth (GB/s) (1.4x) Rel pwr per device 1X 0.75X 0.5X 0.36X 0.25X Fast Core 2011 parameters reflect high end h systems today; future parameters extrapolated from ITRS mm 2 populated by an optimally sized Fast Core and s of choice J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s20 10
11 Single Prec. MMMult (f=90%) Speedup ASIC GP R5870 FPGA LX760 Asymmetric multi Symmetric multi f= nm 32nm 22nm 16nm 11nm (0) SymCMP SymMC (2) ASIC GP (5) R5870 (1) AsymCMP AsymMC FPGA (3) LX760 Power Bound Mem Bound J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s21 Single Prec. MMMult (f=99%) 300 ASIC 250 Speedup nm 32nm 22nm 16nm 11nm (0) SymCMP SymMC (1) AsymCMP AsymMC (2) ASIC FPGA (3) LX760 GP (5) R5870 GP R5870 FPGA LX760 Sym f=0.990 & Asym multi Power Bound Mem Bound J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s22 11
12 Single Prec. MMMult (f=50%) Speedup 8 ASIC/GP/FPGA 7 Asymmetric Symmetric 1 f= nm 32nm 22nm 16nm 11nm (0) SymCMP SymMC (2) ASIC ASIC GP (5) R5870 (1) AsymCMP AsymMC FPGA (3) LX760 Power Bound Mem Bound J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s23 Single Prec. FFT 1024 (f=99%) 60 ASIC/GP/FPGA 50 Speedup Asymmetric Symmetric 10 f= nm 32nm 22nm 16nm 11nm (0) SymCMP SymMC (2) ASIC (4) GP GTX480 R5870 (1) AsymCMP AsymMC FPGA (3) LX760 Power Bound Mem Bound J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s24 12
13 FFT 1024 (f=99%) if hypothetical 1TB/sec bandwidth 200 ASIC Speedup FPGA LX760 GP GTX nm 32nm 22nm 16nm 11nm (0) SymCMP SymMC (1) AsymCMP AsymMC (2) ASIC ASIC FPGA (3) LX760 (4) GP GTX480 R5870 f=0.990 Sym & Asym multi Power Bound Mem Bound J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s25 Recap the Last Hour Perf per Watt and op per Joule matter need to look beyond programmable processors GPs and FPGAs are viable programmable candidates GP/FPGA/custom logic help performance only if significant fraction amenable to acceleration, and adequate bandwidth to sustain acceleration 3D stacked memory could be very helpful! Without adequate bandwidth, GP/FPGA catches up with custom logic in achievable performance but remains programmable and flexible Custom logic best for maximizing performance under tight energy/power constraints J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s26 13
14 Computer Architecture Lab (CALCM) Carnegie Mellon niversity J. C. Hoe, CM/ECE/CALCM, 2014, BHSC L7 s27 14
Lecture 27: Hardware Acceleration. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 27: Hardware Acceleration James C. Hoe Department of ECE Carnegie Mellon niversity 18 447 S18 L27 S1, James C. Hoe, CM/ECE/CALCM, 2018 18 447 S18 L27 S2, James C. Hoe, CM/ECE/CALCM, 2018
More informationLecture 12: Energy and Power. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 12: Energy and Power James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L12 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today a working understanding of
More informationMICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE
MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By
More informationL16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory
L16: Power Dissipation in Digital Systems 1 Problem #1: Power Dissipation/Heat Power (Watts) 100000 10000 1000 100 10 1 0.1 4004 80088080 8085 808686 386 486 Pentium proc 18KW 5KW 1.5KW 500W 1971 1974
More informationLecture 5: Performance and Efficiency. James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 5: Performance and Efficiency James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L05 S1, James C. Hoe, CMU/ECE/CALCM, 2017 18 643 F17 L05 S2, James C. Hoe, CMU/ECE/CALCM,
More informationCIS 371 Computer Organization and Design
CIS 371 Computer Organization and Design Unit 13: Power & Energy Slides developed by Milo Mar0n & Amir Roth at the University of Pennsylvania with sources that included University of Wisconsin slides by
More informationLecture 23: Illusiveness of Parallel Performance. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 23: Illusiveness of Parallel Performance James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L23 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping peel
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More informationToday. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of
ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message
More informationLecture 2: Metrics to Evaluate Systems
Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video
More informationLecture 4: Technology Scaling
Digital Integrated Circuits (83-313) Lecture 4: Technology Scaling Semester B, 2016-17 Lecturer: Dr. Adam Teman TAs: Itamar Levi, Robert Giterman 2 April 2017 Disclaimer: This course was prepared, in its
More informationToday. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of
ESE532: System-on-a-Chip Architecture Day 22: April 10, 2017 Today Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators 1 2 Message dominates Including limiting performance Make
More informationPerformance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So
Performance Metrics & Architectural Adaptivity ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So What are the Options? Power Consumption Activity factor (amount of circuit switching) Load Capacitance (size
More informationAdministrative Stuff
EE141- Spring 2004 Digital Integrated Circuits Lecture 30 PERSPECTIVES 1 Administrative Stuff Homework 10 posted just for practice. No need to turn in (hw 9 due today). Normal office hours next week. HKN
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationPerformance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So
Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION
More informationCMP 338: Third Class
CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does
More informationFPGA Implementation of a Predictive Controller
FPGA Implementation of a Predictive Controller SIAM Conference on Optimization 2011, Darmstadt, Germany Minisymposium on embedded optimization Juan L. Jerez, George A. Constantinides and Eric C. Kerrigan
More informationAmdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then:
Amdahl's Law Useful for evaluating the impact of a change. (A general observation.) Insight: Improving a feature cannot improve performance beyond the use of the feature Suppose we introduce a particular
More informationVector Lane Threading
Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program
More informationEEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis
EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture Rajeevan Amirtharajah University of California, Davis Outline Announcements Review: PDP, EDP, Intersignal Correlations, Glitching, Top
More informationTU Wien. Energy Efficiency. H. Kopetz 26/11/2009 Model of a Real-Time System
TU Wien 1 Energy Efficiency Outline 2 Basic Concepts Energy Estimation Hardware Scaling Power Gating Software Techniques Energy Sources 3 Why are Energy and Power Awareness Important? The widespread use
More informationLow power Architectures. Lecture #1:Introduction
Low power Architectures Lecture #1:Introduction Dr. Avi Mendelson mendlson@ee.technion.ac.il Contributors: Ronny Ronen, Eli Savransky, Shekhar Borkar, Fred PollackP Technion, EE department Dr. Avi Mendelson,
More informationFrom Physics to Logic
From Physics to Logic This course aims to introduce you to the layers of abstraction of modern computer systems. We won t spend much time below the level of bits, bytes, words, and functional units, but
More informationIntro To Digital Logic
Intro To Digital Logic 1 Announcements... Project 2.2 out But delayed till after the midterm Midterm in a week Covers up to last lecture + next week's homework & lab Nick goes "H-Bomb of Justice" About
More informationMark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions
Lecture Slides Intro Number Systems Logic Functions EE 0 in Context EE 0 EE 20L Logic Design Fundamentals Logic Design, CAD Tools, Lab tools, Project EE 357 EE 457 Computer Architecture Using the logic
More informationMicro-architecture Pipelining Optimization with Throughput- Aware Floorplanning
Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &
More informationPost Von Neumann Computing
Post Von Neumann Computing Matthias Kaiserswerth Hasler Stiftung (formerly IBM Research) 1 2014 IBM Corporation Foundation Purpose Support information and communication technologies (ICT) to advance Switzerland
More informationWelcome to MCS 572. content and organization expectations of the course. definition and classification
Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson
More information3/10/2013. Lecture #1. How small is Nano? (A movie) What is Nanotechnology? What is Nanoelectronics? What are Emerging Devices?
EECS 498/598: Nanocircuits and Nanoarchitectures Lecture 1: Introduction to Nanotelectronic Devices (Sept. 5) Lectures 2: ITRS Nanoelectronics Road Map (Sept 7) Lecture 3: Nanodevices; Guest Lecture by
More informationImplications on the Design
Implications on the Design Ramon Canal NCD Master MIRI NCD Master MIRI 1 VLSI Basics Resistance: Capacity: Agenda Energy Consumption Static Dynamic Thermal maps Voltage Scaling Metrics NCD Master MIRI
More informationECE321 Electronics I
ECE321 Electronics I Lecture 1: Introduction to Digital Electronics Payman Zarkesh-Ha Office: ECE Bldg. 230B Office hours: Tuesday 2:00-3:00PM or by appointment E-mail: payman@ece.unm.edu Slide: 1 Textbook
More informationCS 700: Quantitative Methods & Experimental Design in Computer Science
CS 700: Quantitative Methods & Experimental Design in Computer Science Sanjeev Setia Dept of Computer Science George Mason University Logistics Grade: 35% project, 25% Homework assignments 20% midterm,
More informationNCU EE -- DSP VLSI Design. Tsung-Han Tsai 1
NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using
More informationECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018
ECE 172 Digital Systems Chapter 12 Instruction Pipelining Herbert G. Mayer, PSU Status 7/20/2018 1 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for
More informationToday. ESE532: System-on-a-Chip Architecture. Why Care? Message. Scaling. Why Care: Custom SoC
ESE532: System-on-a-Chip Architecture Day 21: April 5, 2017 VLSI Scaling 1 Today VLSI Scaling Rules Effects Historical/predicted scaling Variations (cheating) Limits Note: gory equations! goal is to understand
More informationTechnical Report GIT-CERCS. Thermal Field Management for Many-core Processors
Technical Report GIT-CERCS Thermal Field Management for Many-core Processors Minki Cho, Nikhil Sathe, Sudhakar Yalamanchili and Saibal Mukhopadhyay School of Electrical and Computer Engineering Georgia
More informationDESIGN AND ANALYSIS OF A FULL ADDER USING VARIOUS REVERSIBLE GATES
DESIGN AND ANALYSIS OF A FULL ADDER USING VARIOUS REVERSIBLE GATES Sudhir Dakey Faculty,Department of E.C.E., MVSR Engineering College Abstract The goal of VLSI has remained unchanged since many years
More informationPerformance Evaluation of MPI on Weather and Hydrological Models
NCAR/RAL Performance Evaluation of MPI on Weather and Hydrological Models Alessandro Fanfarillo elfanfa@ucar.edu August 8th 2018 Cheyenne - NCAR Supercomputer Cheyenne is a 5.34-petaflops, high-performance
More informationLEADING THE EVOLUTION OF COMPUTE MARK KACHMAREK HPC STRATEGIC PLANNING MANAGER APRIL 17, 2018
LEADING THE EVOLUTION OF COMPUTE MARK KACHMAREK HPC STRATEGIC PLANNING MANAGER APRIL 17, 2018 INTEL S RESEARCH EFFORTS COMPONENTS RESEARCH INTEL LABS ENABLING MOORE S LAW DEVELOPING NOVEL INTEGRATION ENABLING
More informationNovel Devices and Circuits for Computing
Novel Devices and Circuits for Computing UCSB 594BB Winter 2013 Lecture 4: Resistive switching: Logic Class Outline Material Implication logic Stochastic computing Reconfigurable logic Material Implication
More informationMoores Law for DRAM. 2x increase in capacity every 18 months 2006: 4GB
MEMORY Moores Law for DRAM 2x increase in capacity every 18 months 2006: 4GB Corollary to Moores Law Cost / chip ~ constant (packaging) Cost / bit = 2X reduction / 18 months Current (2008) ~ 1 micro-cent
More informationPerformance of Computers. Performance of Computers. Defining Performance. Forecast
Performance of Computers Which computer is fastest? Not so simple scientific simulation - FP performance program development - Integer performance commercial work - I/O Performance of Computers Want to
More informationModeling and Analyzing NBTI in the Presence of Process Variation
Modeling and Analyzing NBTI in the Presence of Process Variation Taniya Siddiqua, Sudhanva Gurumurthi, Mircea R. Stan Dept. of Computer Science, Dept. of Electrical and Computer Engg., University of Virginia
More informationEEC 216 Lecture #2: Metrics and Logic Level Power Estimation. Rajeevan Amirtharajah University of California, Davis
EEC 216 Lecture #2: Metrics and Logic Level Power Estimation Rajeevan Amirtharajah University of California, Davis Announcements PS1 available online tonight R. Amirtharajah, EEC216 Winter 2008 2 Outline
More informationLast Lecture. Power Dissipation CMOS Scaling. EECS 141 S02 Lecture 8
EECS 141 S02 Lecture 8 Power Dissipation CMOS Scaling Last Lecture CMOS Inverter loading Switching Performance Evaluation Design optimization Inverter Sizing 1 Today CMOS Inverter power dissipation» Dynamic»
More informationPerformance, Power & Energy
Recall: Goal of this class Performance, Power & Energy ELE8106/ELE6102 Performance Reconfiguration Power/ Energy Spring 2010 Hayden Kwok-Hay So H. So, Sp10 Lecture 3 - ELE8106/6102 2 What is good performance?
More informationLecture 3, Performance
Lecture 3, Performance Repeating some definitions: CPI Clocks Per Instruction MHz megahertz, millions of cycles per second MIPS Millions of Instructions Per Second = MHz / CPI MOPS Millions of Operations
More informationLecture 15: Scaling & Economics
Lecture 15: Scaling & Economics Outline Scaling Transistors Interconnect Future Challenges Economics 2 Moore s Law Recall that Moore s Law has been driving CMOS [Moore65] Corollary: clock speeds have improved
More informationPower in Digital CMOS Circuits. Fruits of Scaling SpecInt 2000
Power in Digital CMOS Circuits Mark Horowitz Computer Systems Laboratory Stanford University horowitz@stanford.edu Copyright 2004 by Mark Horowitz MAH 1 Fruits of Scaling SpecInt 2000 1000.00 100.00 10.00
More informationCS-683: Advanced Computer Architecture Course Introduction
CS-683: Advanced Computer Architecture Course Introduction Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology
More informationAnton: A Specialized ASIC for Molecular Dynamics
Anton: A Specialized ASIC for Molecular Dynamics Martin M. Deneroff, David E. Shaw, Ron O. Dror, Jeffrey S. Kuskin, Richard H. Larson, John K. Salmon, and Cliff Young D. E. Shaw Research Marty.Deneroff@DEShawResearch.com
More informationCRYPTOGRAPHIC COMPUTING
CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,
More informationSystem Level Leakage Reduction Considering the Interdependence of Temperature and Leakage
System Level Leakage Reduction Considering the Interdependence of Temperature and Leakage Lei He, Weiping Liao and Mircea R. Stan EE Department, University of California, Los Angeles 90095 ECE Department,
More informationFloating Point Representation and Digital Logic. Lecture 11 CS301
Floating Point Representation and Digital Logic Lecture 11 CS301 Administrative Daily Review of today s lecture w Due tomorrow (10/4) at 8am Lab #3 due Friday (9/7) 1:29pm HW #5 assigned w Due Monday 10/8
More informationCSE493/593. Designing for Low Power
CSE493/593 Designing for Low Power Mary Jane Irwin [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.].1 Why Power Matters Packaging costs Power supply rail design Chip and system
More informationMoore s Law Technology Scaling and CMOS
Design Challenges in Digital High Performance Circuits Outline Manoj achdev Dept. of Electrical and Computer Engineering University of Waterloo Waterloo, Ontario, Canada Power truggle ummary Moore s Law
More informationCS 100: Parallel Computing
CS 100: Parallel Computing Chris Kauffman Week 12 Logistics Upcoming HW 5: Due Friday by 11:59pm HW 6: Up by Early Next Week Moore s Law: CPUs get faster Smaller transistors closer together Smaller transistors
More informationA High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem
A High Throughput FPGA-Based Implementation of the Lanczos Method for the Symmetric Extremal Eigenvalue Problem Abid Rafique, Nachiket Kapre, and George A. Constantinides Electrical and Electronic Engineering,
More informationSome thoughts about energy efficient application execution on NEC LX Series compute clusters
Some thoughts about energy efficient application execution on NEC LX Series compute clusters G. Wellein, G. Hager, J. Treibig, M. Wittmann Erlangen Regional Computing Center & Department of Computer Science
More informationPreviously. ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems. Today. Variation Types. Fabrication
ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems Previously Understand how to model transistor behavior Given that we know its parameters V dd, V th, t OX, C OX, W, L, N A Day
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationVariation-Resistant Dynamic Power Optimization for VLSI Circuits
Process-Variation Variation-Resistant Dynamic Power Optimization for VLSI Circuits Fei Hu Department of ECE Auburn University, AL 36849 Ph.D. Dissertation Committee: Dr. Vishwani D. Agrawal Dr. Foster
More informationEECS 427 Lecture 11: Power and Energy Reading: EECS 427 F09 Lecture Reminders
EECS 47 Lecture 11: Power and Energy Reading: 5.55 [Adapted from Irwin and Narayanan] 1 Reminders CAD5 is due Wednesday 10/8 You can submit it by Thursday 10/9 at noon Lecture on 11/ will be taught by
More informationScaling of MOS Circuits. 4. International Technology Roadmap for Semiconductors (ITRS) 6. Scaling factors for device parameters
1 Scaling of MOS Circuits CONTENTS 1. What is scaling?. Why scaling? 3. Figure(s) of Merit (FoM) for scaling 4. International Technology Roadmap for Semiconductors (ITRS) 5. Scaling models 6. Scaling factors
More informationCMPE12 - Notes chapter 1. Digital Logic. (Textbook Chapter 3)
CMPE12 - Notes chapter 1 Digital Logic (Textbook Chapter 3) Transistor: Building Block of Computers Microprocessors contain TONS of transistors Intel Montecito (2005): 1.72 billion Intel Pentium 4 (2000):
More informationNumbering Systems. Computational Platforms. Scaling and Round-off Noise. Special Purpose. here that is dedicated architecture
Computational Platforms Numbering Systems Basic Building Blocks Scaling and Round-off Noise Computational Platforms Viktor Öwall viktor.owall@eit.lth.seowall@eit lth Standard Processors or Special Purpose
More informationModern Computer Architecture
Modern Computer Architecture Lecture1 Fundamentals of Quantitative Design and Analysis (II) Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University 1.4 Trends in Technology Logic: transistor density 35%/year,
More informationDesigning Information Devices and Systems I Fall 2015 Anant Sahai, Ali Niknejad Homework 8. This homework is due October 26, 2015, at Noon.
EECS 16A Designing Information Devices and Systems I Fall 2015 Anant Sahai, Ali Niknejad Homework 8 This homework is due October 26, 2015, at Noon. 1. Nodal Analysis Or Superposition? (a) Solve for the
More informationCounters. We ll look at different kinds of counters and discuss how to build them
Counters We ll look at different kinds of counters and discuss how to build them These are not only examples of sequential analysis and design, but also real devices used in larger circuits 1 Introducing
More informationVARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects
VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects Smruti R. Sarangi, Brian Greskamp, Radu Teodorescu, Jun Nakano, Abhishekh Tiwarii and Josep Torrellas University of
More informationLecture 3, Performance
Repeating some definitions: Lecture 3, Performance CPI MHz MIPS MOPS Clocks Per Instruction megahertz, millions of cycles per second Millions of Instructions Per Second = MHz / CPI Millions of Operations
More informationImpact of Scaling on The Effectiveness of Dynamic Power Reduction Schemes
Impact of Scaling on The Effectiveness of Dynamic Power Reduction Schemes D. Duarte Intel Corporation david.e.duarte@intel.com N. Vijaykrishnan, M.J. Irwin, H-S Kim Department of CSE, Penn State University
More informationQualitative vs Quantitative metrics
Qualitative vs Quantitative metrics Quantitative: hard numbers, measurable Time, Energy, Space Signal-to-Noise, Frames-per-second, Memory Usage Money (?) Qualitative: feelings, opinions Complexity: Simple,
More informationToday. ESE534: Computer Organization. Why Care? Why Care. Scaling. ITRS Roadmap
ESE534: Computer Organization Day 5: September 19, 2016 VLSI Scaling 1 Today VLSI Scaling Rules Effects Historical/predicted scaling Variations (cheating) Limits Note: Day 5 and 6 most gory MOSFET equations
More information2.1. Unit 2. Digital Circuits (Logic)
2.1 Unit 2 Digital Circuits (Logic) 2.2 Moving from voltages to 1's and 0's ANALOG VS. DIGITAL volts volts 2.3 Analog signal Signal Types Continuous time signal where each voltage level has a unique meaning
More informationChapter 1. Binary Systems 1-1. Outline. ! Introductions. ! Number Base Conversions. ! Binary Arithmetic. ! Binary Codes. ! Binary Elements 1-2
Chapter 1 Binary Systems 1-1 Outline! Introductions! Number Base Conversions! Binary Arithmetic! Binary Codes! Binary Elements 1-2 3C Integration 傳輸與介面 IA Connecting 聲音與影像 Consumer Screen Phone Set Top
More informationMolecular Dynamics Simulations
MDGRAPE-3 chip: A 165- Gflops application-specific LSI for Molecular Dynamics Simulations Makoto Taiji High-Performance Biocomputing Research Team Genomic Sciences Center, RIKEN Molecular Dynamics Simulations
More informationSection 3: Combinational Logic Design. Department of Electrical Engineering, University of Waterloo. Combinational Logic
Section 3: Combinational Logic Design Major Topics Design Procedure Multilevel circuits Design with XOR gates Adders and Subtractors Binary parallel adder Decoders Encoders Multiplexers Programmed Logic
More informationLecture 5: Performance (Sequential) James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 5: Performance (Sequential) James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L05 S1, James C. Hoe, CMU/ECE/CALCM, 2018 18 447 S18 L05 S2, James C. Hoe, CMU/ECE/CALCM,
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationLecture 23. Dealing with Interconnect. Impact of Interconnect Parasitics
Lecture 23 Dealing with Interconnect Impact of Interconnect Parasitics Reduce Reliability Affect Performance Classes of Parasitics Capacitive Resistive Inductive 1 INTERCONNECT Dealing with Capacitance
More informationLecture 2: CMOS technology. Energy-aware computing
Energy-Aware Computing Lecture 2: CMOS technology Basic components Transistors Two types: NMOS, PMOS Wires (interconnect) Transistors as switches Gate Drain Source NMOS: When G is @ logic 1 (actually over
More informationSkew-Tolerant Circuit Design
Skew-Tolerant Circuit Design David Harris David_Harris@hmc.edu December, 2000 Harvey Mudd College Claremont, CA Outline Introduction Skew-Tolerant Circuits Traditional Domino Circuits Skew-Tolerant Domino
More informationPARADE: PARAmetric Delay Evaluation Under Process Variation * (Revised Version)
PARADE: PARAmetric Delay Evaluation Under Process Variation * (Revised Version) Xiang Lu, Zhuo Li, Wangqi Qiu, D. M. H. Walker, Weiping Shi Dept. of Electrical Engineering Dept. of Computer Science Texas
More informationLecture: Pipelining Basics
Lecture: Pipelining Basics Topics: Performance equations wrap-up, Basic pipelining implementation Video 1: What is pipelining? Video 2: Clocks and latches Video 3: An example 5-stage pipeline Video 4:
More informationVLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1
VLSI Design Adder Design [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1 Major Components of a Computer Processor Devices Control Memory Input Datapath
More informationLECTURE 28. Analyzing digital computation at a very low level! The Latch Pipelined Datapath Control Signals Concept of State
Today LECTURE 28 Analyzing digital computation at a very low level! The Latch Pipelined Datapath Control Signals Concept of State Time permitting, RC circuits (where we intentionally put in resistance
More informationACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS
ACCELERATED LEARNING OF GAUSSIAN PROCESS MODELS Bojan Musizza, Dejan Petelin, Juš Kocijan, Jožef Stefan Institute Jamova 39, Ljubljana, Slovenia University of Nova Gorica Vipavska 3, Nova Gorica, Slovenia
More informationProfessor Lee, Yong Surk. References. Topics Microprocessor & microcontroller. High Performance Microprocessor Architecture Overview
This lecture was mae by a generous contribution of C & S Technology corporation. (http://( http://www.cnstec.com) There is no copyright on this lecture. Processor Laboratory homepage, (http://( http://mpu.yonsei.ac.kr)
More informationWhere Does Power Go in CMOS?
Power Dissipation Where Does Power Go in CMOS? Dynamic Power Consumption Charging and Discharging Capacitors Short Circuit Currents Short Circuit Path between Supply Rails during Switching Leakage Leaking
More informationComputer Architecture
Lecture 2: Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture CPU Evolution What is? 2 Outline Measurements and metrics : Performance, Cost, Dependability, Power Guidelines
More informationDatapath Component Tradeoffs
Datapath Component Tradeoffs Faster Adders Previously we studied the ripple-carry adder. This design isn t feasible for larger adders due to the ripple delay. ʽ There are several methods that we could
More informationExascale Computing for Radio Astronomy: GPU or FPGA?
Exascale Computing for Radio Astronomy: GPU or FPGA? Kees van Berkel MPSoC 2016, Nara, Japan 2016, July 14 Radio Astronomy: Herculus A (a.k.a. 3C 348) optically invisible jets, one-and-a-half million light-years
More informationτ gd =Q/I=(CV)/I I d,sat =(µc OX /2)(W/L)(V gs -V TH ) 2 ESE534 Computer Organization Today At Issue Preclass 1 Energy and Delay Tradeoff
ESE534 Computer Organization Today Day 8: February 10, 2010 Energy, Power, Reliability Energy Tradeoffs? Voltage limits and leakage? Variations Transients Thermodynamics meets Information Theory (brief,
More informationEE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining
Slide 1 EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining Slide 2 Topics Clocking Clock Parameters Latch Types Requirements for reliable clocking Pipelining Optimal pipelining
More informationA CUDA Solver for Helmholtz Equation
Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College
More informationCMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design
CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19
More informationComputer Architecture ELEC2401 & ELEC3441
Last Time Pipeline Hazard Computer Architecture ELEC2401 & ELEC3441 Lecture 8 Pipelining (3) Dr. Hayden Kwok-Hay So Department of Electrical and Electronic Engineering Structural Hazard Hazard Control
More informationParallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)
Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems
More information