Saving Power in the Data Center. Marios Papaefthymiou

Similar documents
EE241 - Spring 2003 Advanced Digital Integrated Circuits

Where Does Power Go in CMOS?

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

Lecture 9: Clocking, Clock Skew, Clock Jitter, Clock Distribution and some FM

EE115C Winter 2017 Digital Electronic Circuits. Lecture 6: Power Consumption

Spiral 2 7. Capacitance, Delay and Sizing. Mark Redekopp

Recall the essence of data transmission: How Computers Work Lecture 11

Last Lecture. Power Dissipation CMOS Scaling. EECS 141 S02 Lecture 8

Lecture 23. CMOS Logic Gates and Digital VLSI I

ECE321 Electronics I

EE 466/586 VLSI Design. Partha Pande School of EECS Washington State University

Control gates as building blocks for reversible computers

Lecture 7 Circuit Delay, Area and Power

! Memory. " RAM Memory. ! Cell size accounts for most of memory array size. ! 6T SRAM Cell. " Used in most commercial chips

Serial Parallel Multiplier Design in Quantum-dot Cellular Automata

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Realization of 2:4 reversible decoder and its applications

Itanium TM Processor Clock Design

Lecture 6 Power Zhuo Feng. Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 2010

Lecture 21: Packaging, Power, & Clock


14 Gb/s AC Coupled Receiver in 90 nm CMOS. Masum Hossain & Tony Chan Carusone University of Toronto

CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

CIS 371 Computer Organization and Design

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

CSE140L: Components and Design Techniques for Digital Systems Lab. Power Consumption in Digital Circuits. Pietro Mercati

EE371 - Advanced VLSI Circuit Design

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

A Novel LUT Using Quaternary Logic

ENERGY AND TIME CONSTANTS IN RC CIRCUITS By: Iwana Loveu Student No Lab Section: 0003 Date: February 8, 2004

Lecture 15: Scaling & Economics

A Novel Ternary Content-Addressable Memory (TCAM) Design Using Reversible Logic

PHYS225 Lecture 9. Electronic Circuits

Introduction to CMOS VLSI Design (E158) Lecture 20: Low Power Design

Amdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then:

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory

CMOS Digital Integrated Circuits Lec 13 Semiconductor Memories

EE241 - Spring 2006 Advanced Digital Integrated Circuits

Lecture 25. Dealing with Interconnect and Timing. Digital Integrated Circuits Interconnect

Circuit for Revisable Quantum Multiplier Implementation of Adders with Reversible Logic 1 KONDADASULA VEDA NAGA SAI SRI, 2 M.

Power in Digital CMOS Circuits. Fruits of Scaling SpecInt 2000

VLSI Signal Processing

Dynamic operation 20

EECS 427 Lecture 11: Power and Energy Reading: EECS 427 F09 Lecture Reminders

Xarxes de distribució del senyal de. interferència electromagnètica, consum, soroll de conmutació.

Semiconductor Memories

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

Power Minimization of Full Adder Using Reversible Logic

April 2004 AS7C3256A

TU Wien. Energy Efficiency. H. Kopetz 26/11/2009 Model of a Real-Time System

Fig. 1 CMOS Transistor Circuits (a) Inverter Out = NOT In, (b) NOR-gate C = NOT (A or B)

EECS240 Spring Today s Lecture. Lecture 2: CMOS Technology and Passive Devices. Lingkai Kong EECS. EE240 CMOS Technology

SEMICONDUCTOR MEMORIES

Stack Sizing for Optimal Current Drivability in Subthreshold Circuits REFERENCES

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

Designing Information Devices and Systems I Fall 2015 Anant Sahai, Ali Niknejad Homework 8. This homework is due October 26, 2015, at Noon.

Low-power adiabatic 9T static random access memory

RC, RL, and LCR Circuits

Successive Approximation ADCs

arxiv: v1 [cs.ar] 11 Dec 2017

CMOS Inverter. Performance Scaling

Basic Logic Gate Realization using Quantum Dot Cellular Automata based Reversible Universal Gate

Chapter 8. Low-Power VLSI Design Methodology

L15: Custom and ASIC VLSI Integration

Digital Electronics Final Examination. Part A

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

EE241 - Spring 2000 Advanced Digital Integrated Circuits. References

PERFORMANCE IMPROVEMENT OF REVERSIBLE LOGIC ADDER

Digital System Clocking: High-Performance and Low-Power Aspects. Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M.

DESIGN AND ANALYSIS OF A FULL ADDER USING VARIOUS REVERSIBLE GATES

Digital Integrated Circuits A Design Perspective. Semiconductor. Memories. Memories

EE115C Winter 2017 Digital Electronic Circuits. Lecture 19: Timing Analysis

EE141- Fall 2002 Lecture 27. Memory EE141. Announcements. We finished all the labs No homework this week Projects are due next Tuesday 9am EE141

Design and Implementation of Carry Adders Using Adiabatic and Reversible Logic Gates

PARADE: PARAmetric Delay Evaluation Under Process Variation *

TAU' Low-power CMOS clock drivers. Mircea R. Stan, Wayne P. Burleson.

CARNEGIE MELLON UNIVERSITY DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING DIGITAL INTEGRATED CIRCUITS FALL 2002

Very Large Scale Integration (VLSI)

ESE570 Spring University of Pennsylvania Department of Electrical and System Engineering Digital Integrated Cicruits AND VLSI Fundamentals

Nyquist-Rate D/A Converters. D/A Converter Basics.

Clock Strategy. VLSI System Design NCKUEE-KJLEE

CMOS Transistors, Gates, and Wires

MOSIS REPORT. Spring MOSIS Report 1. MOSIS Report 2. MOSIS Report 3

ESE570 Spring University of Pennsylvania Department of Electrical and System Engineering Digital Integrated Cicruits AND VLSI Fundamentals

Capacitance - 1. The parallel plate capacitor. Capacitance: is a measure of the charge stored on each plate for a given voltage such that Q=CV

Reducing power in using different technologies using FSM architecture

CSE140L: Components and Design Techniques for Digital Systems Lab. FSMs. Instructor: Mohsen Imani. Slides from Tajana Simunic Rosing

Lecture 1: Gate Delay Models

Literature Review on Multiplier Accumulation Unit by Using Hybrid Adder

Reversible Implementation of Ternary Content Addressable Memory (TCAM) Interface with SRAM

LECTURE 28. Analyzing digital computation at a very low level! The Latch Pipelined Datapath Control Signals Concept of State

Alternating Current Circuits. Home Work Solutions

Basic RL and RC Circuits R-L TRANSIENTS: STORAGE CYCLE. Engineering Collage Electrical Engineering Dep. Dr. Ibrahim Aljubouri

DESIGN AND IMPLEMENTATION OF EFFICIENT HIGH SPEED VEDIC MULTIPLIER USING REVERSIBLE GATES

CMOS Cross Section. EECS240 Spring Dimensions. Today s Lecture. Why Talk About Passives? EE240 Process

EE141-Fall 2011 Digital Integrated Circuits

Performance Enhancement of Reversible Binary to Gray Code Converter Circuit using Feynman gate

NAME SID EE42/100 Spring 2013 Final Exam 1

ECE 574 Cluster Computing Lecture 20

Transcription:

Saving Power in the Data Center Marios Papaefthymiou

Energy Use in Data Centers Several MW for large facilities Responsible for sizable fraction of US electricity consumption 2006: 1.5% or 61 billion kwh [source: EPA, 2007] 2013: 2.5% or 91 billion kwh [source: NDRC, 2014] 2014: 1.8% or 70 billion kwh [source: LBNL, 2016]

Where Does All This Energy Go? Source: Google Power Usage Effectiveness (PUE) = ( IT + Overhead ) / IT Global average ~1.7 [source: Uptime Institute 2014 Data Center Survey] Industry leaders pushing PUE below 1.1

IT Energy Server Maximum power consumption Server utilization (pushing 50% in hyperscale data centers) Ability to scale server power with utilization (power proportionality) > 80% of IT energy Storage SSD at 6W/disk Increases in HDD efficiency to continue, approaching SSD Energy per year projected <10 billion kwh beyond 2020 Network Increases in W/port to continue Energy per year projected <2 billion kwh beyond 2020

Energy Consumption Y2K: No Problem Manageable current supply requirements (10s of Amps) Thermal limits safely away o Chips not melting Cloud / mobile revolution still at its infancy o Limited motivation to lower energy requirements Voltage scaling kept power requirements under check

Performance Scaling 1990 2000 Performance (GOPS) 10 1 0.1 0.01 0.001 1990 1995 2000

Voltage Scaling 1990 2000 Supply Voltage (V) 5 1.5um 4 3 2 1 1um 0.75um 0.5um 0.25um 1990 1995 2000

Energy Consumption = f (Voltage) Constant supply voltage V +

Energy Consumption = f (Voltage) ½ CV 2 consumed on device resistance Constant supply voltage V + ½ CV 2 stored in capacitive load C

Energy Consumption = f (Voltage) Constant supply voltage V + ½ CV 2 stored in capacitor C is consumed on device resistor Total energy consumed = ½ CV 2 + ½ CV 2 = CV 2

Voltage Scaling Post-2000 Supply Voltage (V) 5 4 3 2 1 1990 1995 2000 2005 2010

Performance Scaling Post-2000 Performance (GOPS) 100 10 1 0.1 0.01 Chips reached their thermal limits 0.001 1990 1995 2000 2005 2010

Now What? 2000-present: Let s turn off whatever is not used o Clock gating o Power gating o Multi-core architectures o Software-controlled power management Limited ability to scale down server power with utilization: Deeper power down longer idle-toactive delayed response times At the end of the day, we are basically stuck!

Fundamental Physical Limitation Constant supply voltage V + Total energy consumed = CV 2

Let s Modify the Voltage Supply Gradually transitioning supply voltage Load C V time

Energy Consumption for Gradual Charging (RC/T) CV 2 consumed on R of device Gradually transitioning supply voltage ½ CV 2 stored in load C V T time

Energy Consumption for Gradual Discharging (RC/T) CV 2 consumed on R of device Gradually transitioning supply voltage ½ CV 2 stored in C returns to supply V 2T time

Putting it All Together Total energy consumed to charge/discharge ~= 2 ( RC / T ) CV 2 Gradual recycling of charge saves energy (assuming RC/T < 2) In fact, the slower the better

Intrinsic Energy Requirements & Reversibility of Computation Fundamental question: What are the intrinsic energy requirements of computation? Main result (Landauer, 1961): Energy requirements can asymptotically approach zero, provided the computation is reversible (i.e., retains all information required to reproduce the initial state of the computing machine) Argument based on statistical thermodynamics: Loss of information change in system entropy energy converted to heat (i.e., wasted)

Reversible Turing Machines and Universal Gates Logically reversible Turing Machines (Bennett, 1973) o Compute just like a conventional TM, but keep intermediate results o Output final result o Reverse operation, disposing of intermediate results and returning the machine to its original state Reversible universal gates (Fredkin, Toffoli, 1982) o Embed function into larger space to yield 1-1 mapping between inputs and outputs Feynman Lectures on Computation, T. Hey and R. Allen eds., 1996

A Universal Reversible Gate A B C A B C A = A If A = 1, then B = B and C = C If A = 0, then B = C and C = B

Early Prototypes ( 90s) Reversible pipelines + split-level charge-recovery logic: Pendulum computer (Knight et al., 1995-2000) Variety of irreversible charge-recovery circuit families AC computers (Athas et al., 1995-2000) o Focus on simple reversible function Gradual charge/discharge through inductors or capacitor ladders Issues o High overheads: Too many bits/gates o High complexity: Too many wires o Slow operation

Pop Quiz Give the simplest interesting reversible function. ( interesting is defined as used in every digital system and consumes a lot of power. )

Pop Quiz Give the simplest interesting reversible function. ( interesting is defined as used in every digital system and consumes a lot of power. ) Identity function a.k.a. F(x) = x clock (in synchronous digital systems)

Chip Designs with Resonant Clock Mesh GHz clock CICC 06 Distribution Networks IBM Cell BE ISSCC 09 2GHz FPU A-SSCC 10 IBM Power8 ISSCC 14 AMD Piledriver ISSCC 12 IBM System z ISSCC 15 DWT ASIC ISLPED 03 GHz FIR VLSI 07 2003 CICC 07 2015 200MHz 250nm area < 1mm 2 silicon prototypes ARM 926EJ-S ESSCIRC 09 AMD Steamroller ISSCC 15 5+ GHz 22nm area > 20mm 2 high-volume commercial CPUs

Clock Mesh Typical solution for multi-ghz processors Pre-mesh Clock Distribution Post-gater Clock Distribution Mesh Drivers Clock Mesh Pros All-metal mesh shorts clock buffer outputs very low skews Mesh isolates local timing simplifies late-design ECOs Cons Large capacitance of clock mesh leads to high power consumption

Provides all the performance benefits of a clock mesh AND reduces power On-chip inductors + Clock Mesh Resonant Clock Mesh Pros All-metal mesh shorts clock buffer outputs very low skews Mesh isolates local timing simplifies late-design ECOs Post-gater Clock Distribution Cons Large capacitance of clock mesh leads to high power consumption

4+ GHz x86 Piledriver Core 5 rows with ~20 inductors each Sathe et al., ISSCC 12 High-volume 64-bit x86 core in 32nm, 11 metal layers 100 on-chip inductors, 0.7nH to 1.2nH Up to 30% clock power savings 5% to 10% reduction in total chip power, depending on workload profile

Simplified Model of Dual-Mode Clock Network

Global Clock Organization and Distribution Large variation of capacitance across the chip Palette of 5 inductors (0.7nH to 1.2nH) Integer linear program to select inductor values and size clock wires to minimize clock skew

Inductor Design ~100 um

Simulated Clock Waveforms

Clock Power Savings vs Frequency

Resonant Mega-Mesh for IBM z13 TM Shan et al., VLSI Symp 2015 545mm 2 (80% of chip area), encompassing 8 cores and L3 Single mesh enables 2x clock frequency in L3 reduced memory latency 22nm high-k CMOS SOI, 17 metal layers Operating range: 4.5GHz to 5.5GHz 50% savings in final-stage clock mesh power 8% savings in total chip power

How About Logic? Total energy consumed to charge/discharge ~= 2 ( RC / T ) CV 2 Gradual recycling of charge saves energy (assuming RC/T < 2) In fact, the slower the better If operation is too slow, then result is not interesting. True for all-metal clock networks, but is it true for logic?

13.8µW Binaural Dual-Microphone ANSI S1.11 Filter Bank for Hearing Aids Wu et al., ISSCC 2017 10x reduction in energy consumption per input in 65nm over published state of the art in 40nm

Conclusion Energy-recycling can save significant amounts of power in high-end server chips (5% to 10% of total chip power) without sacrificing performance Successfully deployed in clock distribution networks of highvolume multi-ghz server processors (AMD, IBM) Significant potential for reducing energy consumption in logic Labor intensive custom logic design Increased latency due to micropipelining Large capacitors require large inductors

Acknowledgments Alex Ishii, Cyclos co-founder My U-M students: Suhwan Kim (Seoul National U.) Conrad Ziesler (Apple) Joohee Kim (Cyclos) Juang-Ying Chueh (TSMC) Visvesh Sathe (U. Washington) Jerry Kao (GlobalFoundries) Wei-Hsiang Ma (Intel) Tai-Chuan Ou (Apple) ARO, DARPA, NSF