L16: Power Dissipation in Digital Systems 1
Problem #1: Power Dissipation/Heat Power (Watts) 100000 10000 1000 100 10 1 0.1 4004 80088080 8085 808686 386 486 Pentium proc 18KW 5KW 1.5KW 500W 1971 1974 1978 1985 199 000 004 008 Year Power Density (W/cm) 10000 1000 100 10 1 4004 8008 8080 8086 Rocket Nozzle Nuclear Reactor Hot Plate 8085 86 386 486 Sun s Surface P6 Pentium proc 1970 1980 1990 000 010 Year Courtesy Intel (S. Borkar) How do you cool these chips?? heat sink chip
Problem #: Energy Consumption Nominal Capacity (Watt-hours/lb) 50 Rechargable Lithium 40 Ni-Metal Hydride 30 0 Nickel-Cadmium 10 0 65 70 75 80 85 90 95 Year (from Jon Eager, Gates Inc., S. Watanabe, Sony Inc.) The Energy Problem 7.5 cm 3 AA battery Alkaline: ~10,000J What can One Joule of energy do? Battery (40+ lbs) Mow your lawn for 1 ms Operate a processor for ~ 7s No Moore s law for batteries Today: Understand where power goes and ways to manage it Send a 1 Megabyte file over 80.11b 3
Dynamic Energy Dissipation V DD Charging E 0 1 = C L V DD Discharging V DD i DD E cap = 1/C L V DD R P R P IN =0 E diss, RP = 1/C L V DD IN =1 E diss,rn =1/C L V DD R N C L R N C L P = C L V DD f clk 4
The Transition Activity Factor α 0 >1 Current Input 00 Next Input 00 Output Transition 1 > 1 A B Z 00 01 1 > 1 00 00 01 01 01 01 10 10 11 00 01 10 11 00 1 > 1 1 > 0 1 > 1 1 > 1 1 > 1 1 > 0 1 > 1 Assume inputs (A,B) arrive at f and are uniformly distributed What is the average power dissipation? 10 01 1 > 1 10 10 10 11 1 > 1 1 > 0 α 0 >1 = 3/16 11 00 0 > 1 11 01 0 > 1 11 11 10 11 0 > 1 0 > 0 P = α 0 >1 C L V DD f 5
Junction (Silicon) Temperature Simple Scenario Silicon T j -T a =R θja P D R θja is the thermal resistance between silicon and Ambient P D Realistic Scenario Silicon Case Sink T J T C T S T A T J R θjc T J T C P D R θja R θcs T S T A R θsa T j =T a + R θja P D T A Make this as low as possible R θca = R θcs +R θsa is minimized by facilitating heat transfer (bolt case to extended metal surface heat sink) 6
Intel Pentium 4 Thermal Guidelines Pentium 4 @ 3.06 GHz dissipates 81.8W! Maximum T C = 69 C R CA < 0.3 C/W for 50 C ambient Typical chips dissipate 0.5-1W (cheap packages without forced air cooling) Temp ( o C) Cache 70 C Execution core 10 o C Integer & FP ALUs Courtesy of Intel (Ram Krishnamurthy) 7
Power Reduction Strategies P = α 0 >1 C L V DD f Reduce Transition Activity or Switching Events Reduce Capacitance (e.g., keep wires short) Reduce Power Supply Voltage Frequency is typically fixed by the application, though this can be adjusted to control power Optimize at all levels of design hierarchy 8
Clock Gating is a Good Idea! Clock gating reduces activity and is the most common low-power technique used today Global Clock Enable_Adder Adder Off Adder Clock + Multiplier On X Enable_Multiplier Multiplier Clock 100 s of different clocks in a microprocessor Clock Gating Reduces Energy, does it reduce Power? 9
Does your GHz Processor run at a GHz? Processor Chip Activity Control Thermal Sensor Note that there is a difference between average and peak power On-chip thermal sensor (diode based), measures the silicon temperature If the silicon junction gets too hot (say 15 C), then the activity is reduced (e.g., reduce clock rate or use clock gating) Use of Thermal Feedback 10
Power Supply Resonance L board L package R grid Board decap On-die decap Courtesy of Motorola (David Blaauw) Switching currents Can write a Virus to Activate Power Supply Resonance! 00Mhz Design 11
Number Representation: Two s s Complement vs. Sign Magnitude Two s complement Sign-Magnitude -4-6 -5-7 +0 1111 0000 1110 0001 +1 1101 0010 1100 0011 + +3-3 1011 0100 +4-1010 1001-1 1000 0111 0110 0101 +6 +5-0 +7 Consider a 16 bit bus where inputs toggles between +1 and 1 (i.e., a small noise input) Which representation is more energy efficient? 1
Bus Coding to Reduce Activity D Q Majority Function Extra bit to indicated if the bus is inverted invert N Input Output Data Bus [Stan94] 13
Time Sharing is a Bad Idea Time Sharing Increases Switching Activity 14
Not just a 6-16 1 Issue: Cool Software??? MEMORY address CPU address 16 a[0] a[1] a[] a[3] 0111111100000000 0111111100000001 0111111100000010 0111111100000011 b[0] b[1] b[] b[3] 1000000000000000 1000000000000001 1000000000000010 1000000000000011 float a [56], b[56]; float pi= 3.14; for (i = 0; i < 55; i++) { a[i] = sin(pi * i /56); b[i] = cos(pi * i /56); } 51(8)++4+8+16+3+64+18+56 = 4607 bit transitions float a [56], b[56]; float pi= 3.14; for (i = 0; i < 55; i++) {a[i] = sin(pi * i /56);} for (i = 0; i < 55; i++) {b[i] = cos(pi * i /56);} (8)+(+4+8+16+3+64+18+56) = 1030 transitions 15
Glitching Transitions Chain Topology A B Tree Topology A B C D + C + + + D + + (A+B) + (C+D) (((A+B) + C)+D) Balancing paths reduces glitching transitions Structures such as multipliers have lot of glitching transitions Keeping logic depths short (e.g., pipelining) reduces glitching 16
Reduce Supply Voltage : But is it Free? V DD V DD G + V DD - t =0+ K V S ( V DD V T D ) C L IN OUT Delay = C L V i D S = k C L V DD ( V DD DD V T ) ( V DD V V T ) 1 V DD V DD from V to 1V, energy by x4, delay x 17
Transistors Are Free (What do you do with a Billion Transistors?) f=1ghz V DD =V IN f = 500Mhz V DD =1V IN IN f = 500Mhz V DD =1V X X X OUT SELECT OUT P serial = C mult f P parallel = (C mult 1 f /) = P serial /4 Trade Area for Low Power 18
Algorithmic Workload Compare Current Image... Receiver just updates...to Previous Image 19 Frequency of Occurrence 0.06 0.04 0.0 0.00 0 500 1000 1500 000 Number of IDCTs per Frame Exploit Time Varying Algorithmic Workload To Vary the Power Supply Voltage
Dynamic Voltage Scaling (DVS) Fixed Power Supply ACTIVE IDLE Variable Power Supply ACTIVE E FIXED = ½ C V DD 1.0 E VARIABLE = ½ C (V DD /) = E FIXED / 4 Normalized Energy 0.8 0.6 0.4 0. Fixed Supply Variable Supply 0 0 0. 0.4 0.6 0.8 1.0 Normalized Workload [Gutnik97] 0
DVS on a Processor Digitally adjustable DC-DC converter powers SA-1110 core 3.6V 5 Controller V out SA-1110 µos Control µos selects appropriate clock frequency based on workload and latency constraints 1
Hardware vs. Software Flexibility 0.5nJ/Op DSP 1nJ/Op Embedded Processor 0.1-1pJ/Op FPGA Direct Mapped Hardware Courtesy of R. Brodersen, J. Rabaey, TI, ARM/StrongARM Energy/Operation
Energy Efficiency of Software Processor (StrongARM-1100) [A. Sinha, DAC] FPGA (Xilinx) CLB CLB CLB CLB 45 40 35 [Montanaro, JSSC 96] I/O CLB 9% 5% [Kusse 98, UCB] Power (%) 30 5 0 15 10 Clock 1% 65% Interconnect 5 0 Cache Control GCLK EBOX I/O,PLL Software Energy Dissipation has Large Overhead 3
Trends: Leakage and Power Gating V DD Switching (computing) E = CV DD C V DD E = V DD I 0 10 0 1 C Leakage (standby) 10 -V T /S In today s 65nm CMOS Technology : 30-50% of power is leakage! Total Energy/Switching Energy Duty Cycle (%) Low V T devices are leaky - Use a High V T device is used to gate leakage current Sleep 4
Next Generation Low-Power Digital: Sub-Threshold Operation Strong Inversion Operation: fast, power-hungry V DD = 0.18V 10 0 Normalized I D 10 10 4 Subthreshold Operation: slow, minimum energy operation Data Memory Control logic 10 6 Butterfly Datapath Twiddle ROMs 10 8 0 0. 0.4 0.6 0.8 1 Normalized V GS Exploit Sub-threshold Operation (V DD < V T ) for Sensor Circuits 5
Trends: Energy Scavenging MEMS Generator Power Harvesting Shoes Jose Mur Miranda/ Jeff Lang Vibration-to-Electric Conversion ~ 10µW Joe Paradiso (Media Lab) After 3-6 steps, it provides 3 ma for 0.5 sec ~10mW 6