School of EECS Seoul National University

Similar documents
EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

VLSI Signal Processing

Chapter 8. Low-Power VLSI Design Methodology

CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues

L15: Custom and ASIC VLSI Integration

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

UNIVERSITY OF CALIFORNIA RIVERSIDE. Functional Partitioning for Low Power

EE241 - Spring 2001 Advanced Digital Integrated Circuits

PLA Minimization for Low Power VLSI Designs

Energy Delay Optimization

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Design for Manufacturability and Power Estimation. Physical issues verification (DSM)

Timing Issues. Digital Integrated Circuits A Design Perspective. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolić. January 2003

CMPEN 411 VLSI Digital Circuits Spring Lecture 14: Designing for Low Power

The Linear-Feedback Shift Register

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Announcements

Reducing power in using different technologies using FSM architecture

Lecture 9: Clocking, Clock Skew, Clock Jitter, Clock Distribution and some FM

Xarxes de distribució del senyal de. interferència electromagnètica, consum, soroll de conmutació.

EE115C Winter 2017 Digital Electronic Circuits. Lecture 19: Timing Analysis

EEC 216 Lecture #2: Metrics and Logic Level Power Estimation. Rajeevan Amirtharajah University of California, Davis

ECE 407 Computer Aided Design for Electronic Systems. Simulation. Instructor: Maria K. Michael. Overview

EE 466/586 VLSI Design. Partha Pande School of EECS Washington State University

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Datapath Component Tradeoffs

Low Power Design Methodologies and Techniques: An Overview

GMU, ECE 680 Physical VLSI Design 1

Automating RT-Level Operand Isolation to Minimize Power Consumption in Datapaths

Implementation of Clock Network Based on Clock Mesh

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

A Novel LUT Using Quaternary Logic

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

Retiming. delay elements in a circuit without affecting the input/output characteristics of the circuit.

Clock Strategy. VLSI System Design NCKUEE-KJLEE

ELEC516 Digital VLSI System Design and Design Automation (spring, 2010) Assignment 4 Reference solution

FPGA Implementation of a Predictive Controller

Lecture 8-1. Low Power Design

Managing Physical Design Issues in ASIC Toolflows Complex Digital Systems Christopher Batten February 21, 2006

Chapter 5 CMOS Logic Gate Design

VLSI System Design Part V : High-Level Synthesis(2) Oct Feb.2007

TAU' Low-power CMOS clock drivers. Mircea R. Stan, Wayne P. Burleson.

Stack Sizing for Optimal Current Drivability in Subthreshold Circuits REFERENCES

AN IMPROVED LOW LATENCY SYSTOLIC STRUCTURED GALOIS FIELD MULTIPLIER

EE241 - Spring 2006 Advanced Digital Integrated Circuits

Introduction to CMOS VLSI Design (E158) Lecture 20: Low Power Design

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

COVER SHEET: Problem#: Points

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

Issues on Timing and Clocking

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

Mark Redekopp, All rights reserved. Lecture 1 Slides. Intro Number Systems Logic Functions

CSE477 VLSI Digital Circuits Fall Lecture 20: Adder Design

CSE370: Introduction to Digital Design

Design and Implementation of Efficient Modulo 2 n +1 Adder

Clock-Gating and Its Application to Low Power Design of Sequential Circuits

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1

Pipelining and Parallel Processing

Transformation Techniques for Real Time High Speed Implementation of Nonlinear Algorithms

MODULE 5 Chapter 7. Clocked Storage Elements

Skew-Tolerant Circuit Design

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes

KINGS COLLEGE OF ENGINEERING DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING QUESTION BANK

Power-Saving Scheduling for Weakly Dynamic Voltage Scaling Devices

Novel Bit Adder Using Arithmetic Logic Unit of QCA Technology

CMP 338: Third Class

Area-Time Optimal Adder with Relative Placement Generator

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

Lecture 23. Dealing with Interconnect. Impact of Interconnect Parasitics

Design of Arithmetic Logic Unit (ALU) using Modified QCA Adder

MODULE III PHYSICAL DESIGN ISSUES

Next, we check the race condition to see if the circuit will work properly. Note that the minimum logic delay is a single sum.

A Low-Error Statistical Fixed-Width Multiplier and Its Applications

Status. Embedded System Design and Synthesis. Power and temperature Definitions. Acoustic phonons. Optic phonons

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

Logic BIST. Sungho Kang Yonsei University

Performance, Power & Energy

NTE74HC299 Integrated Circuit TTL High Speed CMOS, 8 Bit Universal Shift Register with 3 State Output

Digital Integrated Circuits A Design Perspective

Clock Skew Scheduling in the Presence of Heavily Gated Clock Networks

Semiconductor Memories

Last Lecture. Power Dissipation CMOS Scaling. EECS 141 S02 Lecture 8

Implementation Of Digital Fir Filter Using Improved Table Look Up Scheme For Residue Number System

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Proposal to Improve Data Format Conversions for a Hybrid Number System Processor

Power Dissipation. Where Does Power Go in CMOS?

Lecture 34: Portable Systems Technology Background Professor Randy H. Katz Computer Science 252 Fall 1995

Digital System Clocking: High-Performance and Low-Power Aspects. Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M.

Special Nodes for Interface

ESE570 Spring University of Pennsylvania Department of Electrical and System Engineering Digital Integrated Cicruits AND VLSI Fundamentals

Comparative analysis of QCA adders

Table of Content. Chapter 11 Dedicated Microprocessors Page 1 of 25

A Novel Ternary Content-Addressable Memory (TCAM) Design Using Reversible Logic

Analysis and Synthesis of Weighted-Sum Functions

Vidyalankar S.E. Sem. III [CMPN] Digital Logic Design and Analysis Prelim Question Paper Solution

Transcription:

4!4 07$ 8902808 3 School of EECS Seoul National University

Introduction Low power design 3974/:.9 43 Increasing demand on performance and integrity of VLSI circuits Popularity of portable devices Low power design at higher levels of abstraction Faster design space exploration Wider view Higher power reduction Less cost increase

Opportunities for power reduction at every level of abstraction 3974/:.9 43 System Architecture Register-Transfer Gate / Logic Transistor Physical 50-90% 40-70% 30-50% 20-30% 10-20% 5-10% algorithms, HW-SW tradeoffs, supply voltage scaling scheduling, resource binding, operand swapping clock gating, operand isolation, pre-computation, dynamic operand interchange, FSM encoding, bus encoding technology mapping, don t care optimization, de-glitching transistor sizing interconnect capacitance reduction, clock-tree synthesis

Power dissipation in CMOS circuits Dynamic power dissipation (dominant) Short-circuit power dissipation Leakage power dissipation Dynamic power dissipation 2 P = C V f dynamic eff dd clk 2 = α C V f phy dd clk 3974/:.9 43 C eff f clk : effective (switched) capacitance : clock frequency α : switching activity : supply voltage V dd C phy : physical capacitance

Physical/Transistor/Gate-Level Design Interconnect capacitance reduction! 8., %7,38 8947,90 0;0 08 3 Signals having high switching activity are assigned short wires Clock-tree synthesis Clock is a major source of dynamic power dissipation Clock of 200MHz DEC Alpha chip drives 3250pF load, 3.3V supply voltage => 7W (30% of the total power) Clock skews must be controlled within tolerable values Single driver scheme Distributed buffers scheme (preferred)

Transistor sizing! 8., %7,38 8947,90 0;0 08 3 Compute the slack at each gate Sizes of the transistors in the gate are reduced until the slack becomes zero Reduced size => reduced capacitance => reduced power Critical path is not affected Path balancing => reduced glitch => reduced power

Technology mapping! 8., %7,38 8947,90 0;0 08 3 V. Tiwari, P. Ashar, and S. Malik, Technology mapping for low power, Proc. of Design Automation Conference, pp. 74-79, June 1993 Hide nodes with high switching activity inside the gates where they drive smaller load capacitances H L H L H L L H L L

De-glitching! 8., %7,38 8947,90 0;0 08 3 Glitch consumes 10% - 40% of the dynamic power in typical combinational logic circuits B 3 A 3 B 2 A 2 B 1 A 1 B 0 A 0 C 4 C 3 C 2 C 1 FA FA FA FA C 0 S 3 S 2 S 1 S 0 Path balancing Add unit-delay buffers selectively such that the delays of all paths can be made equal

RTL Design Clock gating #% 08 3 Disable clocks to idle part of the circuit Saves clock power and power consumed by registered value change 0 MUX register data register combinational logic 1 control F/F clock

Operand isolation #% 08 3 Exploit output don t cares of large circuit blocks in unused clock cycles Insert latches before the circuit blocks to reduce circuit activity 0 MUX register adder latch multiplier register combinational logic 1 control F/F clock

Pre-computation #% 08 3 Pre-compute the results of subsequent pipeline stages register 0 combinational logic register combinational logic MUX 1 register Pre-computation logic F/F clock

Comparator example #% 08 3 register 0 combinational logic register A>B MUX 1 register A[MSB] B[MSB] F/F

Dynamic operand interchange #% 08 3 T. Ahn and K. Choi, Dynamic operand interchange for low power, Electronics Letters, pp. 2118-2120, Dec. 1997 Switching activity of 16-bit array multiplier 1800 1600 Switching activity 1400 1200 1000 800 600 400 200 0 Sign change

Architecture #% 08 3 I 1 (k1) I 2 (k1) Register1 Register2 I 1 (k) I 2 (k) Estimator Change DFF Execution Unit

FSM encoding #% 08 3 C.-Y. Tsui, M. Pedram, C.-A. Chen, and A.M. Despain, Low power state assignment targeting two- and multilevel logic implementations, Proc. of Int l Conf. on Computer-Aided Design, pp. 82-87, Nov. 1994 Low power state encoding of FSM Reduce switching activity on state bit lines Cost function: S S i j S p H( S, S ) ij i j where p ij is the transition probability from state S i to state S j and H(S i to state S j ) is the Hamming distance between the encodings of the two states Also reduce power consumed in the combinational logic P outputs reg comb. logic P inputs P comb P reg

Bus encoding Reduce number of transitions on high-capacitance, multi-bit buses by encoding the signals Examples Bus-invert coding #% 08 3 M.R. Stan, W.P. Burleson, Bus-invert coding for low-power I/O, IEEE Trans. on VLSI Systems, Vol. 3, No. 1, pp. 49-58, Mar. 1995 Gray coding C. L. Su, Saving power in the control path of embedded processors, IEEE Design and Test of Computers, Vol. 11, No. 4, pp. 24-30, Winter 1994 high-capacitance 00110001 01001100 6 toggles 00110001 0 10110011 1 3 toggles

Architecture-Level Design Supply voltage reduction Quadratic effect of voltage scaling on power 2 P = C V f dynamic eff dd 5V --> 3.3V => 60% power reduction Supply voltage reduction => increased latency clk 7. 90.9:70 0;0 08 3 energy delay 1 5 Vdd 1 5 Vdd

7. 90.9:70 0;0 08 3 Use of optimizing transformation for meeting throughput constraint even with the voltage reduction Concurrency increasing transformation (increased hardware cost ) => critical path reduction Loop unrolling, pipelining, retiming, algebraic transformation, module selection A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.W. Brodersen, Optimizing power using transformation, IEEE Tr. on CAD/ICAS, pp. 12-31, Jan. 1995 Y N =AY N-1 X N --> Y N =A 2 Y N-2 AX N-1 X N Y N-1 =AY N-2 X N-1 Y N-1 =AY N-2 X N-1 X N Y N X N Y N 2D D A A 2 A Y N-2 A X N-1 Y N-1

7. 90.9:70 0;0 08 3 X N Y N X N Y N 2D C eff =1 Voltage=5 Throughput=1 Power=25 A D C eff =1.5 A Voltage=3.7 Throughput=1 Power=20 A 2 A Y N-2 X N-1 Y N-1 X N D Y N C eff =1.5 Voltage=2.9 Throughput=1 Power=12.5 A 2D A 2 A Y N-2 X N-1 D Y N-1

Reduction of effective capacitance Physical capacitance reduction 7. 90.9:70 0;0 08 3 Buses may consume 5-40% of the total power Reducing access to global resource thru clustering R. Mehra, L.M. Guerra, and J.M. Rabaey, Low power architectural synthesis and the impact of exploiting locality, Journal of VLSI Signal Processing, 1996 Global data transfers Local data transfers Adder1 Adder2

Hyperedge models 7. 90.9:70 0;0 08 3 a a a a 1/3 1/3 1/3 1/3 2/3 2/3 b d b 1/3 1/3 d b 1/3 1/6 d b 2/3 1/3 d c 1/3 c 1/3 1/6 c 1/6 1/3 c 1/3 partitioning based on spectral method minimize z=1/2σσ ΣΣ(x i - x j ) 2 A ij subject to x T x=1 => non-trivial solution is the 2nd smallest eigenvector of the Laplacian of the graph Q=D-A

Finding good partitions 7. 90.9:70 0;0 08 3-1 1-1 1

Evaluation of the partitions area : distribution graph power : (number of data transfers) x (area) 7. 90.9:70 0;0 08 3 a b Cluster 1 a b Cluster 1 Cluster 2 c d out1 Cluster 2 c d out1 e out2 e out2 a b a c e e c d b d

Switching activity reduction Increasing data correlation thru operand sharing 7. 90.9:70 0;0 08 3 Operations sharing an operand also share resource Actively increase the chance of operand sharing thru loop interchange, operand reordering, loop unrolling, loop folding Loop interchange for i for j for k for l a=f(k, l) b=f(i, j, k, l) c(i, j) = a - b for k for l a=f(k, l) for i for j b=f(i, j, k, l) c(i, j) = a - b

Operand reordering 4th order LMS adaptive filter 7. 90.9:70 0;0 08 3 x t h 0 x t-1 h 1 x t-2 h 2 x t-3 h 3 Iter Reordering A M0 M1 M2 M3 i (x t, h 0 ) (x t-1, h 1 ) (x t-2, h 2 ) (x t-3, h 3 ) i1 (x t1, h 0 ) (x t, h 1 ) (x t-1, h 2 ) (x t-2, h 3 ) i2 (x t2, h 0 ) (x t1, h 1 ) (x t, h 2 ) (x t-1, h 3 ) i3 (x t3, h 0 ) (x t2, h 1 ) (x t1, h 2 ) (x t, h 3 ) Iter Reordering B M0 M1 M2 M3 i (x t, h 0 ) (x t-1, h 1 ) (x t-2, h 2 ) (x t-3, h 3 ) i1 (x t, h 1 ) (x t-1, h 2 ) (x t-2, h 3 ) (x t1, h 0 ) i2 (x t, h 2 ) (x t-1, h 3 ) (x t2, h 0 ) (x t1, h 1 ) i3 (x t, h 3 ) (x t3, h 0 ) (x t2, h 1 ) (x t1, h 2 )

7. 90.9:70 0;0 08 3 Loop unrolling E. Musoll and J. Cortadella, High-level synthesis techniques for reducing the activity of functional units, Proc. of Int l Symp. on Low Power Design, pp. 99-104, Nov. a0 a1 a2 a3 a4 1995 Low-pass image filter b0 b1 b2 b3 b4 c0 c1 c2 c3 c4 for i=0 to M for j=0 to N out=a[i-1][j-1] / a0 / a[i-1][j] / a1 / a[i-1][j1] / a2 / a[i][j-1] / b0 / a[i][j] / b1 / a[i][j1] / b2 / a[i1][j-1] / c0 / a[i1][j] / c1 / a[i1][j1] / c2 / a0 a1 a2 b0 b1 b2 c0 c1 c2 out0 out1 out2

7. 90.9:70 0;0 08 3 Loop folding D. Kim and K. Choi, Power-conscious high level synthesis using loop folding, Proc. of Design Automation Conference, pp. 441-445, June 1997 Fold two consecutive iterations in such a way that h(i) x[n-i] for y[n] and h(i1) x[(n1)-(i1)] for y[n1] are computed consecutively in one shared multiplier yn [ ] = hxn [ i] i i Significant effects on DSP applications such as filters y[ n] = m h x[ n i] m i y[ n 1] = m h x[( n 1) ( i 1)] ( i 1 ) o o m (1...N-1) m0[n-1] = h0x[n] m1[n-1] = h1x[n-1] out[n-1] = m0[n-1]m1[n-1] loop folding m1[0] = h1x[0] (1...N-2) m0[n-1] = h0x[n] m1[n] = h1x[n] out[n-1] =m0[n-1]m1[n-1] m0[n-2] = h0x[n-1] out[n-2] =m0[n-2]m1[n-2]

7. 90.9:70 0;0 08 3 Binding A. Raghunathan and N. K. Jha, Behavioral synthesis for low power, Proc. of Int l Conf. on Computer Design, pp. 318-322, Oct. 1994 Binding based on edge weighted compatibility graph weight = (1-W t )W c where W t is transition activity and W c is capacitance weight Functional unit and register sharing Controller optimization to reduce power consumed during idle time of functional units use don t cares select the mux port with least transition activity disable loading into registers

Scheduling and binding 7. 90.9:70 0;0 08 3 E. Musoll and J. Cortadella, Scheduling and resource binding for low power, Proc. of Int l Symp. on System Synthesis, pp. 104-109, Apr. 1995 Resource sharing by sibling operations List scheduling is used Operations sharing the same operand (operations in an operand sharing set) are scheduled in control steps as close as possible (higher priority is given) n1 n2 n1 n2 n3 n4 n5 n4 n3 idle n5 traditional modified After functional unit binding, bind registers such that useless power is reduced (no change of inputs to idle functional unit) A few sibling operations available in normal circuits

7. 90.9:70 0;0 08 3 Scheduling and binding A. Raghunathan and N. K. Jha, An iterative improvement algorithm for low power data path synthesis, Proc. of Int l Conf. on Computer-Aided Design, pp. 597-602, Nov. 1995 Thorough power minimization including voltage scaling, clock selection, and module selection as well as scheduling and binding Iterative improvement Pruning for efficiency of the algorithm supply voltage pruning: prune V dd if the lower bound of power at V dd is greater than the best solution seen clock period pruning: T clk i = T s for some integer i => prune other T clk T clk1 < T clk2 and delay t /T clk1 = delay t /T clk2 for all functional unit template t => prune T clk2

7. 90.9:70 0;0 08 3 SCALP (CDFG G, Sample Period T s, Library L) { V min =estimate_min_volt(g, T s, L); V max =5V; best_dp=null; cur_dp=null; for(v dd =V min ; V dd V max ; V dd =V dd V) { if(v dd _prune(g,cur_dp,v dd )) continue; for(csteps=max_csteps; csteps min_csteps; csteps=csteps-1){ if(clk_prune(g, L, csteps)) continue; cur_dp=initial_solution(g, L, V dd, csteps); iterative_improvement(g, L, cur_dp); if(power_est(cur_dp) < power_est(best_dp)) best_dp=cur_dp; } iterative_improvement(g, L, DP) { } do { } for(i=1; i max_moves; i=i1) { } gain i = generate_moves(g, L, DP); append gain i to gain_list; } find subsequence, gain 1 gain k in gain_list so that G=Σgain i is maximized; if(g>0) { accept moves 1 k; } } until(g<0); }

7. 90.9:70 0;0 08 3 Scheduling and binding D. Shin and K. Choi, Low power high level synthesis by increasing data correlation, Proc. of Int l Symp. on Low Power Electronics and Design, pp. 62-67, Aug. 1997 Simultaneous scheduling and binding in such a way that input data correlation between consecutive inputs increase (Modified) list scheduling is used for efficiency DBT (Dual Bit Type) method for estimating switched capacitance in execution units P.E. Landman and J.M. Rabaey, Architectural power analysis: the dual bit type method, IEEE Tr. on VLSI Systems, pp. 173-187, June 1995 n1 n2 n1 n2 n3 n4 n3 n5 n5 n4 traditional list scheduling modified list scheduling

System-Level Design System-level power optimization $ 8902 0;0 08 3 Power estimation/simulation System specification Low-power HW-SW partitioning Low-power compilation Memory mapping Instruction compaction ASIC Processor Core VSP Power-conscious scheduling OSPM ➀➁➂ ➃➄➅ ➆➇➈ 0 Interface Circuits Codec On-chip Instruction Memory On-chip Data Memory Off-chip Memory (RAM, ROM) Bus coding Interface exploration

Power consumption in processors Buses consume significant power Capacitive load at I/O of a chip is three orders of magnitude larger than that of internal nodes Example $ 8902 0;0 08 3 D. Liu and C. Svensson, Power consumption estimation in CMOS VLSI chips, IEEE JSSC, pp. 663-670, June 1994 40 35 Alpha 21064 Intel 80386 Power consumption (%) 30 25 20 15 10 5 0 Gates Clock Wire Offchip Memory

$ 8902 0;0 08 3 Power consumption in portable embedded systems Power consumption in processors becomes more significant as increasing amount of functionality is realized through software Example T. Truman, T. Pering, R. Doering, and R. Brodersen, The InfoPad multimedia terminal: a portable device for wireless information access, IEEE Transactions on Computers, pp. 1073-1087, October 1998 Power consumption (%) 40 35 30 25 20 15 10 5 0 Radio Processor I/O LCD DC/DC etc.

$ 8902 0;0 08 3 Low power design issues L. Benini and G. De Micheli, System-level power optimization: techniques and tools, Proc. of Int l Symp. on Low Power Electronics and Design, pp. 288-293, Aug. 1999 Memory optimization Memory hierarchy, cache size, memory size (related with software transformation), data transfer and placement E.g. large cache size low cache miss high speed and low power, but large capacitance Hardware-software partitioning Power consumption in hardware, software, and interface Instruction-level power optimization Dedicated low-power instruction set, instruction transformation, Variable-voltage Dynamically variable voltage supply Effective Dynamic power management Low-power sleep state Predictive, stochastic Standard (OnNow, ACPI) Interface power minimization Bus encoding