A Random Walk from Async to Sync. Paul Cunningham & Steev Wilcox

Thank You Ivan

In the Beginning March 2002

Azuro Day 1. Some money in the bank from angel investors. 2 employees. Small office rented from Cambridge University Computer Lab. Vision to automatically compile mainstream synchronous designs into low-power asynchronous equivalents. So why is asynchronous design low power?

Azuro Day 100: after several meetings with chip companies, VCs and lots of whiteboard time. (Diagram label: Device.)

[Figure: schematics of a level-sensitive AND gate (4 transistors) and an edge-sensitive AND gate (12 transistors, 3X), each with inputs A, B and output Z.]

Request/acknowledge handshake with a delay D, which must satisfy Gmax < D. The delay costs power to implement and is hard to sign off across different process corners, voltages and temperatures.

[Figure: chain of stages linked by request and acknowledge signals.]

Clock with period P and pulse width W. Hold check: Gmin > W. Setup check: Gmax < P.

Pulse-flopped synchronous design. Clock from an off-chip crystal oscillator, with period P and pulse width W. Hold check: Gmin > W. Setup check: Gmax < P.

Clock from an off-chip crystal oscillator, with period P and pulse width W. How do you make sure the pulse width is predictable across multiple process corners, temperatures and voltages? How do you shift values in and out during chip test without violating the hold check? Hold check: Gmin > W. Setup check: Gmax < P.

Clock with a small W. Hold check: Gmin > W. Setup check: Gmax < P.

D-type flip-flop (DFF). Hold check: Gmin > W. Setup check: Gmax < P. Very small and predictable W: a small local circuit which can be carefully crafted and characterized across multiple process corners, voltages and temperatures.
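The two checks on these slides can be tried out with a few lines of Python. This is only a sketch of the simplified inequalities as stated (Gmin > W, Gmax < P); the function name and all the picosecond values are illustrative assumptions, not figures from the talk.

```python
def check_timing(g_min, g_max, period, window):
    """Return (hold_ok, setup_ok) for the simplified checks on the slides.

    Hold:  g_min > window  (the fastest path must outlast the window W)
    Setup: g_max < period  (the slowest path must finish within the period P)
    """
    return g_min > window, g_max < period


# Illustrative numbers in picoseconds (assumptions, not from the talk).
P = 1000
G_MIN, G_MAX = 80, 900

pulse = check_timing(G_MIN, G_MAX, P, window=150)  # wide clock pulse
dff = check_timing(G_MIN, G_MAX, P, window=20)     # very small W, as in a DFF

print("pulse-flopped (hold_ok, setup_ok):", pulse)
print("DFF           (hold_ok, setup_ok):", dff)
```

With these assumed numbers the same logic fails the hold check under the wide pulse and passes it once W is small, which is the argument the slides make for the DFF.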

Azuro Day 1. Some money in the bank from angel investors. 2 employees. Small office rented from Cambridge University Computer Lab. Vision to automatically compile mainstream synchronous designs into low-power asynchronous equivalents. Vision to bring the low-power benefits of asynchronous design to the mainstream synchronous community.

These delays don't all have to be the same. No data, no latching (consume power only when needed). [Figure: chain of stages linked by request and acknowledge signals.]

Power Saving Ideas. Power in the clock is wasted if the register did not capture new data: clock gating. Power in the datapath is wasted if the result of computation is not captured: guarding. Example: if ( ) b = a + . If the register will not accept data, don't send data (guarding) and don't clock the register (clock gating).
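As a rough illustration of the clock gating idea, the sketch below counts how many clock edges reach a register with and without gating. Counting clock events is only a stand-in for clock power, and the function name and enable trace are hypothetical.

```python
def count_clock_events(enables, use_gating):
    """Count the clock edges that reach a register over a run of cycles.

    enables[i] is True when the register captures new data in cycle i.
    Without gating the register is clocked in every cycle.
    With gating it is clocked only in cycles where enable is True.
    """
    if use_gating:
        return sum(1 for en in enables if en)
    return len(enables)


# Hypothetical enable trace: new data arrives in 2 cycles out of 8.
enables = [True, False, False, False, True, False, False, False]

n_ungated = count_clock_events(enables, use_gating=False)
n_gated = count_clock_events(enables, use_gating=True)
saved = 1 - n_gated / n_ungated

print(f"ungated: {n_ungated}, gated: {n_gated}, clock events saved: {saved:.0%}")
```

The saving depends entirely on how often the enable is low, which is why the later slides stress activity information.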

First Real Idea: Guard Flops. Saw that a flop can be decomposed. So create a Guard Flop: control accepting data, control routing of data. (Diagram: blocks labeled T.)

First Real Idea: Guard Flops. Gained two US patents. Reception was that it was still too disruptive: needed to create new standard cells and new methodologies. Still too async? Needed to come closer to the mainstream.

Focus on Clock Gating. Clock gates are like AND gates with one rising-edge-sensitive input. [Figure: clock gate with Clock in and Enable inputs and Gated clock out; waveforms for Clock in, Enable in and Clock out.]

Focus on Clock Gating. Clock gating in 2002 was primitive: instantiated directly from RTL expressions only, e.g. if (a) b <= c; Three ways to improve: find and exploit hierarchy; find new register-level expressions not in the netlist; find sub-register-level expressions.

Hierarchical Gating. [Figure: tree of clock gates (CG) with enables A, B, C, A&B and A&C.] Finding expressions requires BDD-based netlist traversal; it is rarely as simple as observing an AND gate in the netlist! Multiple levels of hierarchy are usually available. Heuristics limit the levels to 2 or 3 to avoid clock impact.

Symbolic Gating. Use BDDs to decompose the existing D-side function into a gating function for the whole register (driving the CG) and a next-state function that assumes the gating function has been pulled out. May need to avoid decomposition: if timing is tight, might need to pull out the CG late in the day and put back the old (green) function.

Structural Gating. Generalization of previously published work. On its own, this is useless: the clock cap of the CG is as much as or larger than that of the flop! But it can be combined with NAND, OR, and across subsets of registers.

Structural Gating. The register is split into a number of clusters, depending on simulation information (SAIF, VCD). Some parts of a register may stay at logic 0 more than logic 1: use an OR gate for comparison. Key is to make the whole thing activity-driven.
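To see why splitting a register into clusters can pay off, the sketch below measures how often each cluster would need a clock. It assumes, for illustration only, that a cluster is clocked in any cycle where at least one of its bits changes; the trace is a made-up stand-in for the SAIF or VCD activity data the slide mentions, and none of the names come from the Azuro tools.

```python
def clocked_fraction(trace, cluster):
    """Fraction of cycle transitions in which any bit of `cluster` changes.

    trace   : list of register values, one tuple of bits per cycle
    cluster : indices of the bits that would share one clock gate
    """
    changes = 0
    for prev, cur in zip(trace, trace[1:]):
        if any(prev[i] != cur[i] for i in cluster):
            changes += 1
    return changes / (len(trace) - 1)


# Hypothetical 4-bit register trace: bits 0 and 1 toggle often,
# bits 2 and 3 change rarely.
trace = [
    (0, 0, 0, 0),
    (1, 0, 0, 0),
    (0, 1, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (1, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
]

whole = clocked_fraction(trace, [0, 1, 2, 3])  # one gate for the whole register
busy = clocked_fraction(trace, [0, 1])         # cluster of active bits
quiet = clocked_fraction(trace, [2, 3])        # cluster of mostly idle bits

print(whole, busy, quiet)
```

In this made-up trace a single gate over the whole register never gets to switch the clock off, while a separate gate for the quiet bits is idle most of the time.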

First Product: PowerCentric. VICTORY!

First Product: PowerCentric. [Chart: NASDAQ stock price, 2007 to 2010.] Azuro ready to launch!

First Product: PowerCentric. [Chart: NASDAQ stock price, 2007 to 2010.] Houston, we have a problem. Launch abort! Hunker down for phase 2.

These delays don't all have to be the same. No data, no latching (consume power only when needed). [Figure: chain of stages linked by request and acknowledge signals.]

2004, in Azuro R&D. Could now gate clocks far more efficiently than any competitor. However, we couldn't estimate the effect on the clock tree. Needed to generate our own clock tree synthesis algorithm. Naturally, we used our async experience.

Azuro CTS. Bottom-up clustering approach: cluster lower drivers together first. More flexible, less synchronous in a way. Ideally suited to deep clock gating. Also ~10% lower power. But has other benefits.

Azuro CTS allowed us to: productize useful skew, manage on-chip variation, tame clock complexity.

Clock Complexity. Do these flops need to be balanced? Innovated an LP-based scheduling algorithm to solve complex clock tree constraints.

On-Chip Variation. OCV derates hidden until CTS.

OCV + Clock Complexity = Clock Timing Gap

Useful Skew. Why does the clock need to arrive at every point at the same time? [Figure: clock CK driving three stages: short stage 600ps, long stage 1200ps, short stage 600ps.] Cycle time: 1200ps. Think async: what would we do?

Useful Skew. Advance this signal 200ps; delay this signal 200ps. Short stage: 600 + 200ps. Long stage: 1200 - 400ps. Short stage: 600 + 200ps. Cycle time now: 800ps. 50% frequency improvement!
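The arithmetic on this slide can be checked with a short Python sketch. It considers setup only and assumes that the clock at the flop launching the long stage is the one advanced by 200ps and the clock at the flop capturing it is the one delayed by 200ps; the function name and the list layout are illustrative choices.

```python
def min_period(stage_delays, clock_offsets):
    """Smallest cycle time that meets setup on every stage (hold is ignored).

    stage_delays[i]  : logic delay between flop i and flop i+1, in ps
    clock_offsets[i] : clock arrival at flop i relative to nominal, in ps
                       (negative = advanced, positive = delayed)
    Stage i needs: period + offset[i+1] - offset[i] >= delay[i].
    """
    return max(
        delay - (clock_offsets[i + 1] - clock_offsets[i])
        for i, delay in enumerate(stage_delays)
    )


stages = [600, 1200, 600]  # short, long, short (ps), as on the slide

balanced = min_period(stages, [0, 0, 0, 0])     # every flop clocked together
skewed = min_period(stages, [0, -200, 200, 0])  # advance 200ps, delay 200ps
improvement = balanced / skewed - 1

print(balanced, skewed, f"{improvement:.0%}")
```

The long stage borrows 200ps from each neighbor, so all three stages end up needing the same 800ps.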

Useful Skew. Has been tried before: EDA vendors had useful skew engines, and there are numerous academic papers on this. Focus was either too lowbrow or too highbrow. Traditional approach was adding individual buffers: no ability to decrease delay, and buffers cost power. Academic approach was solving chip-wide LP-style problems: too hard for real chips (1M+ instances). No consideration of complex clocks. No consideration of on-chip variation.

Azuro CTS. Manage on-chip variation. Productize useful skew? Tame clock complexity.

Traditional Design: Worst Path Closure. Speed limited by max D. Clock Concurrent Design: Worst Chain Closure. Speed limited by avg D. [Figure: a chain of paths D1, D2, D3.]
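A small sketch can show where "max D" and "avg D" come from. It makes several simplifying assumptions that are not on the slide: setup only (hold ignored), a simple linear chain, no limit on how far a clock can be skewed, and both end flops of the chain kept at the nominal clock arrival. All function names are illustrative.

```python
def path_closure_period(delays):
    """Cycle time when every flop gets the clock together: the worst path."""
    return max(delays)


def chain_closure_period(delays):
    """Setup-only cycle time bound when clocks inside the chain may be skewed.

    Summing period + offset[i+1] - offset[i] >= delay[i] along a chain whose
    two end flops have offset 0 gives len(delays) * period >= sum(delays).
    """
    return sum(delays) / len(delays)


def chain_offsets(delays):
    """Clock offsets (ps) at each flop that achieve the chain bound."""
    period = chain_closure_period(delays)
    offsets = [0.0]
    for delay in delays:
        offsets.append(offsets[-1] + delay - period)
    return offsets


chain = [600, 1200, 600]  # D1, D2, D3 in ps

print(path_closure_period(chain))   # limited by max D
print(chain_closure_period(chain))  # limited by avg D
print(chain_offsets(chain))         # skew at each flop along the chain
```

Under these assumptions the offsets telescope back to zero at the far end of the chain, which is why the average, rather than the maximum, sets the limit.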

Slacks need to be propagated-clock slacks, not ideal-clock slacks! Paths D1, D2, D3 with slack1, slack2, slack3: slack1 = slack2 = slack3?

Clock Concurrent Optimization (CCOpt). Skew clocks and optimize logic concurrently, always using propagated-clock slacks. [Figure: clock and path D3.]

Azuro CTS. Manage on-chip variation. Productize useful skew. Azuro Rubix (CCOpt). Tame clock complexity.

Thank You