EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

Similar documents
Designing Sequential Logic Circuits

Final presentations May 8, 1-5pm, BWRC Final reports due May 7, 8pm Final exam, Monday, May :30pm, 241 Cory

Timing Issues. Digital Integrated Circuits A Design Perspective. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolić. January 2003

Lecture 9: Sequential Logic Circuits. Reading: CH 7

GMU, ECE 680 Physical VLSI Design 1

EE241 - Spring 2007 Advanced Digital Integrated Circuits. Announcements

EE115C Winter 2017 Digital Electronic Circuits. Lecture 19: Timing Analysis

Digital Integrated Circuits A Design Perspective

The Linear-Feedback Shift Register

Skew-Tolerant Circuit Design

CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues

Xarxes de distribució del senyal de. interferència electromagnètica, consum, soroll de conmutació.

Lecture 9: Clocking, Clock Skew, Clock Jitter, Clock Distribution and some FM

Jin-Fu Li Advanced Reliable Systems (ARES) Lab. Department of Electrical Engineering. Jungli, Taiwan

UNIVERSITY OF CALIFORNIA

L4: Sequential Building Blocks (Flip-flops, Latches and Registers)

Hold Time Illustrations

Integrated Circuits & Systems

EECS 427 Lecture 15: Timing, Latches, and Registers Reading: Chapter 7. EECS 427 F09 Lecture Reminders

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

UNIVERSITY OF CALIFORNIA, BERKELEY College of Engineering Department of Electrical Engineering and Computer Sciences

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

L4: Sequential Building Blocks (Flip-flops, Latches and Registers)

Digital System Clocking: High-Performance and Low-Power Aspects. Vojin G. Oklobdzija, Vladimir M. Stojanovic, Dejan M. Markovic, Nikola M.

Clock Strategy. VLSI System Design NCKUEE-KJLEE

UMBC. At the system level, DFT includes boundary scan and analog test bus. The DFT techniques discussed focus on improving testability of SAFs.

Lecture 27: Latches. Final presentations May 8, 1-5pm, BWRC Final reports due May 7 Final exam, Monday, May :30pm, 241 Cory

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

MODULE 5 Chapter 7. Clocked Storage Elements

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ΗΜΥ 307 ΨΗΦΙΑΚΑ ΟΛΟΚΛΗΡΩΜΕΝΑ ΚΥΚΛΩΜΑΤΑ Εαρινό Εξάμηνο 2018

Lecture: Pipelining Basics

EE371 - Advanced VLSI Circuit Design

EE141Microelettronica. CMOS Logic

EE241 - Spring 2006 Advanced Digital Integrated Circuits

Lecture Outline. ESE 570: Digital Integrated Circuits and VLSI Fundamentals. Total Power. Energy and Power Optimization. Worksheet Problem 1

Structural Delay Testing Under Restricted Scan of Latch-based Pipelines with Time Borrowing

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Data AND. Output AND. An Analysis of Pipeline Clocking. J. E. Smith. March 19, 1990

Problem Set 9 Solutions

Chapter 5 CMOS Logic Gate Design

Energy Delay Optimization

COMP 103. Lecture 16. Dynamic Logic

VLSI Signal Processing

5. Sequential Logic x Computation Structures Part 1 Digital Circuits. Copyright 2015 MIT EECS

Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic. November Digital Integrated Circuits 2nd Sequential Circuits

THE UNIVERSITY OF MICHIGAN. Faster Static Timing Analysis via Bus Compression

Hw 6 due Thursday, Nov 3, 5pm No lab this week

Chapter 3. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 3 <1>

Next, we check the race condition to see if the circuit will work properly. Note that the minimum logic delay is a single sum.

Digital Integrated Circuits A Design Perspective

Lecture 13: Sequential Circuits, FSM

ELCT201: DIGITAL LOGIC DESIGN

Digital Integrated Circuits A Design Perspective

Digital Integrated Circuits A Design Perspective

Performance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Digital Integrated Circuits A Design Perspective

CMPEN 411. Spring Lecture 18: Static Sequential Circuits

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Announcements

Homework 2 due on Wednesday Quiz #2 on Wednesday Midterm project report due next Week (4 pages)

Computer Architecture ELEC2401 & ELEC3441

9/18/2008 GMU, ECE 680 Physical VLSI Design

EE241 - Spring 2001 Advanced Digital Integrated Circuits

EECS 427 Lecture 14: Timing Readings: EECS 427 F09 Lecture Reminders

School of EECS Seoul National University

Sequential Logic Design: Controllers

ALU, Latches and Flip-Flops

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

EECS 151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: Nick Weaver & John Wawrzynek. Lecture 10 EE141

Managing Physical Design Issues in ASIC Toolflows Complex Digital Systems Christopher Batten February 21, 2006

Lecture 4. Adders. Computer Systems Laboratory Stanford University

Topics. Dynamic CMOS Sequential Design Memory and Control. John A. Chandy Dept. of Electrical and Computer Engineering University of Connecticut

Issues on Timing and Clocking

Lecture 13: Sequential Circuits, FSM

Methodology to Achieve Higher Tolerance to Delay Variations in Synchronous Circuits

Reducing Delay Uncertainty in Deeply Scaled Integrated Circuits Using Interdependent Timing Constraints

Digital Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

A glance on the analytical model of power-performance performance trade-off in VLSI microprocessor design. Sapienza University of Rome

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

Design for Manufacturability and Power Estimation. Physical issues verification (DSM)

Introduction to CMOS VLSI Design Lecture 1: Introduction

ΗΜΥ 307 ΨΗΦΙΑΚΑ ΟΛΟΚΛΗΡΩΜΕΝΑ ΚΥΚΛΩΜΑΤΑ Εαρινό Εξάμηνο 2018

Chapter 3. Chapter 3 :: Topics. Introduction. Sequential Circuits

Itanium TM Processor Clock Design


EE141- Spring 2007 Digital Integrated Circuits

Constrained Clock Shifting for Field Programmable Gate Arrays

Lecture 8-1. Low Power Design

Sequential Logic Worksheet

Fundamentals of Computer Systems

Chapter 13. Clocked Circuits SEQUENTIAL VS. COMBINATIONAL CMOS TG LATCHES, FLIP FLOPS. Baker Ch. 13 Clocked Circuits. Introduction to VLSI

Stop Watch (System Controller Approach)

Synchronizers, Arbiters, GALS and Metastability

Digital Integrated Circuits Designing Combinational Logic Circuits. Fuyuzhuo

EET 310 Flip-Flops 11/17/2011 1

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

GMU, ECE 680 Physical VLSI Design

Transcription:

Slide 1 EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

Slide 2 Topics Clocking Clock Parameters Latch Types Requirements for reliable clocking Pipelining Optimal pipelining Pipeline partitioning Asynchronous and self timed logic Wave Pipelining and low overhead clocking

Slide 3 Clock Parameters Parameters P max - maximum delay thru logic P min - minimum delay thru logic t - cycle time t w - clock pulse width t g - data setup time t d - register output delay C - total clocking overhead t t d P max t g t w t = P max + C

Slide 4 Latch Types Cycle time depends on clock parameters and underlying latch edge- vs level-triggered single- vs dual-rank

Slide 5 Clock Overhead Trigger/Rank Single Dual Level t g +t d 2(t g +t d ) Edge t g +t d t g +t d +t w Note: Parameter values vary with technology and implementation. E.g., set-up time t g will generally be less for level-trigger than edge-trigger latch in same technology

Slide 6 Reliable Clocking t w > minimum pulse width and t w > hold time t > P max + clock overhead t w < P min for transparent latches can be avoided by edge triggered dual rank registers multiphase clock

Slide 7 Multiphase Clock Alternate stages use different clock phases clock phases don t overlap CLK1 CLK2 R1 logic R2 logic R3 CLK1 CLK2

Slide 8 Latching Summary Edge-Triggered, Single-Rank + Relatively simple to generate and distribute single clock -- Hazard for fast paths if P min < clock skew but easy, inexpensive to pad gates for very short paths -- Cannot borrow time across latches Edge-Triggered, Dual-Rank + Safest, hazard-free clock -- Biggest clock overhead Level-Triggered, Single Rank (Pulsed Latch) + Minimum clock overhead + Few, simple latches => reduces area and power -- Hazard for fast paths -- Difficult to distribute, control narrow pulses Level-Triggered, Dual-Rank + Relatively simple to generate and distribute clock + Simple to avoid hazards with non-overlapped phases + Can borrow time across latches -- Larger clock overhead than single-rank

Slide 9 Skew Skew is uncertainty in the clock arrival time two types of skew depends on t...skew = k, a fraction of P max where P max is the segment delay that determines t large segments may have longer delay and skew Part of skew varies with L eff, like segment delay independent of t...skew = δ Can relate to clock routing, jitter from environmental conditions, other effects unrelated to segment delay effect of skew = k(p max ) + δ skew range adds directly to the clock overhead

Slide 10 Clocking Summary Overhead depends on clocking scheme and latch implementation Growing importance for microprocessors at frequencies > 300 MHz Tradeoffs must be made carefully considering circuit, microarchitecture, CAD, system Common approach Distribute single clock to all blocks in balanced H-Tree Gate clock at each block for power savings Generate multiphase clocks for local circuit timing Other approaches Distribute single clock, but do not gate Use clock for both phases with TSPC latch Distribute single clock, generate pulses locally for pulse latches (?) Resulting parameter C is used in pipeline tradeoffs Clock Skew has 2 components Variable component, factor k We will use this to stretch Pmax Constant (worst-case) factor δ We will fold this into clock overhead C And we have not even touched the issue of asynchronous design

Slide 11 Optimum Pipelining Let the total instruction execution without pipelining and associated clock overhead be T In a pipelined processor let S be the number of segments in the pipeline Then S - 1 is the maximum number of cycles lost due to a pipeline break Let b = probability of a break Let C = clock overhead including fixed clock skew

Slide 12 Optimum Pipelining P1 P2 P3 P4 T P max i = delay of the i th functional unit suppose T = Σ i P max i ; without clock overhead S = number of pipeline segments C = clock overhead T/S >= max (P max i ) [quantization]

Slide 13 t = T/S + kt/s + C = (1+k)T/S + C Performance = 1/ (1+(S - 1)b) [IPC] Thruput = G = Performance / t [IPS] G = ( 1 / (1+(S - 1)b) x ( 1 / ((1 + k)(t/s)) + C ) Finding S for dg/ ds = 0 We get S opt

Slide 14 Optimum Pipelining Sopt = (1 b)(1 + k)t bc

Slide 15 Finding S opt Estimate b and k...use k = 0.05 if unknown b from instruction traces Find T and C from design details feasibility studies Find S opt Example Clock Overhead % b k T (ns) C (ns) Sopt G (MIPS) f (MHZ) CPI 0.1 0.05 15 0.5 16.8 270 697 2.58 34.8% 0.1 0.05 15 1 11.9 206 431 2.09 43.1% 0.2 0.05 15 0.5 11.2 173 525 3.04 26.3% 0.2 0.05 15 1 7.9 140 335 2.39 33.5% Clock Overhead = C/ T

Slide 16 Quantization and Other Considerations Now, consider the quantization effects T cannot be arbitrarily divided into segments segments are defined by functional unit delays some segments cannot be divided; others can be divided only at particular boundaries some functional ops are atomic (usually) can t have cycle fractionally cross a function unit boundary S opt ignores cost (area) of extra pipeline stages the above create quantization loss therefore: S opt is the largest S to be used and the smallest cycle to be considered is t = (1+k)T/S opt + C

Quantization t i = execution time of i th unit or block T = total instruction execution time w/o pipeline S = no. pipeline stages C = clock overhead t m = t - C = time per stage for logic T = Σ i t i = time for instruction execution w/o pipeline S t = S(t m + C) = (ignore variable skew) S t - T = S(t m + C) - T (pipeline length overhead) = [St m - Σ i t i ] + SC [quantization overhead + clock overhead] Vary # pipe stages => opposing effects of quantization/clock overhead See Study 2.2 page 78 Michael Flynn EE382 Winter/99 Slide 17

Slide 18 Microprocessor Design Practice (Part I) Need to consider variation of b with S Increasing S results in additional and longer pipe delays Start design target at maximum frequency for ALU+bypass in single cycle Critical to keep ALU+bypass in single clock for performance on general integer code Tune frequency to minimize load delay through cache Try to fit rest of logic into pipeline at target frequency Simplify critical paths, sacrificing IPC modestly if necessary Optimize paths with slack time to save area, power, effort

Slide 19 Microprocessor Design Practice (Part II) Tradeoff around this design target Optimal in-order integer pipe for RISC has 5-10 stages Performance tradeoff is relatively flat across this range Deeper for out-of-order or complex ISA (like Intel Architecture) Use longer pipeline (higher frequency) if FP/multimedia vector performance are important and clock overhead is low Else use shorter pipeline especially if area/power/effort are critical to market success

Slide 20 Advanced techniques Asynchronous or self timed clocking avoids clock distribution problems but has its own overhead. Multi phase domino clocking skew tolerant and low clock overhead; lots of power required and extra area. Wave pipelining avoids clock overhead problems, but is sensitive to skew and hence clock distribution.

Slide 21 Self-Timed Circuits Dual-Rail Logic Gate Completion Detection Logic Value Reset False True Invalid 00 01 10 11 Eval/ Hold a b AND Vdd a b y y A A B B C C Inputs... D Done Done 1 J. Rabaey,, Digital Integrated Circuits a Design Perspective, Prentice Hall 1996 ch.. 9. 2 T. Williams and M. Horowitz,, A zero-overhead self-timed 160nS 54-b CMOS divider, IEEE Journal of Solid-State Circuits, vol.. 26, pp.1651-1661, Nov. 1991.

Self-Timed Pipeline Ack Ack Req C C C C D D D D Req Data in Data out in Data Hold Eval Reset Reset Michael Flynn EE382 Winter/99 Slide 22 Domino Logic Domino Logic Domino Logic Domino Logic

Slide 23 Evaluation process C output is high for eval/hold; low for reset previous stage submits data; then req for eval(uation) D(one) signal is asserted when data inputs are available. This causes evaluation in this stage if its successor stage has been reset and its D signal is low. Overhead includes D and C logic and two segments of reset (precharge).

Slide 24 Self-Timed Circuit Summary Delay-Insensitive Technique (Both gate and propagation delay) Can use fast Domino Logic Dual-rail logic implementation requires more Area Significant Overhead on Cycle time.

Slide 25 Multi-Phase Domino Clock Techniques Uses Domino Logic for Data Storage and Logical functions Reduces Clocking Overhead (Clock Skew, Latch Setup and Hold, Time Stealing) D. Harris and M. Horowitz, Skew-Tolerant Domino Circuits, IEEE Journal of Solid-State Circuits, vol. 32, pp. 1702-1711, Nov. 1997.

Slide 26 Domino Logic AND Gate Vdd Clk a b a b y y Clk

Slide 27 4-Phase Overlapped Clock Pmax Phase 0 Phase 1 Phase 2 Phase 3 Eval 0 Eval 1 Eval 2 Skew Tolerance Eval 3 Eval 0 Eval 1 Eval 2 Eval 3 Pre 0 Pre 1 Pre 2 Pre 0 Pre 3 Pre 1

Slide 28 Wave Pipelining The ultimate limit on t Uses P min as storage instead of latches.

Slide 29 Wave Pipelining i th segment P max R s R D P min

Slide 30 At time = t 1 let data 1 proceed into pipeline stage It can be safely clocked at the destination latch at time t 3 t 3 = t 1 + P max + C But new data 2 can proceed into the pipeline earlier by an amount P min say, at time t 2, where t 2 = t 1 + P max + C - P min so that t 2 - t = P 1 max - P min + C = t t = the minimum cycle time for this segment

Slide 31 minimum system t = { max t i } over all i segments Note that data 1 still must be clocked into the destination at t 3 For wave pipelining to work properly, the clock must be constructively skewed so that the data wave and the clock arrive at the same time.

Slide 32 Let CSi = constructive clock skew for the i th pipeline stage Then : CS i = [Σ j=1 (P max + C) j ] mod t, summed to the i th stage the alternative is to force each stage to complete with the clock by adding delay, K, to both P max and P min, so that [Σ j=1 (P max + K + C) j ] mod t = 0 for all i stages since K is added to both P max and P min, t is unaffected

Slide 33 Example P max = 12 ns R s R D P min = 8 ns CLK CS 1 C= 1 ns

Slide 34 Wave and Optimum Pipelining Sopt = (1 b)(1 + k)t bc b in the above also acts as a limit on the usefulness of wave pipelining, since only those applications with low b or large S can effectively use the low t available from wave pipelining. These applications would include vector and signal processors.

Slide 35 Limits on Wave Pipelining The limit on the difference, P max - P min, has two components let v = P max - P min = f ( α,β) α is the static design variation and β is the environmental variance typically α = P max / P min is controllable to 1.1. Ignoring C this allows 10 waves of data in a pipeline. But usually β is a more constraining limit. Unless on-chip compensation (thru the power supply) is used the limit on β = P max / P min is only 2 or even 3 limiting the improvement on t to 3 or 2 times the conventional t

Slide 36 Summary Minimizing clock overhead is critical to high performance pipeline design Exploring limits for optimal pipelines can bound design space and give insight to tradeoff sensitivity Vector pipeline frequency is limited by variability in delay, not max delay Performance (throughput or frequency) improves as much from increasing minimum delay as from reducing max delay Wave pipelining and similar techniques may prove practical Rest of course will assume conventional clocking with cycle time set by max delay and clock skew