EECS 151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: Nick Weaver & John Wawrzynek Lecture 10 EE141 1
What do ASIC/FPGA Designers need to know about physics?
Physics effect: area → cost; delay → performance; energy → performance & cost.
Ideally: zero delay, area, and energy. However, the physical devices occupy area, take time, and consume energy.
A CMOS process lets us build transistors, wires, and connections, and we get capacitors, inductors, and resistors whether or not we want them.
Performance, Cost, Power
How do we measure performance? Operations/sec? Cycles/sec?
Performance is directly proportional to clock frequency, although that may not be the entire story.
Ex: CPU execution time = # instructions × CPI × clock period (performance is the inverse of execution time).
Spring 2018 EECS151 Page 3
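A minimal Python sketch of the instructions × CPI × clock period relation. The instruction count, CPI values, and clock rates are made-up numbers for illustration, not measurements.

```python
def cpu_time(instructions, cpi, clock_period_s):
    """Execution time = instruction count x cycles per instruction x clock period."""
    return instructions * cpi * clock_period_s

# Hypothetical workload: 1e9 instructions
t_1ghz = cpu_time(1e9, 1.5, 1e-9)              # 1 GHz clock, CPI = 1.5
t_2ghz = cpu_time(1e9, 1.5, 0.5e-9)            # 2 GHz clock, same CPI: half the time
t_2ghz_worse_cpi = cpu_time(1e9, 3.0, 0.5e-9)  # 2 GHz clock, but CPI doubled
```

Doubling the clock rate halves execution time only if CPI stays put; in the third case the faster clock buys nothing, which is why clock frequency alone is not the entire story.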
Limitations on Clock Rate
1. Logic gate delay
2. Delays in flip-flops
What are typical delay values? Both times contribute to limiting the clock period.
What must happen in one clock cycle for correct operation? All signals connected to FF (or memory) inputs must be ready and set up before the rising edge of the clock.
For now we assume perfect clock distribution (all flip-flops see the clock at the same time).
Example: parallel to serial converter circuit
[Figure: parallel to serial converter circuit with signals clk, a, and b.]
$T \ge \text{time}(clk \to Q) + \text{time}(mux) + \text{time}(setup)$
$T \ge \tau_{clk \to Q} + \tau_{mux} + \tau_{setup}$
In General
For correct operation: $T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$ for all paths.
How do we enumerate all paths? Any circuit input or register output to any register input or circuit output.
Note: setup time for outputs is a function of what they connect to; clk-to-q for circuit inputs depends on where they come from.
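The "for all paths" condition can be sketched in a few lines of Python. The path names and delays below are hypothetical, and the sketch assumes a single clk-to-q and setup value for every path, which the note above says is not true at circuit inputs and outputs.

```python
def min_clock_period(paths, t_clk_q, t_setup):
    """Smallest T with T >= t_clk_q + t_CL + t_setup on every path.

    paths maps a path name to its combinational delay t_CL (one time unit throughout).
    Returns (minimum period, name of the critical path).
    """
    critical = max(paths, key=paths.get)
    return t_clk_q + paths[critical] + t_setup, critical

# Hypothetical combinational delays in picoseconds
paths = {"regA->regB": 420, "regB->regC": 610, "in->regA": 150}
T_min, critical = min_clock_period(paths, t_clk_q=60, t_setup=40)
```

Only the slowest path sets the period; every other path simply has slack.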
Gate Delay
Modern CMOS gate delays are on the order of a few picoseconds (however, highly dependent on gate context).
Often expressed as FO4 (fan-out of 4) delays, a process-independent delay metric: the delay of an inverter driven by an inverter 4x smaller than itself and driving an inverter 4x larger than itself.
For a 90nm process, FO4 is around 20ps; less than 10ps for a 32nm process.
Path Delay
For correct operation: Total Delay $\le$ clock_period - FFsetup_time - FFclk_to_q on all paths.
High-speed processors' critical paths have around 20 FO4 delays.
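As a rough illustration of the path-delay budget, the sketch below converts a clock period into FO4 delays. The 20 ps FO4 figure is the 90nm value quoted earlier; the clock period, setup time, and clk-to-q values are assumptions chosen only to make the arithmetic easy.

```python
def logic_budget_fo4(clock_period_ps, t_setup_ps, t_clk_q_ps, fo4_ps):
    """Logic delay allowed per cycle, expressed as a number of FO4 delays."""
    return (clock_period_ps - t_setup_ps - t_clk_q_ps) / fo4_ps

# Assumed: 2 GHz clock (500 ps), 40 ps setup, 60 ps clk-to-q, FO4 = 20 ps
budget = logic_budget_fo4(clock_period_ps=500, t_setup_ps=40, t_clk_q_ps=60, fo4_ps=20)
```

With these numbers the logic between registers gets 20 FO4 delays, in line with the figure given for high-speed processors.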
FO4 Delays per clock period
[Figure: CPU clock periods 1985-2005, in FO4 delays per cycle (0 to 100) versus year (85 to 05). Processors plotted: Intel 386, 486, Pentium, Pentium Pro, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; Mips; HP PA; Power PC; AMD K6, K7, x86-64. Annotations: MIPS 2000, 5 pipeline stages; Pentium Pro, 10 pipeline stages; Pentium 4, 20 pipeline stages; historical limit: about 12.]
Thanks to Francois Labonte, Stanford.
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years. Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University.
[Figure: FO4 delays per cycle for processor designs; FO4/cycle (0 to 140) versus year (1985 to 2015).]
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
Gate Delay
What determines the actual delay of a logic gate? Transistors are not perfect switches: they cannot change terminal voltages instantaneously.
Consider the NAND gate: delay $\propto C_L / I$.
The current (I) value depends on process parameters and transistor size.
$C_L$ (capacitance of the load) models the gate output, wire, and inputs to the next stage.
C integrates I, creating a voltage change at the output.
More on transistor Current
Transistors act like a cross between a resistor and a current source.
$I_{SAT}$ depends on process parameters (higher for nfets than for pfets) and transistor size (layout): $I_{SAT} \propto W/L$
Physical Layout determines FET strength
The switch-level abstraction gives a good way to understand the function of a circuit: nfet (g = 1 ? short circuit : open), pfet (g = 0 ? short circuit : open).
Understanding delay means going below the switch-level abstraction to transistor physics and layout details.
Transistors as water valves (Cartoon physics)
If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets.
An "on" p-fet fills up the capacitor with charge. [Figure: charging from Vdd; output goes from 0 to 1; water level versus time.]
An "on" n-fet empties the bucket. [Figure: discharging; output goes from 1 to 0; water level versus time.]
CS 250 L4: Timing, UC Regents Spring 2016 UCB
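The Switch Dynamic Model (Simplified)
[Figure: MOS transistor (terminals G, S, D) and its switch model: a switch controlled by $V_{GS}$ relative to $V_T$, with on-resistance $R_{on}$ between S and D and capacitances $C_G$, $C_S$, $C_D$ at the gate, source, and drain.]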
Switch Sizing
What happens if we make a switch W times larger (wider)?
[Figure: the switch model scaled by W: on-resistance $R_{on}/W$; capacitances $W C_G$, $W C_S$, $W C_D$.]
Switch Parasitic Model
The pull-down switch (NMOS).
[Figure: minimum-size switch with resistance $R_N$, gate capacitance $C_G$, drain capacitance $C_D$; sizing the transistor by a factor W gives $R_N/W$, $W C_G$, $W C_D$.]
We assume transistors of minimal length (or at least constant length). R's and C's are in units of per unit width.
Switch Parasitic Model
The pull-up switch (PMOS).
[Figure: three cases. Minimum-size switch: $R_P = 2R_N$, $C_G$, $C_D$. Sized for symmetry: $R_N$, $2C_G$, $2C_D$. General sizing: $R_N/W$, $2WC_G$, $2WC_D$.]
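Inverter Parasitic Model
Drain and gate capacitance of a transistor are directly related by the process: $C_D = \gamma C_G$ ($\gamma \approx 1$).
[Figure: inverter switch model with resistance $R_N/W$, input capacitance $C_{in} = 3WC_G$, and internal capacitance $C_{int} = 3WC_D = 3W\gamma C_G$.]
$t_p = 0.69\,\frac{R_N}{W}(3W\gamma C_G) = 0.69\,(3\gamma) R_N C_G$
The intrinsic delay of the inverter is independent of size.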
Turning Rise/Fall Delay into Gate Delay
Cascaded gates: [Figure: chain of gates whose signals alternate between 1→0 and 0→1 transitions from stage to stage, with the transfer curve for an inverter.]
The Switch Inverter: Transient Response
$V(t) = V_0 e^{-t/RC}$
$t_{1/2} = \ln(2)\,RC$
$t_{pHL} = f(R_{on} C_L) = 0.69\,R_n C_L$
[Figure: inverter switch model with input capacitance $C_{in}$ for (a) low-to-high and (b) high-to-low transitions.]
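A quick numerical check of where the 0.69 factor comes from: it is ln(2), the time (in units of RC) for an exponential discharge to reach half its starting voltage. The resistance, capacitance, and supply values are hypothetical.

```python
import math

def v_out(t, v0, r, c):
    """Voltage on a capacitor discharging through a resistor: V0 * exp(-t / RC)."""
    return v0 * math.exp(-t / (r * c))

def t_half(r, c):
    """Time to reach 50% of the initial voltage: ln(2) * RC, about 0.69 RC."""
    return math.log(2) * r * c

# Hypothetical values: 10 kOhm on-resistance, 2 fF load, 1.0 V supply
R_ON, C_L, VDD = 10e3, 2e-15, 1.0
t_phl = t_half(R_ON, C_L)                    # roughly 13.9 ps
v_at_t_phl = v_out(t_phl, VDD, R_ON, C_L)    # half of VDD
```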
More on CL
Everything that connects to the output of a logic gate (or transistor) contributes capacitance: transistor drains, interconnection (wires/contacts/vias), and transistor gates.
Inverter with Load Capacitance
[Figure: inverter switch model with resistance $R_N/W$, input capacitance $C_{in} = 3WC_G$, internal capacitance $C_{int} = 3W\gamma C_G$, and external load $C_L$.]
$t_p = 0.69\,\frac{R_N}{W}(C_{int} + C_L) = 0.69\,\frac{R_N}{W}(3W\gamma C_G + C_L) = 0.69\,(3 R_N C_G)\left(\gamma + \frac{C_L}{C_{in}}\right) = t_{inv}(\gamma + f)$
f = fanout = ratio between load and input capacitance of the gate.
Inverter Delay Model
$t_p = t_{inv}(\gamma + f)$
$t_{inv}$ is a technology constant and can be dropped from the expression; delay then becomes a unit-less variable (expressed in unit delays): $t_p = \gamma + f$
[Figure: delay versus f, with the value $\gamma$ marked on the delay axis.]
Question: how does transistor sizing (W) impact delay?
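One way to explore the sizing question is to hold the external load fixed and sweep the inverter width in the unit-less model. This is a sketch under the slides' model (C_in = 3·W·C_G, delay = γ + f); the 48 C_G load is a made-up value.

```python
def inverter_delay(w, c_load, gamma=1.0, c_g=1.0):
    """Delay in units of t_inv: gamma + f, with f = C_L / C_in and C_in = 3 * W * C_G."""
    c_in = 3 * w * c_g
    return gamma + c_load / c_in

# Fixed external load of 48 C_G (hypothetical), inverter width swept
delays = {w: inverter_delay(w, c_load=48.0) for w in (1, 2, 4, 8, 16)}
```

Widening the inverter lowers its own delay toward the floor of γ set by self-loading, but its input capacitance grows by the same factor, so whatever drives it sees a proportionally larger load.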
Wire Delay
Ideally, wires behave as transmission lines: the signal wave-front moves close to the speed of light (~1 ft/ns). The time from source to destination is called the transit time.
In ICs most wires are short, so transit times are small compared to the clock period and can be ignored. Not so on PC boards.
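Wires
As parallel plate capacitors: $C \propto \text{Area} = \text{width} \times \text{length}$
Wires have finite resistance, so they have distributed R and C. With r = res/length and c = cap/length, delay $\propto rcL^2 \cong rc + 2rc + 3rc + \dots$
[Figure: wire modeled as an RC ladder with nodes v1, v2, v3, v4, and the node voltages versus time.]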
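The rc + 2rc + 3rc + ... sum can be checked numerically. The sketch below uses an Elmore-style estimate for a ladder of equal RC sections and compares an unbuffered wire with one cut in two by a buffer. The r and c values, the 20 ps buffer delay, and the idealized buffer (no input capacitance, no output resistance) are assumptions made for illustration.

```python
def elmore_delay(r_per_len, c_per_len, length, segments=1000):
    """Delay estimate to the far end of a wire modeled as a ladder of equal RC sections.

    The capacitor of section k sees the resistance of k sections: rc + 2rc + 3rc + ...
    """
    dl = length / segments
    r_seg, c_seg = r_per_len * dl, c_per_len * dl
    return sum(k * r_seg * c_seg for k in range(1, segments + 1))

def buffered_delay(r_per_len, c_per_len, length, pieces, t_buf):
    """Same wire cut into equal pieces, with an idealized buffer (delay t_buf) between pieces."""
    per_piece = elmore_delay(r_per_len, c_per_len, length / pieces)
    return pieces * per_piece + (pieces - 1) * t_buf

# Hypothetical wire: r = 1 ohm/um, c = 0.2 fF/um
R_UM, C_UM = 1.0, 0.2e-15
d_1mm = elmore_delay(R_UM, C_UM, 1000.0)   # about 100 ps
d_2mm = elmore_delay(R_UM, C_UM, 2000.0)   # about 400 ps: 2x the length, 4x the delay
d_2mm_buffered = buffered_delay(R_UM, C_UM, 2000.0, pieces=2, t_buf=20e-12)
```

Cutting the wire in half turns one quadratic term into two terms that are each a quarter as large, which is why rebuffering long wires pays off even after adding the buffer delay.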
Wire Delay
Even in those cases where the transmission line effect is negligible, wires possess distributed resistance and capacitance. The time constant associated with distributed RC is proportional to the square of the length.
For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important. Typically around half of the C of a gate load is in the wires.
For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, so the distributed RC effect dominates. Signals are typically rebuffered to reduce delay.
[Figure: RC ladder with nodes v1, v2, v3, v4, and the node voltages versus time.]
Recall: Positive edge-triggered flip-flop
A flip-flop samples right before the edge, and then holds the value.
[Figure: flip-flop with D, Q, and clk, built from a sampling circuit followed by a circuit that holds the value; annotation: "Clock to Q delay results fr..."]
EECS 151 UC Regents Spring 2018 UCB
Sensing: When clock is low
[Figure: the flip-flop with clk = 0. The sampling circuit will capture the new value on the posedge; the holding circuit outputs the last value captured.]
Capture: When clock goes high
[Figure: the flip-flop with clk = 1. The sampling circuit remembers the value just captured; the holding circuit outputs the value just captured.]
Flip Flop delays: clk-to-q? setup? hold?
[Figure: flip-flop circuit with D, Q, and CLK, and a timing diagram marking setup, hold, and clk-to-q.]
CLK == 0: sense D, but Q outputs the old value.
CLK 0->1: capture D, pass the value to Q.
Timing Analysis and Logic Delay
[Figure: a register (an array of flip-flops) feeding combinational logic.]
If our clock period T > worst-case delay through CL, does this ensure correct operation?
Flip-Flop delays eat into time budget
[Figure: combinational logic (ALU) between registers, with the ALU time budget marked.]
$T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$
Clock skew also eats into time budget
[Figure: two register-CL-register circuits driven by CLK and a delayed clock CLKd; clock skew is delay in distribution; truncated annotation: "...ost modern large high-performance chi..."]
As $T \to 0$, which circuit fails first?
$T \ge T_{CL} + T_{setup} + T_{clk \to Q} + \text{worst case skew}$
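A small sketch of the setup check with skew included. The sign convention is an assumption of this example: positive skew means the capturing flip-flop's clock arrives early and costs time; negative skew means it arrives late and gives time back. All delays are hypothetical picosecond values.

```python
def setup_slack(T, t_clk_q, t_cl, t_setup, skew):
    """Slack against T >= t_CL + t_setup + t_clk_q + skew; negative slack is a violation."""
    return T - (t_cl + t_setup + t_clk_q + skew)

# Hypothetical path: T = 1000, clk-to-q = 60, CL = 850, setup = 40 (all in ps)
no_skew = setup_slack(1000, 60, 850, 40, skew=0)       # meets timing
bad_skew = setup_slack(1000, 60, 850, 40, skew=80)     # capture clock early: violation
good_skew = setup_slack(1000, 60, 850, 40, skew=-80)   # capture clock late: extra time
```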
Clock Tree Delays, IBM Power CPU
[Figure: clock delay plotted over chip position (x, y), showing the grid, tuned sector trees, sector buffers, buffer level 2, and buffer level 1.]
CS 250 L3: Timing, UC Regents Fall 2013 UCB
Clock Tree Delays, IBM Power
[Figure: clock waveforms, volts (0.0 to 1.5 V) versus time (0 to 2500 ps), showing 20 ps skew; clock delay over chip position (x, y); multiple-fingered transmission line.]
Clock Skew (cont)
[Figure: register-CL-register circuit with clock skew (delay in distribution). Note the reversed buffer.]
In this case, clock skew actually provides extra time (adds to the effective clock period). This effect has been used to help run circuits at higher clock rates. Risky business!
Components of Path Delay
1. # of levels of logic
2. Internal cell delay
3. Wire delay
4. Cell input capacitance
5. Cell fanout
6. Cell output drive strength
Who controls the delay?
Component | Foundry engineer (TSMC) | Library developer (Artisan) | CAD tools (DC, IC Compiler) | Designer (you!)
1. # of levels | | | synthesis | RTL
2. Internal cell delay | physical parameters | cell topology, trans. sizing | cell selection |
3. Wire delay | physical parameters | | place & route | layout generator
4. Cell input capacitance | physical parameters | cell topology, trans. sizing | cell selection | instantiation
5. Cell fanout | | | synthesis | RTL
6. Cell drive strength | physical parameters | transistor sizing | cell selection | instantiation
From Delay Models to Timing Analysis
Timing analysis: what is the smallest T that produces correct operation? Or, can we meet a target T?
f | T
1 MHz | 1 μs
10 MHz | 100 ns
100 MHz | 10 ns
1 GHz | 1 ns
Timing Closure: Searching for and beating down the critical path
Must consider all connected register pairs and paths, plus paths from input to register and from register to output.
Design tools help in the search:
Synthesis tools work to meet the clock constraint and report delays on paths.
Special static timing analyzers accept a design netlist and report path delays.
And, of course, simulators can be used to determine timing performance.
Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).
Timing Analysis, real example
[Figure: histogram of late-mode timing checks (thousands, 0 to 200) versus timing slack (ps), with the critical path marked.]
Most paths have hundreds of picoseconds to spare.
From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J.D. Warnock et al.
Timing Optimization
As an ASIC/FPGA designer you get to choose:
The algorithm
The microarchitecture (block diagram)
The RTL description of the CL blocks (number of levels of logic)
Where to place registers and memory (the pipelining)
Overall floorplan and relative placement of blocks
How to retime logic
Circles are combinational logic, labelled with delays. The critical path is 5. We want to improve it without changing circuit semantics.
[Figure 1: A small graph before retiming. The nodes represent logic delays (1, 1, 1, 1, 2, 2), with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.]
Add a register, move one circle. The clock period shrinks by 20% (from 5 to 4).
[Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.]
Logic synthesis tools can do this in simple cases.
From "Post-Placement C-slow Retiming for the Xilinx Virtex FPGA", Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek, UC Berkeley, Berkeley, CA.
Floorplanning: essential to meet timing
[Figure: floorplan of the Intel XScale 80200.]
Timing Analysis Tools
Static Timing Analysis: tools use delay models for gates and interconnect, and trace through circuit paths.
Cell delay model capture: for each input/output pair, an internal delay (output load independent) plus an output-load-dependent delay. [Figure: delay versus output load.]
Available as standalone tools (PrimeTime) and as part of logic synthesis.
Back-annotation takes information from the results of place and route to improve the accuracy of timing analysis.
DC in topographical mode uses preliminary layout information to model interconnect parasitics. Prior versions used a simple fan-out model of gate loading.
EECS151, UC Berkeley Sp18
Hold-time Violations
[Figure: flip-flop (FF) with d, q, and clk.]
Some state elements have positive hold time requirements. How can this be?
Fast paths from one state element to the next can create a violation. (Think about shift registers!)
CAD tools do their best to fix violations by inserting delay (buffers). Of course, if the path is delayed too much, then cycle time suffers.
This is difficult because buffer insertion changes layout, which changes path delay.
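The hold check is the fast-path counterpart of the setup check. The sketch below uses the standard constraint t_clk-q(min) + t_CL(min) ≥ t_hold + skew, which is not spelled out on the slide, plus a naive buffer count. The delay values and the fixed per-buffer delay are hypothetical, and the count ignores the layout feedback the slide warns about.

```python
import math

def hold_slack(t_clk_q_min, t_cl_min, t_hold, skew=0):
    """Slack against t_clk_q(min) + t_CL(min) >= t_hold + skew.

    Convention assumed here: skew > 0 means the capturing flip-flop's clock arrives late.
    """
    return t_clk_q_min + t_cl_min - (t_hold + skew)

def buffers_needed(slack, t_buffer):
    """Number of delay buffers (each adding t_buffer) needed to bring slack up to >= 0."""
    return 0 if slack >= 0 else math.ceil(-slack / t_buffer)

# Shift-register style path: no logic between flip-flops (values in ps, hypothetical)
slack = hold_slack(t_clk_q_min=30, t_cl_min=0, t_hold=20, skew=40)
n_buf = buffers_needed(slack, t_buffer=12)
```

Note that the clock period does not appear anywhere in the check: a hold violation cannot be fixed by slowing the clock, only by adding delay to the fast path.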
Driving Large Loads
Large fanout nets: clocks, resets, memory bit lines, off-chip.
A relatively small driver results in a long rise time (and thus a large gate delay).
Strategy: staged buffers.
How to optimally scale drivers? What is the optimal trade-off between delay per stage and total number of stages?
Inverter Chain
[Figure: chain of inverters from In to Out, driving load $C_L$.]
For some given $C_L$: How many stages are needed to minimize delay? How to size the inverters?
Anyone want to guess the solution?
Careful about Optimization Problems
You get the fastest delay if you build one very big inverter, so big that the delay is set only by self-loading. [Figure: inverter driving $C_{load}$.]
Likely not the problem you're interested in: someone has to drive this inverter.
Delay Optimization Problem #1
You are given: a fixed number of inverters; the size of the first inverter; the size of the load that needs to be driven.
Your goal: minimize the delay of the inverter chain.
Need a model for inverter delay vs. size.
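Apply to Inverter Chain
[Figure: chain of N inverters (1, 2, ..., N) from In to Out, driving $C_L$.]
$t_p = t_{p1} + t_{p2} + \dots + t_{pN}$
$t_{pj} = t_{inv}\left(\gamma + \frac{C_{in,j+1}}{C_{in,j}}\right)$
$t_p = \sum_{j=1}^{N} t_{pj} = t_{inv}\sum_{j=1}^{N}\left(\gamma + \frac{C_{in,j+1}}{C_{in,j}}\right), \quad C_{in,N+1} = C_L$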
Optimal Sizing for Given N
The delay equation has N-1 unknowns, $C_{in,2}, \dots, C_{in,N}$.
To minimize the delay, find the N-1 partial derivatives and set them to zero:
$t_p = \dots + t_{inv}\frac{C_{in,j}}{C_{in,j-1}} + t_{inv}\frac{C_{in,j+1}}{C_{in,j}} + \dots$
$\frac{\partial t_p}{\partial C_{in,j}} = t_{inv}\frac{1}{C_{in,j-1}} - t_{inv}\frac{C_{in,j+1}}{C_{in,j}^2} = 0$
Optimal Sizing for Given N (cont'd)
Result: every stage has equal fanout (f): $\frac{C_{in,j}}{C_{in,j-1}} = \frac{C_{in,j+1}}{C_{in,j}}$
The size of each stage is the geometric mean of its two neighbors: $C_{in,j} = \sqrt{C_{in,j-1}\,C_{in,j+1}}$
Equal fanout → every stage will have the same delay.
Optimum Delay and Number of Stages
When each stage has the same fanout f: $f^N = F = C_L / C_{in,1}$
Fanout of each stage: $f = \sqrt[N]{F}$
Minimum path delay: $t_p = N t_{inv}\left(\gamma + \sqrt[N]{F}\right)$
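Example
[Figure: three inverters of sizes 1, f, and $f^2$ from In to Out; the first has input capacitance $C_1$ and the chain drives $C_L = 8C_1$.]
$C_L/C_1$ has to be evenly distributed across N = 3 stages: $f = \sqrt[3]{8} = 2$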
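The equal-fanout result can be turned into a short sizing routine. This is a sketch in the unit-less delay model (delay in units of t_inv, γ = 1 assumed); the function name and return format are illustrative choices.

```python
def size_chain(c_in1, c_load, n, gamma=1.0):
    """Equal-fanout sizing of an n-stage inverter chain.

    Returns (fanout per stage, input capacitance of each stage, delay in units of t_inv).
    """
    F = c_load / c_in1                       # overall effective fanout
    f = F ** (1.0 / n)                       # each stage gets the N-th root of F
    sizes = [c_in1 * f**j for j in range(n)] # C_in,1, f*C_in,1, f^2*C_in,1, ...
    delay = n * (gamma + f)
    return f, sizes, delay

# The example above: C_L = 8 C_1 spread over N = 3 stages
f, sizes, delay = size_chain(c_in1=1.0, c_load=8.0, n=3)
```

For this case each stage is twice the size of the one before it, and the last stage (size 4) drives the load of 8 with the same fanout of 2.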
Delay Optimization Problem #2
You are given: the size of the first inverter; the size of the load that needs to be driven.
Your goal: minimize delay by finding the optimal number and sizes of gates.
So, need to find the N that minimizes: $t_p = N t_{inv}\left(\gamma + \sqrt[N]{C_L / C_{in}}\right)$
Untangling the Optimization Problem
Rewrite N in terms of fanout/stage f: $f^N = C_L / C_{in} \Rightarrow N = \frac{\ln(C_L / C_{in})}{\ln f}$
$t_p = N t_{inv}\left((C_L / C_{in})^{1/N} + \gamma\right) = t_{inv}\ln(C_L / C_{in})\,\frac{f + \gamma}{\ln f}$
$\frac{\partial t_p}{\partial f} = t_{inv}\ln(C_L / C_{in})\,\frac{\ln f - 1 - \gamma/f}{\ln^2 f} = 0$
$f = \exp(1 + \gamma/f)$ (no explicit solution)
For $\gamma = 0$: $f = e$, $N = \ln(C_L / C_{in})$
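Since f = exp(1 + γ/f) has no explicit solution, it can be solved numerically. The sketch below uses plain fixed-point iteration starting from f = e; that method is a choice made for this example, not something the slides prescribe.

```python
import math

def optimal_fanout(gamma, tol=1e-12, max_iter=1000):
    """Solve f = exp(1 + gamma / f) by fixed-point iteration, starting from f = e."""
    f = math.e
    for _ in range(max_iter):
        f_next = math.exp(1.0 + gamma / f)
        if abs(f_next - f) < tol:
            return f_next
        f = f_next
    raise RuntimeError("fixed-point iteration did not converge")

f_gamma0 = optimal_fanout(0.0)   # no self-loading: f = e
f_gamma1 = optimal_fanout(1.0)   # close to 3.6
```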
Optimum Effective Fanout f
The optimum f for a given process is defined by $\gamma$: $f = \exp(1 + \gamma/f)$
$f_{opt} = 3.6$ for $\gamma = 1$
[Figure: $f_{opt}$ (2.5 to 5) versus $\gamma$ (0 to 3); the curve starts at e for $\gamma = 0$.]
In Practice: Plot of Total Delay
[Figure: plot of total delay, from Hodges, p. 281.]
Why the shape? Curves are very flat for f > 2. Simplest/most common choice: f = 4.
Normalized Delay As a Function of F
$t_p = N t_{inv}\left(\gamma + \sqrt[N]{F}\right), \quad F = C_L / C_{in}$
[Rabaey: page 210] ($\gamma = 1$)
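To see how the normalized delay behaves as F grows, the formula can be evaluated for a single inverter, two stages, and the best whole number of stages. This is a sketch with γ = 1; the direct search over N and the cap of 20 stages are choices made for this example.

```python
def chain_delay(F, n, gamma=1.0):
    """Delay in units of t_inv for n equal-fanout stages: n * (gamma + F**(1/n))."""
    return n * (gamma + F ** (1.0 / n))

def best_stage_count(F, gamma=1.0, max_n=20):
    """Whole number of stages that minimizes chain_delay, found by direct search."""
    return min(range(1, max_n + 1), key=lambda n: chain_delay(F, n, gamma))

# Map each F to (unbuffered delay, two-stage delay, best N, delay at best N)
table = {}
for F in (10, 100, 1000, 10000):
    n = best_stage_count(F)
    table[F] = (chain_delay(F, 1), chain_delay(F, 2), n, chain_delay(F, n))
```

The unbuffered delay grows linearly with F, while the optimally staged chain grows only logarithmically, with a per-stage fanout that stays near 3 to 4.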
Conclusion
Timing Optimization: you start with a target clock period. What control do you have?
The biggest effect is RTL manipulation, i.e., how much logic to put in each pipeline stage. We will be talking later about how to manipulate RTL for better timing results.
In most cases, the tools will do a good job at the logic/circuit level: logic-level manipulation, transistor sizing, buffer insertion. But some cases may be difficult and you may need to help.
The tools will need some help at the floorplan and layout.