Name: Answers. Mean: 38, Standard Deviation: 15. ESE370 Fall 2012

University of Pennsylvania
Department of Electrical and Systems Engineering
Circuit-Level Modeling, Design, and Optimization for Digital Systems
ESE370, Fall 2012
Final
Friday, December 14

Problem weightings shown. Calculators allowed. Closed book = No text or notes allowed. Final answers here. Additional workspace in exam book. Note where to find work in exam book if relevant. Sign Code of Academic Integrity statement at back of exam book.

Name: Answers
Mean: 38, Standard Deviation: 15
This ended up being a more time-constrained exam than intended.

Default technology: 22nm Low Standby Power Process (LSTP)

$\gamma = 1$
$V_{dd} = 900$mV nominal
$V_{thn} = V_{thp} = 600$mV
$C_0 = 2 \times 10^{-17}$ F (for W = 1 device)
$I_{d,sat0} = 10$µA (for W = 1 device)
$I_{sd,leak0} = 0.3$pA (for W = 1 device)
velocity saturated operation
$R_{wire} = 700$KΩ/cm
$C_{wire} = 1.7$pF/cm

prefix   name    scale
G        Giga    $10^{9}$
M        Mega    $10^{6}$
K        Kilo    $10^{3}$
c        centi   $10^{-2}$
m        milli   $10^{-3}$
µ        micro   $10^{-6}$
n        nano    $10^{-9}$
p        pico    $10^{-12}$
f        femto   $10^{-15}$

Optimally buffered wiring:

$$L_{seg} = 2\sqrt{\frac{R_0 (\gamma + 1) C_0}{R_{wire} C_{wire}}} \qquad (1)$$

$$W_{buf} = \sqrt{\frac{R_0 C_{wire}}{2 R_{wire} C_0}} \qquad (2)$$

Transmission line:

$$w = \frac{c}{\sqrt{\epsilon_r \mu_r}} \qquad (3)$$

where $c = 3 \times 10^{8}$ m/s.

1. Communication over a distance. For this problem, you want to send a signal across 1 cm of an integrated circuit chip. You will evaluate delay and energy for 3 scenarios, then estimate how one will change when we scale technology. Show symbolic equations and final absolute numbers (ns, J).

(a) What is $R_0$ and $\tau$ for this technology? [5pts]

$R_0$: Symbolic $\frac{V_{dd}}{I_{d,sat0}}$, Absolute 90 KΩ
$\tau$: Symbolic $R_0 C_0$, Absolute 1.8ps

(b) What is the delay to send a bit from one end of the wire to the other on an unbuffered wire driven by a minimum size (W = 1) inverter? [5pts]

Symbolic: $2\gamma\tau + \left(R_0 + \frac{R_{wire} \cdot 1\text{cm}}{2}\right)(C_{wire} \cdot 1\text{cm}) + (R_0 + R_{wire} \cdot 1\text{cm}) \cdot 2C_0$
Absolute: 750ns

(c) What is the energy per bit transmitted for the unbuffered scenario above? [5pts]

Symbolic: $0.5\,(4C_0 + C_{wire} \cdot 1\text{cm})\,(V_{dd})^2$
Absolute: $7 \times 10^{-13}$ J
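The arithmetic behind parts (a) through (c) can be checked with a short script. This is a minimal sketch that plugs the default technology values into the symbolic expressions above; the variable names are illustrative and not part of the exam.

```python
# Default 22nm LSTP parameters, copied from the exam's technology table.
VDD = 0.9          # supply voltage, V
ID_SAT0 = 10e-6    # saturation current of a W = 1 device, A
C0 = 2e-17         # gate capacitance of a W = 1 device, F
GAMMA = 1.0        # ratio of self-load to gate capacitance
R_WIRE = 700e3     # wire resistance, ohm/cm
C_WIRE = 1.7e-12   # wire capacitance, F/cm
LENGTH_CM = 1.0    # wire length, cm

# (a) Unit drive resistance and unit delay.
R0 = VDD / ID_SAT0
TAU = R0 * C0

# (b) Elmore delay of a W = 1 inverter driving the wire into a W = 1 load.
r_w = R_WIRE * LENGTH_CM
c_w = C_WIRE * LENGTH_CM
t_unbuffered = 2 * GAMMA * TAU + (R0 + r_w / 2) * c_w + (R0 + r_w) * 2 * C0

# (c) Switching energy per bit.
e_unbuffered = 0.5 * (4 * C0 + c_w) * VDD ** 2

print(f"R0 = {R0 / 1e3:.0f} kOhm, tau = {TAU * 1e12:.1f} ps")
print(f"delay = {t_unbuffered * 1e9:.0f} ns, energy = {e_unbuffered:.1e} J")
```

The results land close to the quoted 90 KΩ, 1.8ps, 750ns, and $7 \times 10^{-13}$ J; the wire RC term dominates the delay by several orders of magnitude.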

(d) If you buffered the wire with a $W_{buf} = 74$ every $L_{seg} = 3.3$mm and try to drive the wire with minimum delay, what is the delay to send a bit from one end of the wire to the other, starting from a minimum size inverter as the input? Describe all the buffers you add to drive the wire and their size and placement. [10pts]

Hint: Write down equations with all the terms. Don't omit any before evaluating magnitude.

This should have been optimally buffered with $L_{seg} = 0.033$mm as a slight simplification on 0.034mm.

$$L_{seg} = 2\sqrt{\frac{R_0 (\gamma + 1) C_0}{R_{wire} C_{wire}}} = 2\sqrt{\frac{9 \times 10^{4} \times 2 \times 2 \times 10^{-17}}{7 \times 10^{5}/\text{cm} \times 1.7 \times 10^{-12}/\text{cm}}} = 2 \times 3 \times 2\,\text{cm}\,\sqrt{\frac{10^{-13}}{7 \times 1.7 \times 10^{-7}}} \qquad (4)$$

$$L_{seg} = 2 \times 3 \times 2\,\text{cm}\,\sqrt{\frac{10^{-6}}{7 \times 1.7}} = 0.0034\,\text{cm} \qquad (5)$$

Working the problem as stated: You first want to optimally buffer up to the W = 74 buffer. Staging up by 4, you have: W = 4, W = 16, W = 64, then the W = 74 of the buffer on the first segment. It takes $8\tau + 2\gamma\tau = 10\tau$ to drive from a W to a 4W buffer. So, it takes $10\tau$ each to drive the W = 4, W = 16, and W = 64 inverters. The W = 64 buffer driving the W = 74 buffer is $2.3\tau + 2\gamma\tau \approx 5\tau$. This means $35\tau$ before driving the line.

Buffers and sizing: W = 4, W = 16, W = 64

Symbolic: $35\tau + 3\left(2\tau + \frac{R_0}{74} \cdot 0.34\text{cm} \cdot C_{wire} + 2\tau\right) + 3\left(0.34\text{cm} \cdot R_{wire}\,(0.5 \cdot 0.34\text{cm} \cdot C_{wire} + 2 \cdot 74 \cdot C_0)\right)$
Absolute: 63ps + 3 × 0.7ns + 204ns ≈ 210ns

With $L_{seg} = 0.0034$cm, the delay is around 8.5ns.
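The segment-length formula and the buffered-delay expression can be evaluated numerically. The sketch below assumes the same $35\tau$ stage-up cost in both cases and rounds the number of optimal segments to `round(1 / 0.0034)`; the function and variable names are illustrative only.

```python
import math

R0 = 9e4           # unit drive resistance, ohm
C0 = 2e-17         # unit gate capacitance, F
GAMMA = 1.0
TAU = R0 * C0
R_WIRE = 700e3     # ohm/cm
C_WIRE = 1.7e-12   # F/cm

# Equations (1) and (2): optimal segment length (cm) and buffer width.
l_seg_opt = 2 * math.sqrt(R0 * (GAMMA + 1) * C0 / (R_WIRE * C_WIRE))
w_buf_opt = math.sqrt(R0 * C_WIRE / (2 * R_WIRE * C0))

def buffered_delay(seg_cm, n_seg, w_buf, stage_up_tau=35):
    """Stage-up delay plus n_seg identical buffered wire segments."""
    # Buffer self-load (2*tau), buffer driving the segment's wire capacitance,
    # and buffer driving the next buffer's gate (2*tau).
    driver = 2 * TAU + (R0 / w_buf) * seg_cm * C_WIRE + 2 * TAU
    # Distributed wire resistance charging its own capacitance and the next gate.
    wire = seg_cm * R_WIRE * (0.5 * seg_cm * C_WIRE + 2 * w_buf * C0)
    return stage_up_tau * TAU + n_seg * (driver + wire)

t_stated = buffered_delay(seg_cm=0.34, n_seg=3, w_buf=74)
t_optimal = buffered_delay(seg_cm=0.0034, n_seg=round(1.0 / 0.0034), w_buf=74)

print(f"L_seg = {l_seg_opt:.4f} cm, W_buf = {w_buf_opt:.0f}")
print(f"stated: {t_stated * 1e9:.0f} ns, optimal: {t_optimal * 1e9:.1f} ns")
```

Under these assumptions the stated segmentation gives a delay near 210ns, while the optimal segmentation comes out in the 8 to 9ns range, in line with the "around 8.5ns" remark.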

(e) What is the energy per bit transmitted for the buffered scenario above? [5pts]

Symbolic: $0.5\,\left(4\,(74 \times 3 + 64 + 16 + 4 + 1)\,C_0 + C_{wire} \cdot 1\text{cm}\right)(V_{dd})^2$
Absolute: $7 \times 10^{-13}$ J

This changes much more for the intended, optimally buffered case, rising to 1.4pJ.
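To see why the energy barely moves for three buffers but doubles for the optimally buffered wire, the symbolic expression can be evaluated for both buffer counts. This is a sketch; the count of optimal segments, `round(1 / 0.0034)`, is an assumption taken from the 0.0034cm segment length.

```python
VDD = 0.9          # V
C0 = 2e-17         # F
C_WIRE = 1.7e-12   # F/cm
LENGTH_CM = 1.0

def buffered_energy(buffer_widths):
    """0.5 * (4 * sum(W) * C0 + C_wire * length) * Vdd^2, as in part (e)."""
    gate_cap = 4 * sum(buffer_widths) * C0
    return 0.5 * (gate_cap + C_WIRE * LENGTH_CM) * VDD ** 2

stage_up = [1, 4, 16, 64]

# As stated: three W = 74 buffers, one per 0.33cm segment.
e_stated = buffered_energy(stage_up + [74] * 3)

# Intended case: one W = 74 buffer per 0.0034cm segment (assumed count).
n_optimal = round(LENGTH_CM / 0.0034)
e_optimal = buffered_energy(stage_up + [74] * n_optimal)

print(f"stated: {e_stated:.2e} J, optimal: {e_optimal:.2e} J")
```

With only three buffers the wire capacitance dominates, so the result stays near $7 \times 10^{-13}$ J; with a few hundred buffers the gate capacitance becomes comparable to the wire capacitance and the total approaches 1.4pJ.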

(f) Consider transmitting the bit for the 1 cm distance by sending it off chip through a Z = 50Ω transmission line on a PCB with $\epsilon_r = 4$, $\mu_r = 1$. Your design should include the circuitry for driving and terminating the transmission line with minimum delay starting from a minimum size inverter. What is the delay to send a bit from this inverter, through the output driver, across 1 cm on the PCB, through a receiver back onto the IC, including any settling time necessary? [10pts]

Show circuit and sizing:

[Figure: inverter chain W=1, W=4, W=16, W=64, W=256, W=1024, W=1800 driving a Z=50, L=1cm transmission line into a W=1 receiver.]

The simplest solution to properly drive the wire is series termination at the source. We want the equivalent resistance of the driver to be 50Ω. The sink then should be open circuit, which is roughly what we get when loaded with the small buffer.

$R_{src} = \frac{R_0}{W} = 50$Ω, so we get $W = \frac{90000}{50} = 1800$.

We scale up to the final W = 1800 buffer geometrically: W = 4, 16, 64, 256, 1024, then the final W = 1800 buffer to drive the line. This gives $10\tau$ per ×4 stage-up stage plus $\left(2 + \frac{3600}{1024}\right)\tau = 5.6\tau$ for the final stage.

Once we hit the wire, the signal travels at $w = \frac{c}{\sqrt{4 \times 1}} = c/2 = 1.5 \times 10^{8}$ m/s. The wire is 1cm long, so this takes $\frac{0.01}{1.5 \times 10^{8}} = 0.67 \times 10^{-10}$ s = 67ps.

Symbolic: $55.6\tau + \frac{1\text{cm}}{c/\sqrt{\mu_r \epsilon_r}}$
Absolute: 100ps + 67ps ≈ 170ps

(g) For the above case, assuming a pulse width equal to the delay found in part (f), what is the energy per bit transmitted for this off-chip transmission line scenario? [5pts]

The transmission line appears resistive to the driver. With proper termination, it is eventually terminated by a resistive load. Consequently, during the period of the pulse, the buffer drives current continuously rather than simply charging a capacitance.

$P = V I = \frac{V^2}{R}$. $E = \int P\,dt = P \cdot t_{pulse} = \frac{V^2}{R}\,t_{pulse}$

Each pulse is a low and high transition of the stage-up buffers.

Symbolic: $\frac{V^2}{Z_0}\,t_{pulse} + 2\,(1800 + 1024 + 256 + 64 + 16 + 4 + 1)\,C_0\,(V_{dd})^2$
Absolute: 2.75pJ + 0.13pJ ≈ 3pJ
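The driver sizing, flight time, and line energy can be reproduced numerically. The sketch below takes the $55.6\tau$ stage-up figure from the solution as given rather than rederiving it, and covers only the transmission-line term of the energy; the names are illustrative.

```python
import math

R0 = 9e4          # unit drive resistance, ohm
C0 = 2e-17        # unit gate capacitance, F
VDD = 0.9         # V
TAU = R0 * C0
Z0 = 50.0         # line impedance, ohm
EPS_R = 4.0
MU_R = 1.0
C_LIGHT = 3e8     # m/s
LENGTH_M = 0.01   # 1 cm of PCB trace

# Series termination at the source: driver resistance R0 / W equals Z0.
w_driver = R0 / Z0

# Propagation velocity and time of flight on the line, equation (3).
velocity = C_LIGHT / math.sqrt(EPS_R * MU_R)
t_flight = LENGTH_M / velocity

# Stage-up delay quoted in the solution (five x4 stages plus the final stage).
t_buffers = 55.6 * TAU
t_total = t_buffers + t_flight

# Energy dissipated while the driver sources current into the line for one pulse.
e_line = VDD ** 2 / Z0 * t_total

print(f"W = {w_driver:.0f}, flight = {t_flight * 1e12:.0f} ps, total = {t_total * 1e12:.0f} ps")
print(f"line energy = {e_line * 1e12:.2f} pJ")
```

The unrounded total delay is slightly under the 170ps used in the answer, so the line energy comes out slightly under the quoted 2.75pJ.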

(h) How does the energy and delay for the transmission line case change when we scale to an 11nm process? PCB parameters are unchanged. Process technology parameters at 11nm are:

$V_{dd} = 670$mV
$C_0 = 0.5 \times 10^{-17}$ F
$I_{d,sat0} = 8$µA
$R_{wire} = 7$MΩ/cm
$C_{wire} = 1.4$pF/cm

Give the final energy and delay for the transmission line case at 11nm. [5pts]

For the new technology, $R_0 = \frac{0.67}{8 \times 10^{-6}} \approx 84$KΩ and $\tau = 0.41$ps. The final buffer is now closer to W = 1700. Use the same buffering to stage up. Buffer delay becomes $\left(50 + 2 + \frac{2 \times 1700}{1024}\right)\tau = 55.3\tau$. The transmission line remains unchanged for delay.

Buffer capacitance is now $6200C_0$ instead of $6300C_0$, $C_0$ is one fourth the size, and $V_{dd}$ is about 25% lower. So, the contribution from the buffers becomes 0.014pJ. If we drive at $V_{dd}$, that saves us a factor of $\left(\frac{900}{670}\right)^2 = 1.8$. The pulse is now 90ps rather than 170ps, saving a factor of 1.9. This gives 0.8pJ for the transmission line.

                  Delay               Energy
Trans. Line (f)   23ps + 67ps = 90ps  0.81pJ
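The 11nm numbers follow from the same expressions with the new process parameters. This sketch reuses the 67ps flight time and the $6200C_0$ buffer capacitance stated above; the variable names are illustrative.

```python
VDD_11 = 0.67         # V
ID_SAT0_11 = 8e-6     # A
C0_11 = 0.5e-17       # F
Z0 = 50.0             # ohm, PCB line unchanged
T_FLIGHT = 67e-12     # s, unchanged from part (f)

R0_11 = VDD_11 / ID_SAT0_11
TAU_11 = R0_11 * C0_11

# Driver width that matches the 50 ohm line (the solution rounds this to 1700).
w_driver = R0_11 / Z0

# Same five x4 stages (50 tau), then the W = 1024 stage driving W = 1700.
t_buffers = (50 + 2 + 2 * 1700 / 1024) * TAU_11
t_total = t_buffers + T_FLIGHT

# Line energy over one pulse, plus the stage-up buffer switching energy.
e_line = VDD_11 ** 2 / Z0 * t_total
e_buffers = 6200 * C0_11 * VDD_11 ** 2

print(f"R0 = {R0_11 / 1e3:.1f} kOhm, tau = {TAU_11 * 1e12:.2f} ps, W = {w_driver:.0f}")
print(f"delay = {t_total * 1e12:.0f} ps, line = {e_line * 1e12:.2f} pJ, buffers = {e_buffers * 1e12:.3f} pJ")
```

The delay is dominated by the unchanged time of flight, so most of the energy saving comes from the lower supply voltage and the shorter pulse.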

2. Sequentially Accessed Memory. In this problem, we will consider an N-bit memory that we wish to read from sequentially (read 0, read 1, ... read N-1, read 0, read 1, ...). For this problem, ignore wire delay and wire capacitance. You will add that in Problem 3. Answers for this question will include N as a parameter, but otherwise should be reduced to constants. Show your symbolic formulation, circuit assumptions, and sizing details.

(a) Consider using the or-tree shown. Each stage uses one bit of the address to select the subtree, and only sends the address down the selected subtree. Provide optimized transistor-level logic in each tree stage to reduce the delay and report the resulting total delay per bit read. You may (should) change the exact decomposition into gates, choice of gates, and sizing as long as you achieve the same logical behavior. [10pts]

[Figure: or-tree memory. Labels: read enable, address counter, read data out; Mem bit W=1 and data bit, both access transistors have W=1, R0 drive for Mem bit; tree stage ports en, a3, a2, a1, read data, L.enable, R.enable, L.a1, R.a1, L.a0, R.a0, L.data, R.data.]

See next page for figure correction.

Should be:

[Figure: corrected or-tree memory. Same labels as the previous figure, with tree stage address inputs en, a2, a1, a0.]

(answers on following page)

Show optimized tree stage logic at transistor level. Hierarchical schematics acceptable. Annotate delay of gates at the logic gate stage level along the critical path.

[Figure: transistor-level schematics of the leaf and of two tree stages, with annotated delays for the components of the original and optimized critical paths. Recoverable annotations: Mem bit 4τ orig (5τ opt); given en for L or R subtree, 3τ; optimized a_i path with 4τ, 12τ, 5τ, 8τ and fanout to i of these at the stage that resolves a_i; en subtree, a_j (j<i), 6τ; enable load of 1 self + 2 subtree enables + (i-1) ands at this level = 2+i, giving 2(2+i)τ; 12τ (5τ at leaf); 16τ (12τ at leaf); inverter only at leaf; /a_j 12τ, 4τ; 9τ per or stage; s2 = W = 2 on series transistors.]

Keeping everything in the positive polarity demands extra buffers. By using a negative polarity for the address output, we can save one inverter in the address path. By alternating polarities at tree stages, we save the inverter in the or path. We can size the series transistors up to balance delays in the nor and nand gates. Note that we do not size up the nand gate at the high fanout point. At higher tree levels, the fanout of the enable becomes large and should be buffered. For simplicity, we omit that optimization in this analysis.

We pay $5\tau$ for the final address inverter at the leaf (but save $4\tau$ in loading), $3\tau$ for the inverter to turn off the pull down, then $5\tau$ for the enable transistor to drive the first nor at the bottom of the tree: $9\tau$.

Each stage requires $9\tau$ for the optimized or logic: $9\log_2(N)\,\tau$.

Addressing at each stage requires $(5 + 8 + 4 + 4i + 16)\tau = (33 + 4i)\tau$:

$$33\log_2(N)\,\tau + 4\tau\sum_{i=0}^{\log_2(N)} i \approx \left(33\log_2(N) + 2\log_2^2(N)\right)\tau \qquad (6)$$

Symbolic: $\left(2\log_2^2(N) + 42\log_2(N) + 9\right)\tau$
Absolute: $\log_2^2(N) \times 3.6\text{ps} + \log_2(N) \times 76\text{ps} + 16\text{ps}$
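The closed-form delay is easy to tabulate for a few memory sizes. This sketch simply evaluates the symbolic result with $\tau = 1.8$ps; the function name `read_delay` is illustrative, and `n_bits` is assumed to be a power of two.

```python
import math

TAU = 1.8e-12  # unit delay for the default technology, s

def read_delay(n_bits, tau=TAU):
    """Evaluate (2*log2(N)^2 + 42*log2(N) + 9) * tau for an N-bit or-tree memory."""
    levels = math.log2(n_bits)
    return (2 * levels ** 2 + 42 * levels + 9) * tau

for exponent in (10, 20, 33):
    delay_ns = read_delay(2 ** exponent) * 1e9
    print(f"N = 2^{exponent}: {delay_ns:.2f} ns")
```

The quadratic term comes from the address fanout growing by one load per level, so it overtakes the linear term only once the tree is a few tens of levels deep.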

This was too complicated for the time available on the exam. Things we were looking for:
- proper Elmore delay of logic
- some optimization of the logic
- include both addressing (down) and or (up) paths
- get proper dependence on N for stages (log(N))
- note fanout (but we acknowledge fully capturing the sigma impact of fanout was too much to ask)

(b) For the above design, what is the average energy per bit read?

i. on the data read path (from the enabled memory cell to the data read output). [5pts]

$C_{read} = (3 + 4\gamma)\,C_0 \log_2(N) + 2\gamma C_0$

Data Read
Symbolic: $0.5\left((3 + 4\gamma)\,C_0 \log_2(N) + 2\gamma C_0\right)(V_{dd})^2$
Absolute: $(5.7\log_2(N) + 1.6) \times 10^{-17}$ J

Scope of this piece was probably reasonable.

ii. on the address path (from the counter to the enables at the memory cells). [5pts]

Hint: How much energy is switched when address bit $a_i$ switches? How many times does bit $a_i$ switch when reading through the entire memory? You may use an upper bound approximation that is within a factor of two or leave results formulated as a summation.

The address enable logic is designed to send the address only down one subtree at a time. When the i-th bit changes, the address will be rerouted down the opposite subtree. $a_i$ switches every $2^i$-th cycle. It is rerouted every $2^{i+1}, 2^{i+2}, \ldots$ cycle as more significant bits of the address change. So, it toggles with probability $\sum_{j \ge i} 2^{-j} \approx 2^{1-i}$ per cycle. When it toggles, it switches in one tree stage at each level from the top (level $\log_2(N)$) down to level i. Summing the toggling across $a_i$:

$$\sum_{i=0}^{\log_2(N)} \left(2^{1-i}\,(\log_2(N) - i)\right) \le 4\log_2(N) \qquad (7)$$

A toggling address bit toggles capacitance $16C_0$ at a level (load of 3 on inputs of nand, 3 for self load of nand, 2 for the inverter it drives, for each of the 2 nand gates).

Each enable at level i toggles every $2^i$-th cycle based on address and every $2^{i+1}, 2^{i+2}, \ldots$ cycles as more significant bits cause enable changes. So this toggle looks like the address toggling, for a total toggle probability of $2^{1-i}$. When an enable toggles, it switches capacitance $(2i + 44)\,C_0$.

$$\sum_{i=0}^{\log_2(N)} \left(2^{1-i}\,(2i + 44)\,C_0\right) \le (4\log_2(N) + 88)\,C_0 \qquad (8)$$

Actually, the $\log_2(N)$ term will converge to a constant as well, making this a weak upper bound. However, since we already have a larger $\log_2(N)$ term for the address, the impact of this part of the over approximation is small.

Addressing
Symbolic: $(68\log_2(N) + 88)\,C_0\,(V_{dd})^2$
Absolute: $(110\log_2(N) + 140) \times 10^{-17}$ J

This was too complicated for the exam. We were looking for:
- observe different toggling frequencies, which means much less than assuming all toggle on every cycle
- at least get log(N) stage dependence
- see that this doesn't fan out to everything, so it does not have linear dependence
- formulation of capacitance per stage

Convergence of summation was beyond the scope of the exam.
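The address-toggle bound and the resulting energy can be checked numerically. This is a sketch that evaluates the summation of equation (7) directly and the final symbolic addressing energy; the function names are illustrative, and `levels` stands for $\log_2(N)$.

```python
def address_toggle_levels(levels):
    """Sum over address bits a_i of 2^(1-i) * (levels - i), as in equation (7)."""
    return sum(2.0 ** (1 - i) * (levels - i) for i in range(levels + 1))

def addressing_energy(levels, c0=2e-17, vdd=0.9):
    """Evaluate (68 * log2(N) + 88) * C0 * Vdd^2 from the answer above."""
    return (68 * levels + 88) * c0 * vdd ** 2

for levels in (10, 20, 33):
    toggles = address_toggle_levels(levels)
    print(f"log2(N) = {levels}: sum = {toggles:.2f}, bound = {4 * levels}, "
          f"energy = {addressing_energy(levels):.2e} J")
```

The summation stays just under $4\log_2(N)$ because the weights $2^{1-i}$ sum to at most 4, which is the geometric-series argument behind the bound.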

(c) What benefits do we get for delay and energy from using an or gate at each tree level in the previous design as opposed to a monolithic pass-gate mux design as shown below? Give an explanation using equations, but you do not need to calculate specific constants. [5pts]

[Figure: pass-gate mux memory. Labels: read enable, address counter, read data out; Mem bit, data bit; tree stage ports en, a3, a2, a1, read data, L.enable, R.enable, L.a1, R.a1, L.a0, R.a0, L.data, R.data.]

Explanation: The or-tree case has two key advantages. Data only needs to traverse through $\log_2(N)$ stages. The capacitance per stage is isolated. So, the or case only toggles capacitance proportional to $\log_2(N)$ and has delay proportional to $\log_2^2(N)$, due to $\log_2(N)$ delay per stage and $\log_2(N)$ stages. In contrast, the pass-gate case is directly exposed to N transistors. This makes the energy switched proportional to N. It also means the delay is at least proportional to N. The nor gate effectively isolates the path from the leaf to the root, reducing both the delay and the energy required.

This piece, at least, was reasonable. You should be able to see the major effects without fully working out all the capacitance and activity details needed for the earlier parts of the problem.

3. Scaling of Sequentially Accessed Memory. For this problem, we consider scaling the memory to a full chip with $N = 2^{33}$. We also consider the impact of wire delay in the or-tree (2a). Assume the top wire in the tree is 1cm long and every second tree level the wire length halves (1cm, 1cm, 0.5cm, 0.5cm, 0.25cm, ...). When wires are longer than $L_{seg} = 3.3$mm, buffer wires with W = 74 inverters as in 1d. Size up the or gates to match the wire buffering.

Hint: You should be able to adapt your results from problems 1d and 2a to answer a and b. Answers for this question should be reduced to absolute constants. Show your symbolic formulation and sizing details.

(a) What is the delay to read a bit? [5pts]

Total wire length is 4cm down and 4cm up, for a total of 8cm. We know the delay for 1cm of wire buffered this way from problem 1d. So, the wire delay is around $8 \times T_{1d} = 8 \times 210\text{ns} \approx 1700\text{ns}$.

Assume we scale everything in the or-tree up by 74 to match the wires. Then the delay within the or-tree remains the same. Using problem 2a, we know the delay is $33^2 \times 3.6\text{ps} + 33 \times 76\text{ps} + 16\text{ps} = 6.4\text{ns}$.

The memory cell stays the same, so the memory driving the first or-gate is actually slower, around $74 \times 3\tau \approx 0.4\text{ns}$.

Symbolic: $8T_{1d} + T_{2a}(N = 2^{33}) + W_{buf} \cdot 3\tau$
Absolute: 1700ns

With proper wire buffering, this is closer to 75ns.

(b) What fraction of the delay is due to the wires and wire buffering? (If $C_{wire} = 0$, what fraction of the delay would go away?) [5pts]

$8T_{1d} / \left(8T_{1d} + T_{2a}(N = 2^{33})\right) = \frac{1664}{1664 + 6.7}$
Absolute: 99.6%
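The wire-length total and the delay breakdown can be recomputed from the stated halving pattern. This sketch follows the solution's approximation of charging $T_{1d}$ per centimeter of wire in both directions, using the rounded 210ns figure; the variable names are illustrative.

```python
T_1D = 210e-9    # delay per cm of wire buffered as in problem 1d, s
TAU = 1.8e-12    # unit delay, s
W_BUF = 74       # scale factor applied to the or-tree gates
LEVELS = 33      # log2(N) for N = 2^33

# Wire length halves every second level: 1, 1, 0.5, 0.5, 0.25, ... cm.
wire_lengths_cm = [0.5 ** (level // 2) for level in range(LEVELS)]
one_way_cm = sum(wire_lengths_cm)

# Address travels down the tree and data travels back up.
t_wire = 2 * one_way_cm * T_1D
# Logic delay from problem 2a, unchanged when all gates scale together.
t_tree = (2 * LEVELS ** 2 + 42 * LEVELS + 9) * TAU
# Unscaled memory cell driving the first scaled-up or gate.
t_cell = W_BUF * 3 * TAU

t_total = t_wire + t_tree + t_cell
wire_fraction = t_wire / t_total

print(f"one-way wire = {one_way_cm:.3f} cm")
print(f"total = {t_total * 1e9:.0f} ns, wire fraction = {wire_fraction:.1%}")
```

The one-way length converges to just under 4cm, which is where the 8cm round-trip figure comes from, and the logic contributes well under 1% of the total.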

(c) At your identified delay time (assuming you use that as the cycle time) and assuming each memory bit leaks at $2I_{sd,leak0}$ and accounting for leakage within the addressing and or-tree, what is the leakage energy per read cycle? [5pts]

Hint: There are roughly four times as many total and gates in the original tree as or gates. You may use this relation to simplify your calculation.

- $2^{33}$ memory bits, each leaking as $2I_{sd,leak0}$.
- $2^{32}$ nor gates at the leaf, so about $2^{33}$ total nor gates, each leaking as $2I_{sd,leak0}$.
- 4 times as many nand gates makes $2^{35}$, each also leaking $2I_{sd,leak0}$.
- Fewer inverters than nand gates, so let's assume they are the same number to keep it simple: $2^{35}$. Each of these leaks as $I_{sd,leak0}$.

$I_{total\,leak} = \left(3 \times 2^{35} + 4 \times 2^{33}\right) I_{sd,leak0} = 2^{37}\,I_{sd,leak0} = 41$mA

Symbolic: $V_{dd}\,I_{total\,leak}\,T_{3a}$
Absolute: 56nJ

Why 4? ...actually it should be 5. There are as many enable and gates as there are or gates. There are twice as many $a_0$ and gates as there are or gates (2 because there is a left and right one at each stage). There are half as many $a_1$ and gates as $a_0$ and gates, and half as many $a_2$ as $a_1$, and so on. So, in total there are $2 \times 2 = 4$ times as many address and gates as or gates. Add the address and enable and gates and we get 5 times as many as there are or gates.
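The leakage tally can be confirmed with exact integer arithmetic. This sketch counts leakage in units of $I_{sd,leak0}$ using the gate counts listed above (with the factor-of-4 simplification from the hint); the variable names are illustrative.

```python
I_LEAK0 = 0.3e-12        # leakage of a W = 1 device, A

memory_bits = 2 ** 33    # each leaks 2 * I_LEAK0
nor_gates = 2 ** 33      # each leaks 2 * I_LEAK0
nand_gates = 2 ** 35     # 4x the nor gates, each leaks 2 * I_LEAK0
inverters = 2 ** 35      # assumed equal to the nand count, each leaks I_LEAK0

# Total leakage expressed in multiples of I_LEAK0.
leak_units = 2 * memory_bits + 2 * nor_gates + 2 * nand_gates + inverters
i_total = leak_units * I_LEAK0

print(f"leak units = 2^{leak_units.bit_length() - 1}, total = {i_total * 1e3:.0f} mA")
```

The four terms collapse to exactly $2^{37}$ leakage units, matching the grouping $(3 \times 2^{35} + 4 \times 2^{33})$ used in the solution.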

4. Bus data sequence. Consider the address bus from the previous problem. Particularly, consider the wires at the top of the tree that run side-by-side for a long distance. For the sequential access case, if we use a simple binary counter to address the memory, the value transitions on this bus are not random. If we are only accessing the values in sequence, we could recode the address sequence to use a Gray Code that has the property that successive values differ in a single bit position (e.g. for a 3-bit Gray Code: 000 001 011 010 110 111 101 100). Assume the wire-to-wire capacitance is equal to the wire-to-ground capacitance. Without loss of generality, consider the timing impact on the middle bit in a 3-bit bus.

(a) Identify how the worst-case transmission speed of the middle bit differs in the three cases (random data, binary-counter sequence, Gray Code sequence). For full credit, explain why it is different in each case and estimate the magnitude of the difference. [5pts]

Case: random
Delay (compare random): 0
Explain: It is possible for the middle bit to switch in the opposite direction of the two wires that surround it. Consider a 010 to 101 transition. Here, we pay $2C_{w2w}$ capacitance to each of the adjacent wires for a total of $4C_{w2w}$ capacitance on top of the $C_{w2g}$ capacitance. Total Cap = $5C_{w2g}$

Case: binary
Delay (compare random): -20%
Explain: It will never be the case that both the surrounding bits are changing in the direction opposite to the middle bit. Consider: 000 001 010 011 100 101 110 111. The worst case is when one adjacent bit switches in the opposite direction (e.g. 001 to 010, 101 to 110). So, worst-case we pay $2C_{w2w}$ capacitance for double switching on one side plus $C_{w2w}$ capacitance for single switching on the other (001 to 010) on top of the $C_{w2g}$ capacitance. Total Cap = $4C_{w2g}$

Case: Gray Code
Delay (compare random): -40%
Explain: Since only one wire switches, we never have simultaneous switching. We may need to charge the adjacent wires for a total addition of $2C_{w2w}$ capacitance, but never for a 2× voltage switch. Total Cap = $3C_{w2g}$

The adjacent wire drivers will also be helping fill the wire-to-wire capacitance. If (as they likely are) they are sized the same, the burden for switching the wire-to-wire capacitance is split, reducing the impact. If we assume half of the charge is provided by the driver on each side, then the relative capacitances are 3, 2.5, and 2. Same trend, just slightly different absolute percentages.
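The three worst cases can be found by brute force over every transition of a 3-bit bus. The sketch below assumes a simple Miller-style model that is not spelled out in the exam: a static neighbor contributes $C_{w2w}$, a neighbor switching the opposite way contributes $2C_{w2w}$, and a neighbor switching the same way contributes nothing. Capacitances are in units of $C_{w2g}$, and all names are illustrative.

```python
from itertools import product

def middle_bit_cap(prev, nxt, c_w2g=1.0, c_w2w=1.0):
    """Effective capacitance driven by the middle wire, or 0.0 if it does not switch."""
    def bit(value, k):
        return (value >> k) & 1

    d_mid = bit(nxt, 1) - bit(prev, 1)
    if d_mid == 0:
        return 0.0
    cap = c_w2g
    for k in (0, 2):  # the two neighbors of the middle wire
        d_nb = bit(nxt, k) - bit(prev, k)
        if d_nb == 0:
            cap += c_w2w          # static neighbor
        elif d_nb == -d_mid:
            cap += 2 * c_w2w      # neighbor switches the opposite way
        # Assumed: a neighbor switching the same way adds no load.
    return cap

def cyclic_pairs(seq):
    """Consecutive pairs of a sequence, including the wrap-around transition."""
    return [(seq[i], seq[(i + 1) % len(seq)]) for i in range(len(seq))]

def worst_case(pairs, c_w2w=1.0):
    return max(middle_bit_cap(a, b, c_w2w=c_w2w) for a, b in pairs)

random_pairs = list(product(range(8), repeat=2))
binary_pairs = cyclic_pairs(list(range(8)))
gray_pairs = cyclic_pairs([0b000, 0b001, 0b011, 0b010, 0b110, 0b111, 0b101, 0b100])

for name, pairs in (("random", random_pairs), ("binary", binary_pairs), ("Gray", gray_pairs)):
    print(name, worst_case(pairs), worst_case(pairs, c_w2w=0.5))
```

Setting `c_w2w=0.5` is one way to model the case where each neighboring driver supplies half of the coupling charge.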

(b) Assuming the counter and the bus buffers for the entire bus all share a $V_{dd}$ pin and a ground pin, how does the worst-case noise on these power pins differ between the two cases (binary counter, Gray Code counter)? For full credit, explain why it is different in each case and estimate the magnitude of the difference. [5pts]

Let B be the number of bits in the bus.

Case: binary counter
Change in noise level: 0
Explain: Here, all lines could switch, for a $\Delta I$ of $B \cdot I_{d,sat}$.

Case: Gray Code counter
Change in noise level: $\frac{1}{B}$
Explain: Here, only one line could switch, for a $\Delta I$ of $I_{d,sat}$. Noise voltage goes as $L\,dI/dt$, so the noise is a factor of B lower.
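The factor-of-B difference follows from the maximum number of lines that can switch in a single step. This sketch counts bit flips between consecutive values for both encodings, including the wrap-around step; it uses the standard binary-reflected Gray Code `i ^ (i >> 1)`, which reproduces the 3-bit sequence given earlier. The function names are illustrative.

```python
def binary_code(i):
    return i

def gray_code(i):
    """Binary-reflected Gray Code of i."""
    return i ^ (i >> 1)

def max_switching_lines(encode, bits):
    """Largest number of bus lines that flip between consecutive counter values."""
    n = 1 << bits
    return max(
        bin(encode(i) ^ encode((i + 1) % n)).count("1")
        for i in range(n)
    )

for bits in (3, 8, 12):
    print(bits, max_switching_lines(binary_code, bits), max_switching_lines(gray_code, bits))
```

For a B-bit bus the binary counter can flip all B lines at once (for example at the roll-over), while the Gray Code counter never flips more than one, which is where the $1/B$ reduction in worst-case $\Delta I$ comes from.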
