L15: Custom and ASIC VLSI Integration
6.111 Spring 2004, Introductory Digital Systems Laboratory
Curt Schurgers
[Figure: Average cost of one transistor in dollars, log scale from 10 down to 0.0000001, versus year from '68 to '02. Source: Gordon Moore, Keynote Presentation at ISSCC 2003.]
Acknowledgement: J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits: A Design Perspective, Prentice Hall, 2003.
Moore's Law
[Figure: Log2 of the number of components per integrated function versus year, 1959 to 1975. Source: Electronics, April 19, 1965.]
In 1965, Gordon Moore was preparing a speech and made a memorable observation. When he started to graph data about the growth in memory chip performance, he realized there was a striking trend. Each new chip contained roughly twice as much capacity as its predecessor, and each chip was released within 18-24 months of the previous chip. If this trend continued, he reasoned, computing power would rise exponentially over relatively brief periods of time.
[Figure: Transistors shipped per year, log scale from 10^9 to 10^18, versus year from '68 to '02. Source: Dataquest/Intel, 12/02; Gordon Moore, Keynote Presentation at ISSCC 2003.]
"E.O. Wilson, the famous Harvard biologist who is an expert on ants, estimates that there are between 10 to the 16th and 10 to the 17th ants on earth. But if you look at this curve, this year we're making one transistor for every ant." (Gordon Moore, An Update on Moore's Law)
[Figure: Transistors per processor in millions (MT), log scale from 0.001 to 1000, versus year from 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium and P6 processors. Annotation: 2X growth in 1.96 years! Courtesy of S. Borkar (Intel).]
Layout 101
[Figure: A CMOS circuit with one P-channel and one N-channel MOSFET shown three ways. Circuit representation: IN, OUT, VDD, GND, with gate (G), source (S) and drain (D) terminals and device dimensions Wp, Lp (P-channel) and Wn, Ln (N-channel). 3-D cross-section: p-type substrate and n-type well. Layout: metal, poly, n+ diff and p+ diff layers, with a metal/pdiff contact and a contact from metal to ndiff.]
Follow simple design rules (a contract between process and circuit designers).
Custom Design/Layout
[Figure: Die photograph of the Itanium integer datapath (dimension marker: 1000 um). Stages from "From register files / Cache / Bypass" to "To register files / Cache": multiplexers, shifter, adder stage 1, adder stage 2, adder stage 3, sum select, with loopback bus and wiring tracks between stages. Bit slices 0 to 63 sit side by side.]
[Figure: Block diagram of one integer execution unit: 9-1, 5-1 and 2-1 multiplexers selecting operands a and b, CARRYGEN, SUMGEN + LU (LU: Logical Unit), SUMSEL and REG producing sum and sumb, with an output to cache. Labeled signals: g64, node1, ck1, s0, s1.]
Itanium has 6 integer execution units like this.
Bit-slice design methodology:
Hand crafting the layout to achieve maximum clock rates (> 1 GHz)
Exploits regularity in datapath structure to optimize interconnects
The ASIC Approach
[Figure: Flow chart of the steps below, with a design-iteration loop and three abstraction levels: behavioral, structural, physical.]
1. Design capture in Verilog (or VHDL)
2. Pre-layout simulation
3. Logic synthesis
4. Floorplanning
5. Placement
6. Routing
7. Circuit extraction
8. Post-layout simulation
9. Tape-out
Most common design approach for designs up to 500 MHz clock rates.
Standard Cell Example
3-input NAND cell (from ST Microelectronics)
[Figure: Cell layout running between the power supply line (VDD) and the ground supply line (GND), with its delay characterization. Delay is in ns; C = load capacitance, T = input rise/fall time.]
Each library cell (FF, NAND, NOR, INV, etc.), in every size variation (strength of the gate), is fully characterized across temperature, loading, etc.
Standard Cell Layout Methodology
2-level metal technology:
With limited interconnect layers, dedicated routing channels between rows of standard cells are needed
Width of the cell is allowed to vary to accommodate complexity
Current day technology:
Cell structure is hidden under the interconnect layers
Interconnect plays a significant role in the speed of a digital circuit.
Verilog to ASIC Layout (the push button approach)
    module adder64 (a, b, sum);
      input  [63:0] a, b;
      output [63:0] sum;
      assign sum = a + b;
    endmodule
[Figure: Layout of adder64 after synthesis, after placement, and after routing.]
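One detail of the adder64 module is easy to miss: sum is 64 bits wide, so the carry out of the most significant bit is discarded. A minimal behavioral sketch of that arithmetic in Python follows (the names WIDTH and MASK are introduced here for illustration only; this models the numeric result, not the synthesized hardware).

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1  # 64 ones: keeps only the low 64 bits


def adder64(a, b):
    """Model of `assign sum = a + b;` with 64-bit a, b and sum."""
    # Python integers are unbounded, so the carry-out is dropped explicitly.
    return (a + b) & MASK


if __name__ == "__main__":
    print(adder64(1, 2))
    print(adder64(MASK, 1))  # all ones plus one wraps around
```

If the carry is needed, the usual fix is to declare a 65-bit result in the Verilog source.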
The Design Closure Problem
[Figure: Bus wires with segments labeled d1, l1 and d2, l2 and capacitances C_L, C_C and C_I; annotated with a ratio λ = 5.]
Wire-to-wire capacitance causes inter-wire delay dependencies.
[Figure: Iterative removal of timing violations (white lines).]
Macro Modules
[Figure: 256×32 (or 8192 bit) SRAM generated by a hard-macro module generator.]
Generate highly regular structures (entire memories, multipliers, etc.) with a few lines of code
Verilog models for memories are automatically generated based on size
Clock Distribution
[Figure: Clock tree driving D/Q registers; clock skew map, courtesy Alpha; IBM clock routing.]
For a 1 GHz clock, the skew budget is 100 ps.
Variations along different paths arise from:
Device: VT, W/L, etc.
Environment: VDD, °C
Interconnect: dielectric thickness variation
The Power Supply Wires are Not Ideal!
[Figure: Driver and receiver connected to the VDD grid and the ground grid through pads, with driver resistance R_d, capacitances C_d and C_int, and coupling capacitance C_coup. Courtesy David Blaauw.]
The IR-drop problem causes the internal power supply voltage to be less than the external source.
Analog Circuits: Clock Frequency Multiplication (Phase Locked Loop)
[Figure: PLL block on the Intel 486 (50 MHz); PLL block diagram with up and down signals. Courtesy M. Perrott.]
VCO produces a high frequency square wave
Divider divides down the VCO frequency
PFD compares the phase of ref and div
Loop filter extracts phase error information
Used widely in digital systems for clock synthesis (a standard IP block in most ASIC flows)
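Because the PFD compares ref against the divided VCO output, the loop settles when the two match, which puts the VCO at divide_by times the reference frequency. The sketch below is a toy frequency-only model of that feedback, not a circuit-level PLL: the update rule, the gain value and the 25 MHz reference are assumptions chosen for illustration and are not taken from the slide.

```python
def locked_output_frequency(f_ref_hz, divide_by):
    """In lock, div matches ref, so the VCO runs at divide_by * f_ref."""
    return f_ref_hz * divide_by


def simulate_loop(f_ref_hz, divide_by, gain=0.5, steps=60, f_start_hz=0.0):
    """Toy loop: nudge the VCO frequency until div equals ref.

    Hypothetical first-order update; a real PLL acts on phase and
    includes a loop filter with its own dynamics.
    """
    f_vco = f_start_hz
    for _ in range(steps):
        f_div = f_vco / divide_by           # divider output
        error = f_ref_hz - f_div            # stand-in for the PFD output
        f_vco += gain * divide_by * error   # stand-in for filter + VCO
    return f_vco


if __name__ == "__main__":
    print(locked_output_frequency(25e6, 2))
    print(simulate_loop(25e6, 2))
```

With gain = 0.5 the remaining frequency error halves on every step, so the loop converges for any starting frequency.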
Behavioral Transformations
There are a large number of implementations of the same functionality.
Each implementation represents a different point in the area-time-power design space.
Behavioral transformations allow exploring the design space at a high level.
[Figure: Design space with axes time, power and area.]
Optimization metrics:
1. Area of the design
2. Throughput or sample time T_S
3. Latency: clock cycles between the input and the associated output change
4. Power consumption
5. Energy of executing a task
6. ...
Fixed-Coefficient Multiplication
Conventional multiplication: $Z = X \cdot Y$
                            X3    X2    X1    X0
                      ×     Y3    Y2    Y1    Y0
    ----------------------------------------------
                            X3Y0  X2Y0  X1Y0  X0Y0
                      X3Y1  X2Y1  X1Y1  X0Y1
                X3Y2  X2Y2  X1Y2  X0Y2
          X3Y3  X2Y3  X1Y3  X0Y3
    ----------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0
Constant multiplication (becomes hardwired shifts and adds): $Z = X \cdot (1001)_2$
                            X3    X2    X1    X0
                      ×     1     0     0     1
    ----------------------------------------------
                            X3    X2    X1    X0
          X3    X2    X1    X0
    ----------------------------------------------
    Z7    Z6    Z5    Z4    Z3    Z2    Z1    Z0
$Y = (1001)_2 = 2^3 + 2^0$, so $Z = (X \ll 3) + X$. Shifts are implemented using wiring.
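The constant case can be sketched in a few lines of Python: every 1 bit in the constant contributes one shifted copy of X, and the copies are summed. The function names are hypothetical and the code models the arithmetic only; in hardware the shifts cost nothing beyond wiring, so the adder count is what matters.

```python
def multiply_by_constant(x, constant):
    """Multiply x by a non-negative integer constant using shifts and adds."""
    result = 0
    shift = 0
    while constant:
        if constant & 1:            # this bit of the constant is 1
            result += x << shift    # add the shifted copy of x
        constant >>= 1
        shift += 1
    return result


def adders_needed(constant):
    """Adders required: one fewer than the number of 1 bits."""
    ones = bin(constant).count("1")
    return max(ones - 1, 0)


if __name__ == "__main__":
    print(multiply_by_constant(13, 0b1001))  # (13 << 3) + 13
    print(adders_needed(0b1001))
```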
Transform: Canonical Signed Digits (CSD)
Canonical signed digit representation is used to increase the number of zeros. It uses digits {-1, 0, 1} instead of only {0, 1}.
Iterative encoding: replace each string of consecutive 1s
    0  1  1 ...  1  1     $= 2^{N-2} + \dots + 2^1 + 2^0$
with
    1  0  0 ...  0 -1     $= 2^{N-1} - 2^0$
Worst case: CSD has 50% nonzero bits.
Example, starting from 01101111:
    0  1  1  0  1  1  1  1
    0  1  1  1  0  0  0 -1
    1  0  0 -1  0  0  0 -1
[Figure: Z formed from X << 7, X << 4 and X.] $Z = (X \ll 7) - (X \ll 4) - X$
Shift translates to re-wiring.
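The encoding can be checked with a short Python sketch. It uses the standard digit-by-digit recoding (look at the two low bits to decide between +1 and -1) rather than the string-replacement procedure on the slide; both produce the same digits for the slide's example. The function names are introduced here for illustration.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while n:
        if n & 1:
            # Low two bits 01 -> digit +1; low two bits 11 -> digit -1,
            # which turns a run of 1s into a single carry further up.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts, adds and subtracts."""
    return sum(d * (x << i) for i, d in enumerate(digits))


if __name__ == "__main__":
    digits = to_csd(0b01101111)
    print(digits[::-1])             # most significant digit first
    print(csd_multiply(5, digits))  # 5 * 111
```

For 01101111 the plain binary form has six nonzero bits (five adders), while the CSD form has three nonzero digits (two add/subtract units).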
Algebraic Transformations
Commutativity: A + B = B + A
Distributivity: (A + B) C = AC + BC
Associativity: (A + B) + C = A + (B + C)
Common sub-expressions: a sub-expression that appears more than once is computed once and shared
[Figure: Dataflow graphs before and after each transformation, with inputs A, B, C and X, Y.]
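To see why distributivity matters for hardware cost, the sketch below evaluates AC + BC and (A + B)C while counting operators. The OpCounter class is a hypothetical helper written for this illustration; it assumes ideal integer arithmetic, so finite word-length effects are ignored.

```python
class OpCounter:
    """Counts the multipliers and adders used by an expression."""

    def __init__(self):
        self.mults = 0
        self.adds = 0

    def mul(self, a, b):
        self.mults += 1
        return a * b

    def add(self, a, b):
        self.adds += 1
        return a + b


def expanded(ops, a, b, c):
    """AC + BC: two multipliers, one adder."""
    return ops.add(ops.mul(a, c), ops.mul(b, c))


def factored(ops, a, b, c):
    """(A + B)C: one multiplier, one adder."""
    return ops.mul(ops.add(a, b), c)


if __name__ == "__main__":
    for form in (expanded, factored):
        ops = OpCounter()
        print(form.__name__, form(ops, 3, 4, 5), ops.mults, ops.adds)
```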
Transforms for Efficient Resource Utilization
[Figure: Dataflow graph with inputs A through I, scheduled over time steps 1 and 2, shown before and after applying distributivity.]
Time multiplexing: mapped to 3 multipliers and 3 adders
After distributivity: number of operators reduced to 2 multipliers and 2 adders
A Very Useful Transform: Retiming
Retiming is the action of moving delays around in the system.
Delays have to be moved from ALL inputs to ALL outputs or vice versa.
Cutset retiming: a cutset is a set of edges whose removal splits the graph into two disjoint partitions. To retime, delays are moved from the ingoing to the outgoing edges of the cut or vice versa.
Benefits of retiming:
Modify critical path delay
Reduce total number of registers
Retiming Example: FIR Filter
$y(n) = h(n) * x(n) = \sum_{i=0}^{K} x(n-i)\,h(i)$
[Figure: Direct form with input x(n), coefficient multipliers h(0), h(1), h(2), h(3) and output y(n), with a legend for the multiplication symbol. Delay annotations: (10) and (4).]
Direct form
  apply associativity of the addition: T_clk = 22 ns
  retime
Transposed form: T_clk = 14 ns
Note: here we use a first-cut analysis that assumes the delay of a chain of operators is the sum of their individual delays. This is not accurate.
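The equivalence of the two forms, and the clock periods quoted on the slide, can be reproduced with a small Python sketch. It assumes that the (10) and (4) annotations are the multiplier and adder delays in ns, which is consistent with 10 + 3×4 = 22 and 10 + 4 = 14; the function names are hypothetical, and the timing uses the same first-cut sum-of-delays model as the slide.

```python
def fir_direct(x, h):
    """Direct form: delay line on the input, one long adder chain."""
    taps = [0] * len(h)  # holds x(n), x(n-1), ...
    y = []
    for sample in x:
        taps = [sample] + taps[:-1]
        y.append(sum(c * t for c, t in zip(h, taps)))
    return y


def fir_transposed(x, h):
    """Transposed form: registers sit between the adders."""
    regs = [0] * (len(h) - 1)
    y = []
    for sample in x:
        products = [c * sample for c in h]
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register captures its product plus the next register's value.
        regs = [
            products[i + 1] + (regs[i + 1] if i + 1 < len(regs) else 0)
            for i in range(len(regs))
        ]
    return y


def critical_path_direct(n_taps, t_mult, t_add):
    """One multiplier followed by a chain of n_taps - 1 adders."""
    return t_mult + (n_taps - 1) * t_add


def critical_path_transposed(t_mult, t_add):
    """One multiplier followed by a single adder."""
    return t_mult + t_add


if __name__ == "__main__":
    x, h = [1, 2, 3, 4, 5], [1, 2, 3, 4]
    print(fir_direct(x, h) == fir_transposed(x, h))
    print(critical_path_direct(4, 10, 4), critical_path_transposed(10, 4))
```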
Pipelining, Just Another Transformation
(Pipelining = Adding Delays + Retiming)
Contrary to retiming, pipelining adds extra registers to the system.
How to pipeline:
1. Add extra registers at all inputs
2. Retime
[Figure: Example circuit after adding input registers and again after retiming.]
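The effect of the two steps can be sketched behaviorally in Python: once the added register has been retimed to sit between two stages of logic, the output values are unchanged but arrive one clock cycle later. The stage functions below are hypothetical placeholders for blocks of combinational logic.

```python
def stage1(v):
    """Hypothetical first block of combinational logic."""
    return v * 3


def stage2(v):
    """Hypothetical second block of combinational logic."""
    return v + 1


def run_unpipelined(inputs):
    """Both stages evaluate in the same clock cycle."""
    return [stage2(stage1(v)) for v in inputs]


def run_pipelined(inputs):
    """A register between the stages; None marks the not-yet-valid first output."""
    pipeline_reg = None
    outputs = []
    for v in inputs:
        # Stage 2 works on the value captured in the previous cycle.
        outputs.append(None if pipeline_reg is None else stage2(pipeline_reg))
        # At the clock edge the register captures the stage 1 result.
        pipeline_reg = stage1(v)
    return outputs


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(run_unpipelined(data))
    print(run_pipelined(data))
```

The clock period is then set by the slower of the two stages instead of their sum, at the cost of one extra cycle of latency.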
The Power of Transforms: Lookahead
Original recursion: $y(n) = x(n) + A\,y(n-1)$
[Figure: First-order recursive structure with feedback gain A.] Try pipelining this structure.
Loop unrolling: $y(n) = x(n) + A\,[x(n-1) + A\,y(n-2)]$
Distributivity: $y(n) = x(n) + A\,x(n-1) + A^2\,y(n-2)$
Associativity, then retiming: the feedback loop now contains the gain $A^2$ (precomputed) and two delays.
[Figure: Structure after each step, with gains A and A^2.] How about pipelining this structure!
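A quick numerical check of the lookahead form in Python, assuming zero initial conditions (y(n) = 0 and x(n) = 0 for n < 0); the function names are introduced here for illustration. Both recursions produce the same output, but the second depends on y(n-2) instead of y(n-1), which leaves two delays in the loop to pipeline across.

```python
def recursive_direct(x, a):
    """y(n) = x(n) + A*y(n-1)."""
    y_prev = 0
    out = []
    for sample in x:
        y_prev = sample + a * y_prev
        out.append(y_prev)
    return out


def recursive_lookahead(x, a):
    """y(n) = x(n) + A*x(n-1) + A^2*y(n-2), with A^2 precomputed."""
    a_squared = a * a
    x_prev = 0      # x(n-1)
    y1 = y2 = 0     # y(n-1), y(n-2)
    out = []
    for sample in x:
        y = sample + a * x_prev + a_squared * y2
        out.append(y)
        x_prev = sample
        y2, y1 = y1, y
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(recursive_direct(x, 3))
    print(recursive_lookahead(x, 3))
```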
Scan Testing
[Figure: Registers with 0/1 multiplexers on their inputs controlled by ScanShift, chained from "shift in" to "shift out" and clocked by CLK. Courtesy of C. Terman and IEEE Press.]
Idea: have a mode in which all registers are chained into one giant shift register which can be loaded/read out bit serially. Test the remaining (combinational) logic by:
(1) in test mode, shift in new values for all register bits, thus setting up the inputs to the combinational logic
(2) clock the circuit once in normal mode, latching the outputs of the combinational logic back into the registers
(3) in test mode, shift out the values of all register bits and compare against expected results.
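The three steps can be sketched as a small Python model. The ScanChain class and example_logic function are hypothetical, written only to illustrate the procedure; the model treats the combinational logic as a function from the current register bits to the next register bits.

```python
class ScanChain:
    """Registers that either capture logic outputs or shift serially."""

    def __init__(self, n_bits, comb_logic):
        self.regs = [0] * n_bits
        self.comb_logic = comb_logic

    def clock(self, scan_shift, shift_in=0):
        """One clock edge. Returns the shifted-out bit in test mode."""
        if scan_shift:
            shift_out = self.regs[-1]
            self.regs = [shift_in] + self.regs[:-1]
            return shift_out
        # Normal mode: latch the combinational outputs into the registers.
        self.regs = list(self.comb_logic(self.regs))
        return None


def scan_test(chain, pattern):
    """Apply one test pattern and return the captured register values."""
    # (1) Shift in; the last pattern bit goes first so regs ends up == pattern.
    for bit in reversed(pattern):
        chain.clock(True, bit)
    # (2) One clock in normal mode.
    chain.clock(False)
    # (3) Shift out; bits emerge last register first, so reverse them.
    shifted = [chain.clock(True, 0) for _ in pattern]
    return shifted[::-1]


def example_logic(regs):
    """Hypothetical 3-bit combinational block under test."""
    return [regs[0] ^ regs[1], regs[1] & regs[2], 1 - regs[2]]


if __name__ == "__main__":
    chain = ScanChain(3, example_logic)
    print(scan_test(chain, [1, 0, 1]))
```

The captured values are then compared against the expected response for that pattern, as in step (3).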
Trends: "Chip in a Day" (Matlab/Simulink to Silicon)
[Figure: Chip layout with blocks labeled Mult1, Mult2, Mac1, Mac2, S reg, X reg, and Add, Sub, Shift. Courtesy of R. Brodersen.]
Map algorithms directly to silicon - bypass writing Verilog!
Trends: Watermarking of Digital Designs
Fingerprinting is a technique to deter people from illegally redistributing legally obtained IP by enabling the author of the IP to uniquely identify the original buyer of the resold copy.
The essence of the watermarking approach is to encode the author's signature. The selection, encoding, and embedding of the signature must result in minimal performance and storage overhead.
[Figure: Two layouts with the same functionality, same area and same performance; a watermark of 4768 bits is embedded in one of them. Courtesy of G. Qu, M. Potkonjak.]