Finding good prefix networks using Haskell. Mary Sheeran (Chalmers)

Similar documents
Area-Time Optimal Adder with Relative Placement Generator

EXAMINATION in Hardware Description and Verification

Boolean Circuit Optimization

Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space

PanHomc'r I'rui;* :".>r '.a'' W"»' I'fltolt. 'j'l :. r... Jnfii<on. Kslaiaaac. <.T i.. %.. 1 >

Homework 4 due today Quiz #4 today In class (80min) final exam on April 29 Project reports due on May 4. Project presentations May 5, 1-4pm

EECS150 - Digital Design Lecture 23 - FSMs & Counters

ECE 645: Lecture 3. Conditional-Sum Adders and Parallel Prefix Network Adders. FPGA Optimized Adders

Boolean Algebra. Digital Logic Appendix A. Postulates, Identities in Boolean Algebra How can I manipulate expressions?

For smaller NRE cost For faster time to market For smaller high-volume manufacturing cost For higher performance

Boolean Algebra. Digital Logic Appendix A. Boolean Algebra Other operations. Boolean Algebra. Postulates, Identities in Boolean Algebra

Computational Boolean Algebra. Pingqiang Zhou ShanghaiTech University

Optimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space

ISSN (PRINT): , (ONLINE): , VOLUME-4, ISSUE-10,

Sequential Logic Optimization. Optimization in Context. Algorithmic Approach to State Minimization. Finite State Machine Optimization

Lecture 4. Adders. Computer Systems Laboratory Stanford University

Digital Logic Appendix A

CS 140 Lecture 14 Standard Combinational Modules

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or

Learning Outcomes. Spiral 2 4. Function Synthesis Techniques SHANNON'S THEOREM

1 Circuits (20 points)

Value-Ordering and Discrepancies. Ciaran McCreesh and Patrick Prosser

Algorithms PART II: Partitioning and Divide & Conquer. HPC Fall 2007 Prof. Robert van Engelen

Parallelism in Computer Arithmetic: A Historical Perspective

1 Terminology and setup

Arithmetic in Integer Rings and Prime Fields

CSC236 Week 3. Larry Zhang

Binary Adder Circuits of Asymptotically Minimum Depth, Linear Size, and Fan-Out Two

Design of Arithmetic Logic Unit (ALU) using Modified QCA Adder

Skew Management of NBTI Impacted Gated Clock Trees

Design and Implementation of Efficient Modulo 2 n +1 Adder

Week-I. Combinational Logic & Circuits

Section 3: Combinational Logic Design. Department of Electrical Engineering, University of Waterloo. Combinational Logic

Issues on Timing and Clocking

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Digital Logic

Proofs Not Based On POMI

Chapter 5: Selectors, Shifters, Priority Encoders & Half-Decoders

CSE Introduction to Parallel Processing. Chapter 2. A Taste of Parallel Algorithms

Subquadratic space complexity multiplier for a class of binary fields using Toeplitz matrix approach

Digital System Design Combinational Logic. Assoc. Prof. Pradondet Nilagupta

EE/CSCI 451: Parallel and Distributed Computation

Department of ECE, Vignan Institute of Technology & Management,Berhampur(Odisha), India

Discrete Mathematics. CS204: Spring, Jong C. Park Computer Science Department KAIST

University of Toronto Faculty of Applied Science and Engineering Final Examination

Optimal Metastability-Containing Sorting Networks

Dynamic Programming. Credits: Many of these slides were originally authored by Jeff Edmonds, York University. Thanks Jeff!

b + O(n d ) where a 1, b > 1, then O(n d log n) if a = b d d ) if a < b d O(n log b a ) if a > b d

Galois Field Algebra and RAID6. By David Jacob

Intro To Digital Logic

Combinational Logic Design

Functional Big-step Semantics

CSE477 VLSI Digital Circuits Fall Lecture 20: Adder Design

Logic Synthesis and Verification

(Refer Slide Time: )

Binary Multipliers. Reading: Study Chapter 3. The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Summary of Last Lectures

Introduction to Digital Logic

ECEN 248: INTRODUCTION TO DIGITAL SYSTEMS DESIGN. Week 9 Dr. Srinivas Shakkottai Dept. of Electrical and Computer Engineering

Primitive Recursive Presentations of Transducers and their Products

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Parallel Prefix Sums In An Embedded Hardware Description Language

Concrete models and tight upper/lower bounds

Novel Bit Adder Using Arithmetic Logic Unit of QCA Technology

VLSI Signal Processing

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

Lecture 1 : Data Compression and Entropy

Sum of Squares. Defining Functions. Closed-Form Expression for SQ(n)

Automatic Code Generation

EXAM. CS331 Compiler Design Spring Please read all instructions, including these, carefully

Power-speed Trade-off in Parallel Prefix Circuits

VLSI Design. [Adapted from Rabaey s Digital Integrated Circuits, 2002, J. Rabaey et al.] ECE 4121 VLSI DEsign.1

Relaxation du parallélisme synchrone pour les systèmes asynchrones communicant par buffers : réseaux de Kahn n-synchrones

Notes on generating functions in automata theory

S.Y. Diploma : Sem. III [CO/CM/IF/CD/CW] Digital Techniques s complement 2 s complement 1 s complement

CPE/EE 422/522. Chapter 1 - Review of Logic Design Fundamentals. Dr. Rhonda Kay Gaede UAH. 1.1 Combinational Logic

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Domain-Specific Optimization for Skeleton Programs Involving Neighbor Elements

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

Recurrences COMP 215

First, let's review classical factoring algorithm (again, we will factor N=15 but pick different number)

Mathematical Induction. Rosen Chapter 4.1,4.2 (6 th edition) Rosen Ch. 5.1, 5.2 (7 th edition)

BDD Based Upon Shannon Expansion

Integer Clocks and Local Time Scales

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Class Website:

Floating Point Representation and Digital Logic. Lecture 11 CS301

Letting be a field, e.g., of the real numbers, the complex numbers, the rational numbers, the rational functions W(s) of a complex variable s, etc.

LOWELL WEEKI.Y JOURINAL

Why Functional Programming Matters. John Hughes & Mary Sheeran

Module 10: Sequential Circuit Design

Discrete Mathematics. CS204: Spring, Jong C. Park Computer Science Department KAIST

Where are we? Data Path Design. Bit Slice Design. Bit Slice Design. Bit Slice Plan

DESIGN OF COMPACT REVERSIBLE LOW POWER n-bit BINARY COMPARATOR USING REVERSIBLE GATES

CMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design

Quiz 2 Room 10 Evans Hall, 2:10pm Tuesday April 2 (Open Katz only, Calculators OK, 1hr 20mins)

COMP 103. Lecture 10. Inverter Dynamics: The Quest for Performance. Section 5.4.2, What is this lecture+ about? PERFORMANCE

EECS 141: FALL 05 MIDTERM 1

Advances in Bayesian Network Learning using Integer Programming

Introduction Algorithms Applications MINISAT. Niklas Sörensson Chalmers University of Technology and Göteborg University

Hw 6 due Thursday, Nov 3, 5pm No lab this week

Testability. Shaahin Hessabi. Sharif University of Technology. Adapted from the presentation prepared by book authors.

Transcription:

Finding good prefix networks using Haskell Mary Sheeran (Chalmers) 1

Prefix Given inputs x1, x2, x3 xn Compute x1, x1*x2, x1*x2*x3,, x1*x2* *xn * is an arbitrary associative (but not necessarily commutative) operator 2

Why interesting? Microprocessors contain LOTS of parallel prefix circuits not only binary and FP adders address calculation priority encoding etc. Overall performance depends on making them fast But they should also have low power consumption... Parallel prefix is a good example of a connection pattern for which it is interesting to do better synthesis 3

Serial prefix least most significant 4

Might expect serr _ [a] = [a] serr op (a:b:bs) = a:cs c = op(a,b) cs = serr op (c:bs) *Main> simulate (serr plus) [1..10] [1,3,6,10,15,21,28,36,45,55] But I am going to prefer building blocks that are themselves pp networks 5

type NW a = [a] -> [a] type PN = forall a. NW a -> NW a bser _ [] = [] bser _ [a] = [a] bser op as = ser bop as bop [a,b] = op[c]++[d] [c,d] = op [a,b] When the operator works on a singleton list, it is a buffer (drawn as a white circle) 6

7

Sklansky 32 inputs, depth 5, 80 operators 8

Sklansky 32 inputs, depth 5, 80 operators 9

skl :: PN skl _ [a] = [a] skl op as = init los ++ ros' (los,ros) = (skl op las, skl op ras) ros' = fan op (last los : ros) (las,ras) = halvelist as plusop[a,b] = [a, a+b] *Main> (skl plusop) [1..10] [1,3,6,10,15,21,28,36,45,55] 10

Brent Kung fewer ops, at cost of being deeper. Fanout only 2 11

Ladner Fischer NOT the same as Sklansky; many books and papers are wrong about this 12

Question How do we design fast low power prefix networks? 13

Answer Generalise the above recursive constructions Use dynamic programming good solution to search for a Use Wired to increase accuracy of power and delay estimations 14

BK recursive pattern P is another half size network operating on only the thick wires 15

BK recursive pattern generalised Each S is a serial network like that shown earlier 16

4 2 3 4 This sequence of numbers determines how the outer layer looks 17

wrp ds p comp as = concat rs bs = [bser comp i i <- splits ds as] ps = p comp $ map last (init bs) (q:qs) = mapinit init bs rs = q:[bfan comp (t:u) (t,u) <- zip ps qs] twos 0 = [0] twos 1 = [1] twos n = 2:twos (n-2) bk _ [a] = [a] bk comp as = wrp (twos (length as)) bk comp as

4 2 3 4 So just look at all possibilities for this sequence and for each one find the best possibility for the smaller P Then pick best overall! Dynamic programming 19

Search! need a measure function (e.g. number of operators) Need the idea of a context into which a network (or even just wires) should fit type Context = ([Int],Int) data PPN = Pat PN Fail delf :: NW Int delf [a] = [a+1] delf [a,b] = [m,m+1] m = max a b try :: PN -> Context -> PPN try p (ds,w) = if and [o <= w o <- p delf ds] then Pat p else Fail 20

Need a variant of wrp that can fail, and that makes the crossing over wires explicit (because they might not fit either) wrp2 :: [Int] -> PPN -> PPN -> PPN wrp2 ds (Pat wires) (Pat p) = Pat r r comp as = concat rs bs = [bser comp i i <- splits ds as] qs = wires comp $ concat (mapinit init bs) ps = p comp $ map last (init bs) (q:qs') = splits (mapinit sub1 ds) qs rs = q:[bfan comp (t:u) (t,u) <- zip ps qs'] wrp2 _ = Fail 21

parpre f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 22

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) f1 is the measure function being prefix optimised f = memo for pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 23

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo g is max pm width of small F networks. Controls fanout. pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 24

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston use memoisation f).dropfail) to avoid [wrpc expensive ds (prefix recomputation f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 25

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston base case: is f).dropfail) single wire [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 26

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is Fail if it is simply impossible to fit a prefix network in the available depth wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 27

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) Generate candidate sequences bs = [bser delf i i <- splits ds is] ns = map last Here (init is bs) the cleverness is ts = concat (mapinit init bs) I keep them almost sorted 28

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is For each candidate sequence: wrpc ds p = wrp2 ds Build (trywire the resulting (ts,w-1)) network (p (ns,w-1)) ( call of (prefix f) gives the bs = [bser delf best i network i <- for splits the recursive ds is] call ns = map last (init inside) bs) ts = concat (mapinit init bs) 29

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) Figures out the contexts for the wires and the call of p in prefix f = memo pm a call of wrp2 pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds (trywire (ts,w-1)) (p (ns,w-1)) bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 30

wso f1 g ctx = getans (error "no fit") (prefix f1 ctx) prefix f = memo pm pm ([i],w) = trywire ([i],w) pm (is,w) 2^maxd(is,w) < length is = Fail pm (is,w) = ((beston is f).dropfail) [wrpc ds (prefix f) ds <- topds g h lis] h = maxd(is,w) lis = length is wrpc ds p = wrp2 ds Finally, (trywire pick the (ts,w-1)) best among (p (ns,w-1)) all these candidates bs = [bser delf i i <- splits ds is] ns = map last (init bs) ts = concat (mapinit init bs) 31

Result when minimising number of ops, depth 6, 33 inputs, fanout 7 This network is Depth Size Optimal (DSO) depth + number of ops = 2(number of inputs)-2 (known to be smallest possible no. ops for given depth, inputs) 6 + 58 = 2*33 2 BUT we need to move away from DSO networks to get shallow networks with more than 33 inputs 32

A further generalisation 33

Result When minimising no. of ops: gives same as Ladner Fischer for 2^n inputs, depth n, considerably fewer ops and lower fanout else (non power of 2 or not min. depth) Promising power and speed when netlists given to Design Compiler 34

Result (more real) Use Wired, a system for low level wire-aware hardware design developed by Emil Axelsson at Chalmers To link to Wired, need slightly fancier context since physical position is important Can minimise for (accurately estimated) speed in P1 and for power in P2 (two measure functions) 35

Link to Wired allows more accurate estimates. Can then explore design space 36

Can also export to Cadence SoC Encounter Need to do more to make realistic circuits (buffering of long wires, sizing of cells) 37

And the search space gets even larger if one allows operators with more than 2 inputs. So there is more fun to be had. 38

Conclusion Search based on recursive decomposition gives promising results Need to look at lazy dynamic programming Need to do some theory about optimality (taking into account fanout) Will try to apply similar ideas in data parallel programming on GPU ( scan is also important) 39