Loop Scheduling and Software Pipelining \course\cpeg421-08s\topic-7.ppt 1

Similar documents
Announcements PA2 due Friday Midterm is Wednesday next week, in class, one week from today

Loop Interchange. Loop Transformations. Taxonomy. do I = 1, N do J = 1, N S 1 A(I,J) = A(I-1,J) + 1 enddo enddo. Loop unrolling.


Advanced Restructuring Compilers. Advanced Topics Spring 2009 Prof. Robert van Engelen

Saturday, April 23, Dependence Analysis

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

DSP Design Lecture 7. Unfolding cont. & Folding. Dr. Fredrik Edman.

Topic-I-C Dataflow Analysis

Dependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden

Pipelining. Traditional Execution. CS 365 Lecture 12 Prof. Yih Huang. add ld beq CS CS 365 2

Loop Parallelization Techniques and dependence analysis

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Energy-efficient scheduling

CSCI-564 Advanced Computer Architecture

Compiler Optimisation

The Data-Dependence graph of Adjoint Codes

Multiprocessor Scheduling I: Partitioned Scheduling. LS 12, TU Dortmund

ENEE350 Lecture Notes-Weeks 14 and 15

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Special Nodes for Interface

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Data Structures in Java

CS 700: Quantitative Methods & Experimental Design in Computer Science

Evaluation and Validation

Introduction to Computer Science and Programming for Astronomers

Outline. policies for the first part. with some potential answers... MCS 260 Lecture 10.0 Introduction to Computer Science Jan Verschelde, 9 July 2014

COVER SHEET: Problem#: Points

Lecture 10: Big-Oh. Doina Precup With many thanks to Prakash Panagaden and Mathieu Blanchette. January 27, 2014

Register Allocation. Maryam Siahbani CMPT 379 4/5/2016 1

VLSI Signal Processing

Classes of data dependence. Dependence analysis. Flow dependence (True dependence)

Dependence analysis. However, it is often necessary to gather additional information to determine the correctness of a particular transformation.

Advanced Compiler Construction

Topic 17. Analysis of Algorithms

EE382V: System-on-a-Chip (SoC) Design

Divisible Load Scheduling

Real-Time Workload Models with Efficient Analysis

EE382V-ICS: System-on-a-Chip (SoC) Design

Algorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University

ICS 233 Computer Architecture & Assembly Language

Analysis of Algorithms [Reading: CLRS 2.2, 3] Laura Toma, csci2200, Bowdoin College

' $ Dependence Analysis & % 1

8-1: Backpropagation Prof. J.C. Kao, UCLA. Backpropagation. Chain rule for the derivatives Backpropagation graphs Examples

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Enumerate all possible assignments and take the An algorithm is a well-defined computational

A Simple Architectural Enhancement for Fast and Flexible Elliptic Curve Cryptography over Binary Finite Fields GF(2 m )

CS 4407 Algorithms Lecture 3: Iterative and Divide and Conquer Algorithms

Data Dependences and Parallelization. Stanford University CS243 Winter 2006 Wei Li 1

Automatic Loop Interchange

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Clock-driven scheduling

Algorithms. Outline! Approximation Algorithms. The class APX. The intelligence behind the hardware. ! Based on

Fall 2008 CSE Qualifying Exam. September 13, 2008

Digital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.

Digital Design. Register Transfer Specification And Design

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions

Andrew Morton University of Waterloo Canada

Lecture: Pipelining Basics

Professor Fearing EECS150/Problem Set Solution Fall 2013 Due at 10 am, Thu. Oct. 3 (homework box under stairs)

A point p is said to be dominated by point q if p.x=q.x&p.y=q.y 2 true. In RAM computation model, RAM stands for Random Access Model.

Datapath Component Tradeoffs

A Novel Congruent Organizational Design Methodology Using Group Technology and a Nested Genetic Algorithm

Compiling Techniques

Optimization Techniques for Parallel Code 1. Parallel programming models

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms

CS 161 Summer 2009 Homework #2 Sample Solutions

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

Asymptotic Running Time of Algorithms

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

Homework #2: assignments/ps2/ps2.htm Due Thursday, March 7th.

CIS 4930/6930: Principles of Cyber-Physical Systems

/ : Computer Architecture and Design

LESSON 7.2 FACTORING POLYNOMIALS II

Fall 2011 Prof. Hyesoon Kim

Digital Logic Design: a rigorous approach c

Real-time Scheduling of Periodic Tasks (2) Advanced Operating Systems Lecture 3

Bin packing and scheduling

Static Program Analysis

Timing Driven Power Gating in High-Level Synthesis

Performance, Power & Energy

ECE 5775 (Fall 17) High-Level Digital Design Automation. Scheduling: Exact Methods

Module 1: Analyzing the Efficiency of Algorithms

Analysis of clocked sequential networks

Analysis of Algorithm Efficiency. Dr. Yingwu Zhu

COMP 515: Advanced Compilation for Vector and Parallel Processors. Vivek Sarkar Department of Computer Science Rice University

Algorithms and Programming I. Lecture#1 Spring 2015

Exam Spring Embedded Systems. Prof. L. Thiele

CMP N 301 Computer Architecture. Appendix C

Data-Flow Analysis. Compiler Design CSE 504. Preliminaries

Quantum Computing Approach to V&V of Complex Systems Overview

Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2

Math Models of OR: Branch-and-Bound

Hardware Acceleration of the Tate Pairing in Characteristic Three

Section Summary. Sequences. Recurrence Relations. Summations Special Integer Sequences (optional)

Hardware Design I Chap. 4 Representative combinational logic

Real-time Systems: Scheduling Periodic Tasks

Discrete-Time Systems

Transcription:

Loop Scheduling and Software Pipelining 2008-04-24 \course\cpeg421-08s\topic-7.ppt 1

Reading List Slides: Topic 7 and 7a Other papers as assigned in class or homework: 2008-04-24 \course\cpeg421-08s\topic-7.ppt 2

ABET Outcome Ability to apply knowledge of basic code generation techniques, e.g. Loop scheduling e.g. software pipelining techniques to solve code generation problems. An ability to identify, formulate and solve loops scheduling problems using software pipelining techniques Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness. Ability to use a modern compiler development platform and tools for the practice of above. A Knowledge on contemporary issues on this topic. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 3

Outline Brief overview Problem formulation of the modulo scheduling problem Solution methods Summary 2008-04-24 \course\cpeg421-08s\topic-7.ppt 4

General Compiler Framework Source ME Inter-Procedural Optimization (IPA) Loop Nest Optimization (LNO) Global Optimization (OPT) Innermost Loop scheduling BE/CG Global inst scheduling Reg alloc Local inst scheduling Arch Models Good IPO Good LNO Good global optimization Good integration of IPO/LNO/OPT Smooth information passing between FE and CG Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation Executable 2008-04-24 \course\cpeg421-08s\topic-7.ppt 5

Questions? How to formulate the loop scheduling problem? How to model it? How to solve it? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 6

Questions (cont d) Instruction scheduling for code without loops a review Dependence graphs may become cyclic! So, critical path length is less obvious! Is it becoming harder!(?) What new insights are required to formulate and solve it? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 7

Challenges of Loop Scheduling A DDG With Cycles Right strategy? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 8

Observations Execution of good loops tend to be regular and repetitive - a pattern may appear This gives cyclic scheduling problem a new twist! How to efficiently derive a pattern? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 9

Problem Formulation (I) Given a weighted dependence graph, derive a schedule which is time-optimal under a machine model M. Def: A schedule S of a loop L is time-optimal if among all legal schedules of L, no other schedule that is faster than S. Note: There may be more than one time-optimal schedule. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 10

A Short Tour on Data Dependence Graphs for Loops 2008-04-24 \course\cpeg421-08s\topic-7.ppt 11

Basic Concept and Motivation Data dependence between 2 accesses The same memory location Exist an execution path between them One of them is a write Three types of data dependence Dependence graphs Things are not simple when dealing with loops 2008-04-24 \course\cpeg421-08s\topic-7.ppt 12

Types of Data Dependence Flow dependence X = = X... δ Anti-dependence X = = X... -- δ -1 Output dependence X = X =... 0 δ 2008-04-24 \course\cpeg421-08s\topic-7.ppt 13

Data Dependence Example 1: S1: A = 0 S2: B = A S3: C = A + D S4: D = 2 S1 S2 S3 S4 S x δ S y S y depends on S x 2008-04-24 \course\cpeg421-08s\topic-7.ppt 14

Data Dependence Con d Example 2: S1: A = 0 S2: B = A S3: A = B + 1 S4: C = A S1 S2 S3 S4 S 1 S 2 δ δ 0 S 3 : Output-dep -1 S 3 : anti-dep 2008-04-24 \course\cpeg421-08s\topic-7.ppt 15

Should we consider input dependence? = X = X Is the reading of the same X important? Well, it may be! (if we intend to group the 2 reads together for cache optimization!) 2008-04-24 \course\cpeg421-08s\topic-7.ppt 16

Subscript Variables Extension of def-use chains to employ a more precise treatment of arrays, especially in iterative loops. DO I = 1, N A(I + 1) = X(I + 1) + B(I) X(I) = A(I) * 5 ENDDO 2008-04-24 \course\cpeg421-08s\topic-7.ppt 17

Dependence Graph Con d Applications - register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization 2008-04-24 \course\cpeg421-08s\topic-7.ppt 18

Data Dependence in Loops An Example Find the dependence relations due to the array X in the program below: (1) for I = 2 to 9 do (2) X[I] = Y[I] + Z[I] (3) A[I] = X[I-1] + 1 (4) end for Solution To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others: I = 2 I = 3 I = 4 (2) X[2]=Y[2]+Z[2] (3) A[2]=X[1]+1 X[3] =Y[3]+Z[3] A[3] =X[2]+1 X[4]=Y[4]+Z[4] A[4]=X[3]+1 2008-04-24 \course\cpeg421-08s\topic-7.ppt 19

Data Dependence in Loops Con d In our example, there is a loop-carried, lexically forward flow dependence relation. S 2 S 3 (1) Dependence distance = 1 Data dependence graph for statements in a loop. - Loop-carried vs loop-independent - Lexical-forward vs lexical backward 2008-04-24 \course\cpeg421-08s\topic-7.ppt 20

An Example for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; a b time i = 0 i = 1 i = 2 0 a end; 1 b Note: We use a token here to represent a flow dependence of distance = 1 II time 2 c a 3 b 4 c a So, iteration interval II = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 21

Software Pipeline: Concept Software Pipeline is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources. stf fadds ldf cmp bg sub 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 time 2008-04-24 \course\cpeg421-08s\topic-7.ppt 22

The Structrure of the SWP code % prologue a[0] = a[-1] + R[0]; % pattern for i = 0 to N-2 do b[i] = a[i] + c[i -1]; a[i + 1] = a[i] + R[i + 1]; c[i] = b[i] + 1; end; % epilogue b[n - 1] = a[n - 1] + c[n - 2]; c[n - 1] = b[n - 1] + 1 prologue b i, c i, a i+1 epilogue 2008-04-24 \course\cpeg421-08s\topic-7.ppt 23

Software Pipeline (Cont d) What limits the speed of a loop? Data dependencies: recurrence initiation interval (rec_mii) Processor resources: resource initiation interval (res_mii) Initiation interval stf fadds ldf cmp bg sub 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 time 2008-04-24 \course\cpeg421-08s\topic-7.ppt 24

Previous Approaches Approach I (Operational): Emulate the loop execution under the machine model and a pattern will eventually occur [AikenNic88, EbciogluNic89, GaoEtAl91] Approach II (Periodic scheduling): Specify the scheduling problem into a periodical scheduling problem and find optimal solution [Lam88, RauEtAl81,GovindAltmanGao94] 2008-04-24 \course\cpeg421-08s\topic-7.ppt 25

Periodic Schedule (Modulo Scheduling) The time (cycle) when the I-th instance of the operation v is scheduled : t(i, v) = T * i+ A[v] where T = II so t(i + 1, v) -t(i, v) = T*(i + 1) T*(i) = T For our example: t(i, v) = 2i + A[v] where A(a) = 1 A(b) = 0 A(c) = 1 Question: Is this an optimal schedule? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 26

Periodic Schedule (cont d) Yes, the schedule t(i, v) = 2i + A[v] is time-optimal! With II = 2 2008-04-24 \course\cpeg421-08s\topic-7.ppt 27

Restate the problem Given a DDG of a loop L, how to determine the fastest computation rate of L -- also called minimum initiation interval (MII)? Hint: Consider the Critical Cycles as well as critical resource usage in L 2008-04-24 \course\cpeg421-08s\topic-7.ppt 28

Recurrence MII -- RecMII An Example 2008-04-24 \course\cpeg421-08s\topic-7.ppt 29

An Example (Revisit) : RecMII for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; end; a b II time time i = 0 i = 1 i = 2 0 a 1 b 2 c a 3 b 4 c a So, RecMII = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 30

Hint: now one must think about deadlines. Maximum Computation Rate Theorem: The maximum computation rate of a loop is bounded by the following ratio r opt (C) = min{ Dc/Wc} where C is a dependence cycle, Dc is the total dependence distance along C and Wc is total execution time of C. i.e. Dc Wc = c = c di wi (d i is the dependence distance along the edge i in C) (w i is the edge weight along the edge i in C) 2008-04-24 \course\cpeg421-08s\topic-7.ppt 31

RecMII (Contn d) And the optimal period MII = T opt = 1/r opt Def: A cycle is critical if the period of the cycle equal to MII (We should write it as RecMII) Note: a loop may have multiple critical cycles! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 32

An Example (Revisit) : RecMII for i = 0 to N - 1 do a: a[i] = a[i - 1] + R[i]; b: b[i] = a[i] + c[i - 1]; c: c[i] = b[i] + 1; end; a b II time time i = 0 i = 1 i = 2 0 a 1 b 2 c a 3 b 4 c a So, RecMII = 2 iterations i = 3... c Assume each operation takes 1 cycle and there is only one addition unit! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 33

How About Machine Resource and Register Constraints? Numbers and types of FUs, etc, (ResMII) Number of registers 2008-04-24 \course\cpeg421-08s\topic-7.ppt 34

Software Pipelining Review of previous work [RauFisher93] [Rau94] Minimum initiation interval [MII] MII = max {RecMII, ResMII} where RecMII: determined by critical recurrence cycles in DDG ResMII: determined by resource constraints (#s and types of FUs) 2008-04-24 \course\cpeg421-08s\topic-7.ppt 35

Consider a simple example of n nodes and 1 FU -- what is ResMII? The Resource Constraint MII (ResMII) Can be calculated by totaling, for each resource, the usage requirement imposed by one iteration of the loop can be done by bin-packing a resource reservation table (expensive) usually derive a lower-bound is enough as a starting point to begin the search process More price with complex hardware pipelines? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 36

Example 1: Reservation Table - An Example 1 Stage 0 1 2 3 4 5 3 1 2 X X X X X 2 3 X (a) a Pipeline M (b) The Reservation Table of M 2008-04-24 \course\cpeg421-08s\topic-7.ppt 37

A Quiz? What is the RT of a fully pipelined adder with 3 stages? How about an unpipelined adder? How to computer ResMII for each case? 2008-04-24 \course\cpeg421-08s\topic-7.ppt 38

Modulo Reservation Table The modulo reservation table only has length = II Each entry in the table records a sequence of reservations corresponds a sequence of slots every II cycles Hints: think about your weekly calendar 2008-04-24 \course\cpeg421-08s\topic-7.ppt 39

Modulo Scheduling MII II = II + 1 Choose Next Operation X failed try to place x in a time slot i in II A legal schedule is found succeed Remove from L is L =<>? No 2008-04-24 \course\cpeg421-08s\topic-7.ppt 40

Heuristic Method for Modulo Scheduling Why a simple variant list scheduling may not work? Hint: consider the deadline constraints of operations in a cycle. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 41

Counter Example I: List Scheduling May Fail! (a) [1] A 4 C 2 B 2 D 4 (b) MII = RecMII = 4 Note: if simple list scheduling is used B cannot be scheduled due to the deadline set by scheduling C: deadlock! 0 1 2 3 A B C D (c) MII = RecMII = 4 Note: we cannot fire C as early as possible! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 42 0 1 2 3 A B C D

Example I (Cont d) In previous figure, We show an example demonstrating the problem with greedy scheduling in the presence of recurrences. (a) The data dependence graph with a cycle. (b) The resulting partial schedule when C is scheduled greedily. B cannot be scheduled. (c) The resulting valid schedule when C is not scheduled greedyly delayed to two cycles later. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 43

Example 2: Problems with Greedy schedule due to ResMII A1 C2 A3 A4 M5 2 1 2 2 3 M6 3 0 1 2 3 4 5 Adder Mult Bus A1 M6 0 1 A4 C2 2 A3 3 4 5 Adder Mult Bus A1 M6 C2 A4 A3 M5 ResMII = 6 ResMII = 6 (a) DDG A: non-pipelined adders M: non-pipelined multipliers (b) A greedy schedule which (c) a non-greedy schedule cannot schedule A4 with which achieves ResMII ResMII 2008-04-24 \course\cpeg421-08s\topic-7.ppt 44

Example 2 (Cont d) In previous figure, We show an example demonstrating the problem with greedy scheduling in the presence of complex reservation tables. (a) The data dependence graph without cycle. (b) The resulting partial schedule when A1, M6, C2, and A3 are scheduled greedily. A4 cannot be scheduled. (c) The resulting valid schedule when A3 is scheduled one cycle later. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 45

Example 3: infeasibility of MII Con d A1 2 M2 3 MA3 5 Adder Mult Bus Adder Mult Bus Adder Mult Bus Note: ResMII = 2. But is there a legal schedule under II = 2? Note: The presence of complex reservation tables 2008-04-24 \course\cpeg421-08s\topic-7.ppt 46 (a)

Example 3 (cont d) Adder Mult Bus A1 M2 M2 Adder Mult Bus A1 M2 A1 MA3 A1 MA3 M2 MII = 2 II = 2 (b) cannot fit MA3 under II=2 MII = 2 II = 3 (c) A feasible schedule for II=3 Note: It is possible that there is no valid schedule at MII! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 47

Infeasibility of MII Con d The previous slide shows an example demonstrating the infeasibility of the MII in the presence of complex reservation tables. (a) The three operations and their reservation tables. (b) The MRT corresponding to the dead-end partial schedule for an II of 2 after A1 and M2 have been scheduled. (c) The MRT corresponding to a valid schedule for an II of 3. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 48

Example 4: Infeasibility of MII Assume: fully pipelined adders A1 A2 2 2 A3 2 Adder Mult Bus 2 A3 2 [2] A4 2 2 A4 2 ResMII = 4 RecMII = 4 (a) Note: It is possible that there is no valid schedule at MII! And, the reservation table here is simple! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 49 (b)

Example 4 (cont d) The previous slides shows an example demonstrating the infeasibility of the MII due to the interaction between the recurrence constraints and the resource usage constraints. (a) The data dependence graph. (b) The MRT corresponding to the dead-end partial schedule for an II of 4 after A1 and A2 have been scheduled. 2008-04-24 \course\cpeg421-08s\topic-7.ppt 50

How to derive a best feasible schedule? It is possible do so via exhaustive search. But, it is expensive! 2008-04-24 \course\cpeg421-08s\topic-7.ppt 51

Software Pipelining A Taxonomy of Software Pipelining Operational Approach Periodic Scheduling (Modulo Scheduling) FSA Based Method (Model hardware pipeline with sharing and hazards) Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.) Formal Model (GaoWonNin91) PLDI'91 In-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAI92, Huff93, DehnertTow93, Rau94, WanEis93, StouchininEtAI00) Basic Formulation (DongenGao92) CONPAR'92 Register Optimal (Ninggao91, NingGao93, Ning93) POPL'93 Exact Method ILP based Resource Constrained (GovindAITGao94) Micro27 "Showdown" (RuttenbergGao StouchininWoody96) PLDI'96 Exhaustive Search EuroPar'96 FSA Co-Scheduling formulation (GovindAltmanGao'96) FSA Construction Method (GovindAltmanGao98) FSA Heuristic/optimization (ZhangGovindRyanGao99) Theory of Co-Scheduling (GovindAltmanGao00) Resource & Register (GovindAITGao95, Altman95, PLDI'95 EichenbergerDav95) (Altman95) 2008-04-24 \course\cpeg421-08s\topic-7.ppt 52

Advanced Topics Consider register constraints Realistic pipeline architecture constraints (with structural hazards) Loop body with conditionals Multi-Dimensional Loops 2008-04-24 \course\cpeg421-08s\topic-7.ppt 53