ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018



2 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for Dependences l Stalls and Hazards l Realistic Constraints l Reservation Tables l Collision Vector l Vertical Expansion l Horizontal Expansion l IBM Measurements l Summary l Bibliography 2

3 Scheduling on Pipelined Architecture l Pipelining: ancient computer system design method for accelerated processor execution, originally on large mainframes, and long since on μp l Pipelining improves performance not by adding HW, but by separating individual HW modules of a uniprocessor (UP) architecture l Instead of designing one composite, complex piece of HW for a CPU, the architect for a pipelined μp designs a sequence of simpler and thus faster, consecutive modules l Ideally all i HW modules m i would be of similar complexity and have similar timing needs l These separate modules m i are each significantly simpler than the original composite, and execute overlapped, simultaneously progressing on more than one machine instruction at any one time 3

4 Scheduling on Pipelined Architecture l Instead of executing one complex instruction in one longer cycle required for some complex step, a pipelined architecture executes a sequence of multiple, simpler, faster, single-cycle sub-instructions l Each simple sub-instruction thus is faster to execute than any complete and complex single-instruction l Such single-cycle, pipelined sub-instructions are initiated once per short clock-cycle l Each instruction then progresses to completion while migrating through the various stages of separate hardware modules m i, called the pipeline l That pipeline takes cycles to fill (AKA to prime) and to flush; that is overhead cost 4

5 Scheduling on Pipelined Architecture Left: Traditional Hardware Architecture Right: equivalent Pipelined Architecture [diagram: both designs use the stages I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store; the traditional design runs them as one composite unit per instruction, while the pipelined design repeats the stage sequence, overlapped, for successive instructions] 5

6 Idealized Pipeline l Each Arithmetic Logic Unit (ALU) operation is broken into i separate, natural, sequential modules m i l Each of which can be initiated once per cycle, but with a way shorter, pipelined clock cycle than the original, complex instruction with a slow clock l Each module m i is replicated in HW just once, like on a regular UP that is not pipelined l Note exceptions, when some module is used more than once by the original, non-pipelined instruction: OK to duplicate l Example: normalize operation in FP instruction l Or use of multiple FP-add in FP-multiply operations l Multiple modules operate in parallel on different instructions, at different stages of each instruction 6

7 Idealized Pipeline l Ideally and simplistically, all modules require unit time (1 cycle)! Ideal and simplistic only! l Ideally, all operations (fp-add, divide, fetch, store, increment by integer 1, etc.) require the same number of i steps to completion l But they do not! E.g. fp-divide takes way longer than, say, an integer increment, or a no-op l Differing numbers of cycles per instruction do cause different terminations l Operations may abort in intermediate stages, e.g. in case of a pipeline hazard, caused by: branch, call, return, conditional branch, exception l A pipelined operation also must stall in case of operand dependence 7

8 5-Stage Pipeline 8

9 5-Stage Pipeline, Alternate 9

10 Super Terms l Supercomputer: Main frame with very high clock rate. Clock may be artificially increased, enabled by thermal cooling of the processor. May be uni- or multi-processor l Superscalar: Uni-processor architecture that has some arithmetic-logical units replicated. HW can detect whether 2 sequential instructions happen to be independent, e.g. the output of the 1st is not an input of the 2nd. And if HW resources are available, such sequential instructions 1 and 2 can be executed simultaneously. Is still a scalar, uni-processor architecture, with some HW replication l Super Pipelined: Uni-processor that is pipelined, but the number of stages is unusually high, typically above a dozen stages. E.g. Intel Willamette architecture is superpipelined. Numerous stages create liability, as branch prediction must be exceedingly successful, lest pipes hold hazards 10

11 Superscalar 5-Stage Pipeline 11

12 Superpipelined 12

13 Goal of Scheduling, Obstacles to Scheduling 13

14 Goal of Scheduling l Goal: instruction completion at a rate way faster than would be possible without pipelining l Ideally CPI = 1, number of Cycles Per Instruction l Program completion time on pipelined architecture is shorter than on non-pipelined architecture, achieved by having separate hardware modules progress in parallel on multiple instructions at the same time l Pipelined instructions are retired in original, sequential order, or semantically equivalent order l Stalls and hazards must be minimized: via branch prediction l HW resolves dependence conflicts (hazards) via interlocking whenever branch prediction fails 14

15 Causes for Dependences l A load into register r i in one instruction, followed by use of register r i : True Dependence, AKA Flow Dependence l Load into register r i in one instruction (definition), followed by use of any register (if HW fails to check register id; e.g. early HP PA); no longer an issue on contemporary processors l Definition of register r i in one instruction (other than a load), followed by use of register r i ; AKA True Dependence l Store into memory followed by a load from memory; unless the memory subsystem checks whether the load comes from the same address as the earlier store; if not, no need to wait for store completion l So done in PCI-X protocols (HW protocol: Peripheral Component Interconnect, Extended) 15

16 Causes for Dependences l Use of register r i in one instruction, followed by load into that same register r i : Anti Dependence l Load into register r i in one instruction, followed by load into register r i in a later instruction with use of register r i in between: Output Dependence l We'll learn that both of these latter are false dependences l False dependences yield design opportunities for the digital HW system designer! 16
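One such design opportunity (not named on this slide, but the standard technique) is register renaming: every new definition is given a fresh physical register, which makes anti and output dependences disappear. A minimal Python sketch, with a hypothetical register encoding:

```python
def rename(prog, n_logical):
    """prog: list of (dest, sources) over logical registers 0..n_logical-1.
    Returns the same program over physical registers, one fresh register
    per definition -- this removes anti and output dependences."""
    mapping = list(range(n_logical))   # logical -> current physical
    next_phys = n_logical
    out = []
    for dest, srcs in prog:
        srcs_p = [mapping[s] for s in srcs]   # read the current mappings
        mapping[dest] = next_phys             # fresh register per definition
        next_phys += 1
        out.append((mapping[dest], srcs_p))
    return out

# r1 = f(r0); r0 = g(r2): an anti-dependence on r0. After renaming, the
# second definition writes a fresh register, so the two may overlap.
print(rename([(1, [0]), (0, [2])], 3))  # -> [(3, [0]), (4, [2])]
```

True (flow) dependences survive renaming unchanged; only the false ones vanish.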

17 Basic Block l Basic Block (BB) is CS technical term l BB is sequence of 1 instruction or more, with one single entry point, and one single exit point l Entry point can be fall through from previous instruction, or explicit branch to that instruction l Exit Point can be explicit branch away from here, or the following instruction may be an entry point (i.e. may be target of some branch) 17

18 Basic Block: Find Dependences
-- result: is left operand after opcode, except for st
-- other operands, if any, are sources
-- Mxx addresses Memory at xx, implies indirection for ld
-- Parens () in (Mxx) render indirection explicit
-- 8(sp) means indirect through sp register, offset by 8
-- #4 stands for literal value 4, decimal
1 ld  r2, (M0)
2 add sp, r2, #12
3 st  r0, (M1)
4 ld  r3, -4(sp)
5 ld  r4, -8(sp)
6 add sp, sp, #4
7 st  r2, 0(sp)
8 ld  r5, (M2)
9 add r4, r0, #1
18

19 Basic Block: Find Dependences
1-2 load of a register followed by use of that register
3-4 load from memory at -4(sp) while write to memory in progress (M1)
3-5 load from memory at -8(sp) while write to memory in progress (M1)
2-4, 2-5 define register sp, followed by use of same register; distance sufficient to avoid stall on typical architectures
6-7 define register sp before use; forces sequential execution, reduces pipelining
7-8 store followed by load!
8-9 load into register r5 followed by use of any register (on early simple architectures, e.g. early PA)
19
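The register dependences listed above can be found mechanically. A minimal Python sketch, covering only the register flow (RAW) cases; the memory-ordering pairs 3-4, 3-5, and 7-8 would additionally need address disambiguation, which this sketch does not attempt:

```python
# The Basic Block from the previous slide, encoded as
# (line, opcode, dest, sources); memory operands are modeled as "mem".
prog = [
    (1, "ld",  "r2",  ["mem"]),
    (2, "add", "sp",  ["r2"]),
    (3, "st",  "mem", ["r0"]),
    (4, "ld",  "r3",  ["mem", "sp"]),
    (5, "ld",  "r4",  ["mem", "sp"]),
    (6, "add", "sp",  ["sp"]),
    (7, "st",  "mem", ["r2", "sp"]),
    (8, "ld",  "r5",  ["mem"]),
    (9, "add", "r4",  ["r0"]),
]

def flow_dependences(prog):
    """Find RAW (true/flow) dependences: a register defined by an earlier
    instruction and read by a later one, with no redefinition in between."""
    deps = []
    for i, (li, _, dest_i, _) in enumerate(prog):
        if dest_i == "mem":
            continue
        for lj, _, dest_j, srcs_j in prog[i + 1:]:
            if dest_i in srcs_j:
                deps.append((li, lj))
            if dest_j == dest_i:      # redefinition kills the dependence
                break
    return deps

print(flow_dependences(prog))
# -> [(1, 2), (1, 7), (2, 4), (2, 5), (2, 6), (6, 7)]
```

This recovers the slide's register pairs 1-2, 2-4, 2-5, 6-7, plus the longer-distance uses 1-7 and 2-6.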

20 Stalls and Hazards l Hardware interlock slows down execution due to unavailability of a dependent operand in some instruction l Benefit of slowing down is: correct result! Slowing down AKA interlock l Cost is sequential execution with delay! l Programmer can sometimes re-arrange instructions or insert delays at select places l Compiler can re-schedule instructions, or insert delays l Unless the programmer's or compiler's efforts are provably complete, HW interlock must still be provided 20

21 Stalls and Hazards l CDC 6000 and IBM360/91 already used automatic hardware interlock l Advisable to have compiler re-schedule the instruction sequence, since re-ordering may minimize the number of interlocks needing to occur l Clearly depends on target architecture 21

22 Stalls and Hazards l Not all HW modules are being used exactly once and not only for one single cycle l Some HW modules m i are used more than once in one instruction; e.g. the normalizer in floating-point operations is used repeatedly l Basic Block analysis is insufficient to detect all stalls or hazards; they may span separate Basic Blocks l Reminder Basic Block: sequence of 1 instruction or more with single entry, single exit point; i.e. no intervening branches or branch targets! 22

23 Reservation Tables & Collision Vectors 23

24 Reservation Tables time progresses left-to-right, HW modules indexed m1 through m6

     t1  t2  t3  t4  t5  t6  t7  t8
m6                       i1  i2  i3
m5                   i1  i2  i3
m4               i1  i2  i3
m3           i1  i2  i3
m2       i1  i2  i3
m1   i1  i2  i3

Table 1: Instructions i 1 to i 3 use 6 HW modules m i 24

25 Reservation Tables l Table 1, known as reservation table for pipelined HW, shows ideal schedule using hardware modules m 1 to m 6, required for execution of one instruction l Ideal: because each requires exactly 1 cycle per m i, and each HW module is used exactly once l Always 1 cycle is also unrealistic, yet simple for didactic purposes in class l Time to complete 3 instructions i 1, i 2, and i 3 is 8 cycles, while the time for any single instruction is 6 cycles, net time saving; better than 3 * 6 = 18 cycles! l Completion time for instruction during steady state is 1 cycle in pipelined architecture 25
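The 8-cycle figure follows the usual fill-plus-drain arithmetic: an ideal k-stage pipeline completes n instructions in k + (n - 1) cycles. A quick sketch:

```python
def pipeline_cycles(stages: int, instructions: int) -> int:
    """Total cycles on an ideal pipeline: `stages` cycles to fill (prime)
    for the first instruction, then one completion per cycle."""
    return stages + (instructions - 1)

print(pipeline_cycles(6, 3))   # -> 8, as in Table 1
print(pipeline_cycles(6, 1))   # -> 6: a single instruction is not sped up
# non-pipelined equivalent: 3 * 6 = 18 cycles
```

The gap between 8 and 18 cycles is exactly the steady-state benefit the slide describes; the first 5 cycles are the priming overhead.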

26 Reservation Tables l On a non-pipelined architecture any of these 3 instructions would NOT necessarily take these same 6 cycles of the pipelined machine; perhaps 4 or 5 l Also for fairness sake it is not usual that 3 identical instructions are arranged one after the other l That simplistic model is used here merely to explain Reservation Tables 26

27 Reservation Tables Key learning: pipelined architecture does NOT speed up execution of a single instruction, may even slow it down; but improves throughput of multiple instructions in a row: real benefit: during steady state

     t1  t2  t3  t4  t5  t6  t7  t8
m6                       i1  i2  i3
m5                   i1  i2  i3
m4               i1  i2  i3
m3           i1  i2  i3
m2       i1  i2  i3
m1   i1  i2  i3

Table 1 Repeated: Instructions i 1 to i 3 use 6 HW modules m i 27

28 Reservation Tables l Table 2 below shows a more realistic schedule of HW modules, required by a repeated instruction l In this schedule some modules are used back-to-back; for example m 3 is used 3 times in a row, and m 6 4 times; typical in FP divide l But these cycles are contiguous; and even that is not always a realistic constraint l Instead, a HW module m i may be used at various moments during the execution of a single instruction l The schedule in Table 2 attempts to exploit the greedy approach for instruction i 2, initiated as soon as possible after i 1 l Note: does not always minimize completion time! 28

29 Reservation Tables l We could have let instruction i 2 start at cycle t 4 and no additional delay would be caused for or by m 3 l Or instruction i 2 could start at t 3 with just one additional delay due to m 3 l However, in both cases m 6 would cause a delay later on anyway l To schedule i 2 we must consider these multi-use resources, m 3 and m 6, that are in use continuously l In case of a load, the actual module would wait many more cycles, until the data arrive, but would not progress until then 29

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                               i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                           i1          i2  d2              i3
m4                       i1          i2                  i3
m3           i1  i1  i1  i2  i2  i2          i3  i3  i3
m2       i1  i2  d2  d2                  i3
m1   i1  i2  d   d   d   d   d   d   i3

30 Reservation Tables

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                               i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                           i1          i2  d2              i3
m4                       i1          i2                  i3
m3           i1  i1  i1  i2  i2  i2          i3  i3  i3
m2       i1  i2  d2  d2                  i3
m1   i1  i2  d   d   d   d   d   d   i3

Table 2: Instructions i 1 to i 3 use 6 HW modules m j for 1..4 cycles 30

31 Reservation Tables l Instead of using the single resources m 3 and m 6 repeatedly and continuously, an architect can replicate them as many times as simultaneously needed in some HW operation l Replication costs more hardware, and does not speed up execution of one single instruction l For a single operation, all would still have to progress in sequence l But it avoids delaying the start of subsequent instructions that need the same HW module l See Table 3: shaded areas indicate the duplicated modules 31

32 Reservation Tables

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
m6,4                                          i1  i2  i3
m6,3                                      i1  i2  i3
m6,2                                  i1  i2  i3
m6                                i1  i2  i3
m5                            i1  i2  i3
m4                        i1  i2  i3
m3,3                  i1  i2  i3
m3,2              i1  i2  i3
m3            i1  i2  i3
m2        i1  i2  i3
m1    i1  i2  i3

Table 3: Instructions i 1 to i 3 use replicated HW modules m 3 and m 6 32

33 Reservation Tables l These replicated circuits in Table 3 do not speed up the execution of any individual instruction l But by avoiding the delay for other instructions, a higher degree of parallelism is enabled, and multiple instructions can retire earlier l Even this is unrealistically simplistic l Some of the modules m i are used for more than one cycle, but not necessarily in sequence l Instead, a Reservation Table offers a more realistic representation l Use Reservation Table in Table 4 to figure out, how closely the same instruction can be scheduled back-to-back 33

34 Collision Vector CV l Collision vector identifies, how soon 2 instructions can be scheduled successively, one after the other: l Best case in general: the next identical instruction can be scheduled at the next cycle! l Worst case in general: next instruction must be scheduled n cycles after the start of the first, with first requiring n cycles to complete! l Goal for HW designer: find, how many instructions can be initiated between start and completion? l To analyze this for speed, we use the Reservation Table and Collision Vector (CV) 34

35 Collision Vector CV l Goal: Find CV by overlapping two identical Reservation Tables (e.g. plastic transparencies) within the window of the cycles of one operation l If, after shifting a second, duplicate transparency with i = 1..n-1 time steps, two resource-marks of a row land on the same field, we have a collision: both instructions claim a resource at the same time! l Collision means: the second instruction cannot yet be scheduled. So mark field i in the CV with a 1 l Otherwise mark field i with a 0, or leave blank l Do so n-1 times, and the CV is complete. But do check for all rows, i.e. for all HW modules m j 35

36 Collision Vector CV

     t1  t2  t3  t4  t5  t6  t7
m1   X
m2       X   X
m3               X   X
m4                   X       X
m5                       X

Table 4: Reservation Table for 7-step, 5-Module instruction
Table 5: Find Collision Vector for above instruction
Collision Vector has n-1 entries for some n-cycle instruction 36

37 Collision Vector CV l If a second instruction of the kind shown in Table 4 were initiated 1 cycle after the first, resource m 2 will cause a conflict l Because instruction 2 requires m 2 at cycles 3 and 4 l However, instruction 1 is already using m 2 at cycles 2 and 3 l At step 3 a resource conflict arises l Also resource m 3 would cause a conflict l The good news, however, is that this double-conflict causes no further entry in CV 37

38 Collision Vector CV l Similarly, a new instruction cannot be initiated 2 cycles after the start of the first l This is because a second instruction requires m 4 at cycles t 7 and t 9 l However, instruction 1 is already using m 4 at t 5 and t 7. At step t 7 there would be a conflict l At all other steps a second instruction may be initiated. See the completed CV in Table 6 below:

1 1 0 0 0 0

Table 6: Collision Vector for above 7-cycle, 5-module instruction 38
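The transparency-shifting procedure of slide 35 is easy to mechanize. A hedged Python sketch: the uses of m 2 (cycles 2, 3) and m 4 (cycles 5, 7) come from the text above, while the placements of m 1, m 3, and m 5 are one consistent reading of Table 4, not given explicitly:

```python
def collision_vector(table: dict, n: int) -> list:
    """Shift the reservation table against itself by 1..n-1 cycles;
    entry k is 1 if any module row is claimed twice at the same cycle."""
    cv = []
    for shift in range(1, n):
        clash = any(set(cycles) & {c + shift for c in cycles}
                    for cycles in table.values())
        cv.append(1 if clash else 0)
    return cv

# m2 at t2,t3 and m4 at t5,t7 per the text; m1, m3, m5 placed as one
# consistent reading of Table 4 (an assumption, for illustration)
table4 = {"m1": [1], "m2": [2, 3], "m3": [4, 5], "m4": [5, 7], "m5": [6]}
print(collision_vector(table4, 7))  # -> [1, 1, 0, 0, 0, 0]
```

Shifts 1 and 2 collide (via m 2/m 3 and m 4 respectively), all later shifts are free, matching Table 6.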

39 Reservation Table: Main Example l The next example is an abstract instruction of a hypothetical microprocessor, characterized by Reservation Table 7; it is 7 cycles long, using 4 HW modules m1 to m4 l Analysis:...

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

Table 7: Reservation Table 7 for 7-cycle, 4-module Main Example 39

40 Reservation Table: Main Example l The Collision Vector for the Main Example says: We can start a new instruction of the same kind at step t 6 or t 7 l Of course, we can always start a new instruction, identical or of another type, after the current one has completed; no resource will be in use then l The challenge is to start another one while the current is still executing, to maximize parallel execution

1 1 1 1 0 0

Table 8: Collision Vector for Main Example

l Show, that by adding delays we can sometimes speed up execution of pipelined ops! l That is processor architecture beauty! 40

41 Main Example Pipelined l For Main Example, initiate a second, pipelined instruction Y at step t 6, i.e. 5 cycles after start of X l Greedy Approach to pipeline X and Y as follows:

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1   X       X   X       Y       Y   Y       Z
m2       X               X   Y               Y   Z
m3                   X       X           Y       Y
m4               X                   Y

Table 9: Pipelining 2 Instructions of Main Example

l Observe two-cycle overlap, achievable speed-gain l Starting Y earlier (greedy approach) would create delays, but not retire Y any earlier 41

42 Main Example Pipelined l The 3 rd pipelined instruction Z can start at time step t 11, by which time the first X is retired, the second, named Y, is partly through l The fourth instruction can start at step t 16, etc.

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17
m1   X       X   X       Y       Y   Y       Z       Z   Z
m2       X               X   Y               Y   Z               Z
m3                   X       X           Y       Y           Z       Z
m4               X                   Y                   Z

Table 10: Pipelining 3 Instructions of Main Example 42

43 Main Example Pipelined l Though the Reservation Table, Table 7 for the Main Example, is sparsely populated, a high degree of pipelining is not possible l The maximum overlap is 2 cycles l Can one infer this low degree of pipelining from the Collision Vector alone? l During pipelining we achieve 5 cycles per instruction retirement in the steady state, cpi = 5 l That means 5 cycles per completion of an instruction, assuming the same instruction is executed over and over again, once the steady state is reached!

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

l We'll come back to the Main Example and analyze it after further study of Examples 2 and 3 43

44 Pipeline Example 2 l Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, 24 fields, 6 steps, density = 7/24 ≈ 0.29 l Main Example had 8 entries in 28 fields, density = 8/28 ≈ 0.29 l We'll attempt to pipeline as many identical Example 2 instructions as possible

     t1  t2  t3  t4  t5  t6
m1   X                   X
m2       X           X
m3           X
m4                   X   X

Table 11: Reservation Table for Example 2
Table 12: Collision Vector for Example 2 Figured out by Students 44

45 Pipeline Example 2, With CV l Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, 24 fields, 6 steps, density = 7/24 ≈ 0.29 l Main Example had 8 entries in 28 fields, density = 8/28 ≈ 0.29 l We'll attempt to pipeline as many identical Example 2 instructions as possible

     t1  t2  t3  t4  t5  t6
m1   X                   X
m2       X           X
m3           X
m4                   X   X

Table 11: Reservation Table for Example 2

1 0 1 0 1

Table 12: Collision Vector for Example 2 45

46 Pipeline Example 2 l The Collision Vector suggests to initiate a new pipelined instruction at time t 3, t 5, t 7, etc. l That would allow 3 instructions X, Y, and Z simultaneously, overlapped, pipelined. By step t 7 the first instruction would already be retired

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
m1   X       Y       Z   X   A   Y   B   Z       A       B
m2       X       Y   X   Z   Y   A   Z   B   A       B
m3           X       Y       Z       A       B
m4                   X   X   Y   Y   Z   Z   A   A   B   B

Table 13: Schedule for Pipelining Example 2 46

47 Pipeline Example 2 l Example 2 is lucky to pipeline 3 identical instructions at the same time l Caution: The CV is not a direct indicator. The reader was mildly misled to make inferences that don't strictly follow l However, if all positions in the CV were marked 1, there would be no pipelining l For Example 2 the number of cycles per instruction retirement is an amazing cpi = 2 l Even though the operation density is slightly higher than in the Main Example, the pipelining overlap in Example 2 is significantly higher, which is counter-intuitive! l On to Example 3! 47
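The schedule of Table 13 can be reproduced by a greedy simulator that starts each new instruction at the earliest conflict-free cycle. A sketch; the module/cycle placements are one consistent reading of Table 11:

```python
def greedy_schedule(table, count):
    """Initiate `count` identical instructions; each starts at the earliest
    cycle at which no HW module is claimed twice in any cycle."""
    busy = {m: set() for m in table}          # module -> occupied cycles
    starts, t = [], 1
    for _ in range(count):
        while any(busy[m] & {t - 1 + c for c in table[m]} for m in table):
            t += 1
        for m in table:
            busy[m] |= {t - 1 + c for c in table[m]}
        starts.append(t)
        t += 1
    return starts

# Module usage cycles: one consistent reading of Table 11 (Example 2)
table11 = {"m1": [1, 6], "m2": [2, 5], "m3": [3], "m4": [5, 6]}
print(greedy_schedule(table11, 5))  # -> [1, 3, 5, 7, 9]: one start every 2 cycles, cpi = 2
```

For Example 2 the greedy choice happens to be optimal; the Main Example below shows that greedy initiation is not optimal in general.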

48 Pipeline Example 3 l Interesting to see Example 3, analyzing the Collision Vector, to see how much we can parallelize! l The Reservation Table has numerous resource fields filled, yet the Collision Vector is sparser than the one in Example 2

     t1  t2  t3  t4  t5  t6
m1   X       X       X
m2       X       X       X
m3   X       X       X
m4       X       X       X

Table 14: Reservation Table for Example 3

0 1 0 1 0

Table 15: Collision Vector for Example 3 48

49 Pipeline Example 3 l The Collision Vector (CV) suggests to start a new pipelined instruction 1, 3, or 5 cycles after initiation of the first l CV of Example 3 is less densely packed with 1s than Example 2, where we could overlap 3 identical instructions and get a rate of cpi = 2 l Goal now: find the best cpi rate for Example 3

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1   X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m2       X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1
m3   X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m4       X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1

Table 16: Schedule for Pipelining Example 3 49

50 Pipeline Example 3 l Example 2 earlier with Collision Vector allows a higher degree of pipelining l In Example 3, cpi = 3, every 6 cycles two instructions can retire l Contrast to cpi = 2 of Example 2 l Reason for the lower retirement rate is clear: l All 4 HW modules are used every other cycle by one of two instructions, thus one cannot overlap more than twice! l The non-pipelined cpi rate for Example 3 is cpi = 6, the pipelined rate is cpi = 3 50

51 Vertical Expansion for Example 3 l If we need a higher degree of pipelining for Example 3 with a fill-factor of 0.5, we must pay! Vertically with more hardware, or horizontally with more time for added delays l Let's analyze a vertically expanded Reservation Table now with 8 Modules; every hardware resource m 1 to m 4 replicated once; lower new density = 0.25

      t1  t2  t3  t4  t5  t6
m1    X               X
m2        X               X
m3            X
m4                X
m1,2          X
m2,2              X
m3,2  X               X
m4,2      X               X

Table 17: Reservation Table Example 3 with Replicated HW 51

52 Vertical Expansion for Example 3 l Let us pipeline multiple identical instructions for Reservation Table 17 as densely as possible l With twice the HW, can we overlap perhaps twice as much? The previous rate with half the hardware was cpi = 3. Ideal would be cpi = 1.5; a plausible schedule is shown in Table 18

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1    X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m2        X   Y   Z   A   X   Y   Z   A   X   Y   Z
m3            X   Y   Z   A                   X   Y
m4                X   Y   Z   A                   X
m1,2          X   Y   Z   A                   X   Y
m2,2              X   Y   Z   A                   X
m3,2  X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m4,2      X   Y   Z   A   X   Y   Z   A   X   Y   Z

Table 18: Schedule for Pipelining Example 3 52

53 Vertical Expansion for Example 3 l Initiation and retirement rates are 4 instructions per 8 cycles, cpi = 2 l This is, as suspected, better than the rate of the original Example 3, not surprising with double the hardware modules l But this is not twice as good a retirement rate, despite twice the HW l Original rate was cpi = 3, the improved rate with double the hardware is cpi = 2 53

54 Horizontal Expansion, Main Example l Next case, a variation of the Main Example, shows an expansion of the Reservation Table horizontally; isn't this counter-intuitive? l I.e. delays are built-in; HW modules are kept constant l Only the 4 modules m 1 to m 4 from the Main Example are provided l Motivation of an architect: if a delay can speed up execution, by all means build it in: delays are cheap! l Common sense tells us delays tend to slow down; a real HW architect looks also at counter-intuitive situations 54

55 Horizontal Expansion, Main Example l After Examples 2 and 3, we expand the Main Example, repeated below, by adding delays, AKA Horizontal Expansion l If we insert delay cycles, clearly execution for a single instruction will slow down l However, if this yields a sufficient increase in the overall degree of pipelining, more parallelism, it may still be a win l Building circuits to delay an instruction is low cost l We analyze this variation next: 55

56 Horizontal Expansion, Main Example

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

Table 19: Original Reservation Table for Main Example Inserting a Delay Cycle after t 3, will be new step t 4 56

57 Horizontal Expansion, Main Example l We'll insert delays; but where? A systematic way to compute the optimum position is not shown here l Instead, we'll suggest a sample position for a single delay and analyze the performance impact l Table 20 shows the delay inserted after t 3, new t 4

     t1  t2  t3  t4  t5  t6  t7  t8
m1   X       X       X
m2       X                   X
m3                       X       X
m4                   X

Table 20: Reservation Table for Main Example with 1 Delay, at t 4

0 1 0 1 1 0 0

Table 21: Collision Vector for Main Example with 1 Delay 57

58 Horizontal Expansion, Main Example l The Greedy Approach is to schedule instruction Y as soon as possible, when the CV has a 0 entry l This would lead us to initiate a second instruction Y at time step t 2, one cycle after instruction X. Is this optimal?

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16
m1   X   Y   X   Y   X   Y       Z   A   Z   A   Z   A
m2       X   Y               X   Y   Z   A               Z   A
m3                       X   Y   X   Y               Z   A   Z   A
m4                   X   Y                       Z   A

Table 22: Schedule for Main Example Pipelined Instructions X, Y, Z, A With Delay Slot, Using Greedy Approach 58

59 Horizontal Expansion, Main Example l Initiation and retirement rates are 2 instructions every 7 cycles, or cpi = 3.5; see the purple header at each retired instruction in Table 22 l This is already better than cpi = 5 for the original Main Example without the delay l Hence we have shown that adding delays can speed up throughput of pipelined instructions l But can we do better? l After all, we have only tried out the first sample of a Greedy Approach! l Careful: Greedy smacks of short-sightedness, something EEs have to avoid 59

60 Horizontal Expansion, Main Example l In this experiment we start the second instruction at cycle t 4, three cycles after the start of the first l Which cpi rate shall we get? See Table 23

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18
m1   X       X   Y   X   Y   Z   Y   Z   X   Z   X   Y   X   Y   Z   Y   Z
m2       X           Y       X   Z       Y   X       Z   Y       X   Z
m3                       X       X   Y       Y   Z       Z   X       X   Y
m4                   X           Y           Z           X           Y

Table 23: Another Schedule for Pipelined Main Example, with delay Initiation later than the first opportunity in Table 22 Result: better throughput! Message: Starting later can improve performance! 60

61 Horizontal Expansion, Main Example l Patient schedule of Table 23 completes one identical instruction every 3 cycles in the steady state l Purple cells indicate instruction moment of retirement l X retires at completion t 8, Y after t 11, and Z after t 14 l Then X again after t 17 l Now cpi = 3 with the Not-So-Greedy Approach l Key learning: To speed up pipelined execution, one can sometimes enhance throughput by adding delay circuits, or by replicating hardware, or postponing instruction initiation, or a combination l The greedy approach is not necessarily optimal l The collision vector only states, when one cannot initiate a new instruction (value 1); a 0 value is not a solid hint for initiating a new instruction 61
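The greedy-versus-patient comparison can be checked mechanically. A sketch; the module placements for the delayed Main Example are one consistent reading of Table 20:

```python
def conflict_free(table, starts):
    """True if no HW module is claimed twice in any cycle when identical
    instructions are initiated at the given start cycles."""
    for cycles in table.values():
        seen = set()
        for s in starts:
            uses = {s - 1 + c for c in cycles}
            if seen & uses:
                return False
            seen |= uses
    return True

# Delayed Main Example (8 cycles): one consistent reading of Table 20
table20 = {"m1": [1, 3, 5], "m2": [2, 7], "m3": [6, 8], "m4": [5]}

greedy  = [1, 2, 8, 9, 15, 16]    # pairs 7 cycles apart: cpi = 3.5
patient = [1, 4, 7, 10, 13, 16]   # one start every 3 cycles: cpi = 3
print(conflict_free(table20, greedy), conflict_free(table20, patient))
# -> True True
```

Both schedules are legal, yet the patient one initiates 6 instructions by t 16 versus the greedy schedule's 6 by t 16 only at the cost of long gaps; in steady state patient retires one instruction every 3 cycles instead of two every 7.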

62 IBM Measurements Agerwala and Cocke 1987; see [1] l Memory Bandwidth: l 1 word/cycle to fetch 1 instruction/cycle from I-cache l 40% of instructions are memory-accesses (load-store) l Those could all benefit from access to D-cache l Code Characteristics, dynamic: l ~25% of all instructions: loads l ~15% of all instructions: stores l ~40% of all instructions: ALU/RR l ~20% of all instructions: Branches 1/3 unconditional 1/3 conditional taken 1/3 conditional not taken 62

63 How Can Pipelining Work? l About 1 out of 4 or 5 instructions will be branches l Branches include all transfer of control instructions; these are: call, return, unconditional and conditional branch, abort, exception and similar machine instructions l If the processor pipeline is deeper than, say, 5 stages, there will almost always be a branch in the pipe, rendering several prefetched operations useless l Some processors (e.g. Intel Willamette, [6]) have over 20 stages. For this type of processor, regular pipelining would practically always cause a stall! 63

64 How Can Pipelining Work? l Remedy is branch prediction l If the processor knows dynamically from which address to fetch, instead of blindly assuming the subsequent code address pc+1, this would eliminate most pipeline flushes l Luckily, branch prediction in the 2010s has become > 97% accurate, only rarely causing the need to re-prime the pipe l Also, processors no longer are designed with the deep pipeline of the Willamette, of 20+ stages l Here we see interesting interactions of several computer architecture principles: pipelining and branch prediction, one helping the other to become exceedingly advantageous 64
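This interaction can be quantified with a simple steady-state model; the numbers below are illustrative only (20% branches from the IBM mix, 97% prediction accuracy, and an assumed 20-cycle refill for a deep pipe):

```python
def effective_cpi(base_cpi, branch_frac, accuracy, flush_penalty):
    """Steady-state CPI when each mispredicted branch flushes the pipe:
    base + (branch fraction) * (mispredict rate) * (refill cycles)."""
    return base_cpi + branch_frac * (1 - accuracy) * flush_penalty

# Illustrative: 20% branches, 97% accurate prediction, 20-cycle refill
print(round(effective_cpi(1.0, 0.20, 0.97, 20), 3))  # -> 1.12
```

With only 80% accuracy the same deep pipe would yield a cpi of 1 + 0.2 * 0.2 * 20 = 1.8, which is why deep pipelines demand exceedingly accurate predictors.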

65 Summary l Pipelining can speed up execution l Yet the speedup is not due to the faster clock rate l That fast clock rate manipulates significantly simpler sub-instructions and cannot be equated with original, i.e. non-pipelined clock l Counter-intuitively, pipelining may even benefit from inserting delays (at the right places) l May also benefit from initiating an instruction later than possible l And benefits, not surprisingly, from added HW resources l Branch prediction is a necessary architecture attribute to make pipelining work fast 65

66 Bibliography
1. Cocke and Schwartz, Programming Languages and their Compilers, unpublished, 1969, portal.acm.org/citation.cfm?id=
2. Harold Stone, High Performance Computer Architecture, AW, 1993
3. cpi rate: Cycles_per_instruction
4. Introduction to PCI: articles/computer-science/protocol/introduction-to-pci-protocol/
5. Wiki PCI page: Conventional_PCI
6. NetBurst_(microarchitecture)
7. PCI-X:
66

67 Some Definitions 67

68 Basic Block Definitions l Sequence of one instruction or more with a single entry point and a single exit point l Entry point may be the destination of a branch, a fall through from a conditional branch, or the program entry point; i.e. destination of an OS jump l Exit point may be an unconditional branch instruction, a call, a return, or a fall-through l Fall-through means: one instruction is a conditional flow of control change, and the subsequent instruction is executed by default, if the change in control flow does not take place l Or fall-through can mean: The successor of the exit point is a branch or call target 68

69 Definitions Collision Vector l Observation: An instruction requiring n cycles to completion may be initiated a second time n cycles after the first without possibility of conflict l For each of the n-1 cycles before that, a further instruction of identical type causes a resource conflict, if initiated l The Boolean vector of length n-1 that represents this fact stating whether or not re-issue is possible is referred to as collision vector l It can be derived from the Reservation Table 69

70 Definitions Cycles Per Instruction: cpi l cpi quantifies how long (how many cycles) it takes for a single instruction to execute l Generally, the number of execution cycles per instruction is > 1 on a CISC architecture l However, on a pipelined UP architecture, where a new instruction is initiated each cycle, it is conceivable to reach a cpi rate of 1, assuming no hazards l Note different meanings of cycle! l On a UP pipelined architecture the cpi rate cannot shrink below one l Yet on an MP or superscalar architecture, the cpi rate may be < 1 70

71 Definitions Dependence l If the logic of the underlying program imposes an order between two instructions, there exists a dependence (data or other dependence) between them l Generally, the order of execution cannot be permuted l It is conventional in Computer Engineering to call this dependence, not dependency 71

72 Definitions: Early Pipelined Computers/Processors 1. CDC 6000 series of the late 1960s 2. CDC Cyber series of the 1970s 3. IBM 360/91 series 4. Intel Pentium 4 and Xeon processor families of the late 1990s and early 2000s

73 Definitions: Flushing l When a hazard occurs due to a change in flow of control, the partially executed instructions after the hazard are discarded l This discarding is called flushing l Antonym: priming l Flushing is not needed in the case of a stall caused by dependences; waiting instead will resolve it

74 Definitions: Hazard l Instruction i+1 is pre-fetched under the assumption that it would be executed after instruction i l Yet after decoding i it becomes clear that i is a control-transfer operation l Hence the subsequently pre-fetched instructions i+1 and onward are wasted l This is called a hazard l A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) also causes a delay, but a simple wait will resolve such a stall conflict
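The cost of those wasted prefetches can be sketched with the usual back-of-the-envelope model. All numbers here are invented for illustration, not measurements from the slides:

```python
# Invented-numbers sketch: how flushing on control hazards inflates cpi.
branch_freq = 0.2        # fraction of instructions that transfer control
taken_rate = 0.6         # fraction of those that actually redirect fetch
flush_penalty = 3        # prefetched instructions discarded per redirect

cpi = 1.0 + branch_freq * taken_rate * flush_penalty
print(round(cpi, 2))     # 1.36
```

Even a modest three-instruction flush per taken branch costs this hypothetical machine over a third of a cycle per instruction, which is why branch prediction matters so much on deep pipelines.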

75 Definitions: ILP l Instruction Level Parallelism: an architectural attribute allowing multiple instructions to be executed at the same time l Related: Superscalar

76 Definitions: Interlock l If HW detects a conflict during execution of instructions i and j, with i initiated earlier, such a conflict, called a stall, delays execution of j and perhaps subsequent instructions l Interlock is the architecture's way to respond to and resolve a stall, at the expense of degraded performance l Advantage: computation of the correct result! l Synonym: stall or wait

77 Definitions: IPC l Instructions Per Cycle: a measure of Instruction Level Parallelism. How many different instructions are being executed (not necessarily to completion) during one single cycle? l It is desirable to have an IPC rate > 1 l Ideally, given suitable parallelism, IPC >> 1 l On conventional, non-pipelined UP CISC architectures it is typical to have IPC << 1
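The three regimes named above (IPC << 1, IPC near 1, IPC > 1) can be illustrated with invented instruction and cycle counts; IPC is simply retired instructions divided by elapsed cycles:

```python
# Invented-numbers sketch: IPC = instructions retired / cycles elapsed.
measurements = {
    "non-pipelined CISC UP": (1000, 4000),   # multi-cycle instructions: IPC << 1
    "scalar pipelined UP":   (1000, 1100),   # stalls keep IPC just under 1
    "2-wide superscalar":    (1000, 650),    # multiple issues per cycle: IPC > 1
}
for name, (retired, cycles) in measurements.items():
    print(f"{name}: IPC = {retired / cycles:.2f}")
```

Note that IPC is the reciprocal of cpi, so the superscalar row above corresponds to a cpi below 1.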

78 Definitions: Pipelining l A mode of execution in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete l Highly pipelined Xeon processors, for example, have a greater than 20-stage pipeline
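The fill/drain arithmetic behind this definition is worth making explicit. As my own illustration of the standard model: a k-stage pipeline needs k cycles to prime, after which one instruction retires per cycle, so n instructions need k + (n - 1) cycles rather than the k * n an unpipelined machine of the same depth would need.

```python
# Sketch of the standard pipeline timing model: k cycles to prime,
# then one retirement per cycle.
def total_cycles(k_stages, n_instructions):
    return k_stages + (n_instructions - 1)

print(total_cycles(20, 1))      # 20: one instruction still pays the full depth
print(total_cycles(20, 1000))   # 1019: amortized cost approaches 1 cycle each
```

This is also why the priming and flushing overhead mentioned on the earlier slides only matters for short runs: as n grows, the k-cycle fill cost is amortized away.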

79 Definitions: Prefetch (Instruction Prefetch) l Bringing an instruction to the execution engine before it is reached by the instruction pointer (ip) is called instruction prefetch l Generally this is done because some other knowledge exists suggesting that the instruction will likely be executed soon l It is possible to have a branch in between, in which case the prefetch may have been wasted

80 Definitions: Priming l Filling the various modules of a pipelined processor (the stages) with different instructions, up to the point of retirement of the first instruction, is called priming l Antonym: flushing

81 Definitions: Register Definition l If an arithmetic or logical operation places its result into register r i, we say that r i is being defined l Synonym: writing a register l Antonym: register use

82 Definitions: Reservation Table l A table that shows which hardware resource i (AKA module m i) is being used at which cycle of a multi-cycle instruction l Typically, an X written in the Reservation Table matrix indicates use l An empty field indicates the corresponding resource is free during that cycle

83 Definitions: Retire l When all parts of an instruction have successfully migrated through all execution stages, that instruction is complete l Hence it can be discarded; this is called being retired l All its results have been posted

84 Definitions: Stall l If instruction i requires an operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists dependence between i and j; the wait thus created is called a stall l A stall prevents the two instructions from being executed simultaneously, since instruction i must wait for the other to complete. See also: hazard, interlock l A stall can also be caused by a HW resource conflict: some earlier instruction i may use HW resource m while another instruction j needs m l Generally j has to wait until i frees m, causing a stall for j
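A data-dependence stall can be demonstrated with a toy issue scheduler. This is a hypothetical sketch, not the course's machine model: each instruction is a (destination, sources) pair, results become available a fixed number of cycles after issue, and at most one instruction issues per cycle.

```python
# Toy sketch (assumed model): an instruction stalls until all of its source
# operands have been computed by earlier instructions.
def schedule(instrs, latency=2):
    ready_at = {}                 # register -> cycle its value is available
    issue_cycles = []
    cycle = 0
    for dest, srcs in instrs:
        # interlock: wait until every source operand is ready
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        issue_cycles.append(cycle)
        ready_at[dest] = cycle + latency
        cycle += 1                # at most one issue per cycle
    return issue_cycles

prog = [("r1", ["r2", "r3"]),     # r1 defined here
        ("r4", ["r1", "r3"]),     # uses r1 -> RAW dependence, must stall
        ("r5", ["r2", "r3"])]     # independent, issues right after
print(schedule(prog))             # [0, 2, 3]
```

The second instruction would have issued in cycle 1, but waiting for r1 delays it to cycle 2; no flushing is needed, exactly as the Flushing slide states, because simply waiting resolves the conflict.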


More information

ECE/CS 250 Computer Architecture

ECE/CS 250 Computer Architecture ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck

More information

Contents. Chapter 3 Combinational Circuits Page 1 of 36

Contents. Chapter 3 Combinational Circuits Page 1 of 36 Chapter 3 Combinational Circuits Page of 36 Contents Combinational Circuits...2 3. Analysis of Combinational Circuits...3 3.. Using a Truth Table...3 3..2 Using a Boolean Function...6 3.2 Synthesis of

More information

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT CSE 560 Practice Problem Set 4 Solution 1. In this question, you will examine several different schemes for branch prediction, using the following code sequence for a simple load store ISA with no branch

More information

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation

More information

Lecture 12: Energy and Power. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 12: Energy and Power. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 12: Energy and Power James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L12 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today a working understanding of

More information

CMPT-150-e1: Introduction to Computer Design Final Exam

CMPT-150-e1: Introduction to Computer Design Final Exam CMPT-150-e1: Introduction to Computer Design Final Exam April 13, 2007 First name(s): Surname: Student ID: Instructions: No aids are allowed in this exam. Make sure to fill in your details. Write your

More information

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates ECE 250 / CPS 250 Computer Architecture Basics of Logic Design Boolean Algebra, Logic Gates Benjamin Lee Slides based on those from Andrew Hilton (Duke), Alvy Lebeck (Duke) Benjamin Lee (Duke), and Amir

More information

Lecture 12: Pipelined Implementations: Control Hazards and Resolutions

Lecture 12: Pipelined Implementations: Control Hazards and Resolutions 18-447 Lectre 12: Pipelined Implementations: Control Hazards and Resoltions S 09 L12-1 James C. Hoe Dept of ECE, CU arch 2, 2009 Annoncements: Spring break net week!! Project 2 de the week after spring

More information

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide) Out-of-order Pipeline Buffer of instructions Issue = Select + Wakeup Select N oldest, read instructions N=, xor N=, xor and sub Note: ma have execution resource constraints: i.e., load/store/fp Fetch Decode

More information

LABORATORY MANUAL MICROPROCESSOR AND MICROCONTROLLER

LABORATORY MANUAL MICROPROCESSOR AND MICROCONTROLLER LABORATORY MANUAL S u b j e c t : MICROPROCESSOR AND MICROCONTROLLER TE (E lectr onics) ( S e m V ) 1 I n d e x Serial No T i tl e P a g e N o M i c r o p r o c e s s o r 8 0 8 5 1 8 Bit Addition by Direct

More information

Pattern History Table. Global History Register. Pattern History Table. Branch History Pattern Pattern History Bits

Pattern History Table. Global History Register. Pattern History Table. Branch History Pattern Pattern History Bits An Enhanced Two-Level Adaptive Multiple Branch Prediction for Superscalar Processors Jong-bok Lee, Soo-Mook Moon and Wonyong Sung fjblee@mpeg,smoon@altair,wysung@dspg.snu.ac.kr School of Electrical Engineering,

More information

The Design Procedure. Output Equation Determination - Derive output equations from the state table

The Design Procedure. Output Equation Determination - Derive output equations from the state table The Design Procedure Specification Formulation - Obtain a state diagram or state table State Assignment - Assign binary codes to the states Flip-Flop Input Equation Determination - Select flipflop types

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Scalable Store-Load Forwarding via Store Queue Index Prediction

Scalable Store-Load Forwarding via Store Queue Index Prediction Scalable Store-Load Forwarding via Store Queue Index Prediction Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu addr addr addr (CAM) predictor

More information