ECE 172 Digital Systems. Chapter 12 Instruction Pipelining. Herbert G. Mayer, PSU Status 7/20/2018



2 Syllabus l Scheduling on Pipelined Architecture l Idealized Pipeline l Goal of Scheduling l Causes for Dependences l Stalls and Hazards l Realistic Constraints l Reservation Tables l Collision Vector l Vertical Expansion l Horizontal Expansion l IBM Measurements l Summary l Bibliography 2

3 Scheduling on Pipelined Architecture l Pipelining: ancient computer system design method for accelerated processor execution, originally on large mainframes, and long since on μp l Pipelining improves performance not by adding HW, but by separating individual HW modules of a uniprocessor (UP) architecture l Instead of designing one composite, complex piece of HW for a CPU, the architect for a pipelined μp designs a sequence of simpler and thus faster, consecutive modules l Ideally all i HW modules m i would be of similar complexity and have similar timing needs l These separate modules m i are each significantly simpler than the original composite, and execute overlapped, simultaneously progressing on more than one machine instruction at any one time 3

4 Scheduling on Pipelined Architecture l Instead of executing one complex instruction in one longer cycle required for some complex step, a pipelined architecture executes a sequence of multiple, simpler, faster, single-cycle sub-instructions l Each simple sub-instruction thus is faster to execute than any complete and complex single-instruction l Such single-cycle, pipelined sub-instructions are initiated once per short clock-cycle l Each instruction then progresses to completion while migrating through the various stages of separate hardware modules m i, called the pipeline l That pipeline takes cycles to fill (AKA to prime) and to flush; that is overhead cost 4

5 Scheduling on Pipelined Architecture Left: Traditional Hardware Architecture Right: equivalent Pipelined Architecture [diagram: both designs use the stages I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store; the traditional design runs them as one composite unit per instruction, while the pipelined design repeats the stage sequence, overlapped, for successive instructions] 5

6 Idealized Pipeline l Each Arithmetic Logic Unit (ALU) operation is broken into i separate, natural, sequential modules m i l Each of which can be initiated once per cycle, but with a way shorter, pipelined clock cycle than the original, complex instruction with a slow clock l Each module m i is replicated in HW just once, like on a regular UP that is not pipelined l Note exceptions, when some module is used more than once by the original, non-pipelined instruction: OK to duplicate l Example: normalize operation in FP instruction l Or use of multiple FP-add in FP-multiply operations l Multiple modules operate in parallel on different instructions, at different stages of each instruction 6

7 Idealized Pipeline l Ideally and simplistically, all modules require unit time (1 cycle)! Ideal and simplistic only! l Ideally, all operations (fp-add, divide, fetch, store, increment by integer 1, etc.) require the same number of i steps to completion l But they do not! E.g. fp-divide takes way longer than, say, an integer increment, or a no-op l Differing numbers of cycles per instruction do cause different terminations l Operations may abort in intermediate stages, e.g. in case of a pipeline hazard, caused by: branch, call, return, conditional branch, exception l A pipelined operation also must stall in case of operand dependence 7

8 5-Stage Pipeline 8

9 5-Stage Pipeline, Alternate 9

10 Super Terms l Supercomputer: Main frame with very high clock rate. Clock may be artificially increased, enabled by thermal cooling of the processor. May be uni- or multi-processor l Superscalar: Uni-processor architecture that has some arithmetic-logical units replicated. HW can detect whether 2 sequential instructions happen to be independent, e.g. the output of the 1st is not an input of the 2nd. And if HW resources are available, such sequential instructions 1 and 2 can be executed simultaneously. Is still a scalar, uni-processor architecture, with some HW replication l Super Pipelined: Uni-processor that is pipelined, but the number of stages is unusually high, typically above a dozen stages. E.g. Intel Willamette architecture is superpipelined. Numerous stages create liability, as branch prediction must be exceedingly successful, lest pipes hold hazards 10

11 Superscalar 5-Stage Pipeline 11

12 Superpipelined 12

13 Goal of Scheduling, Obstacles to Scheduling 13

14 Goal of Scheduling l Goal: instruction completion at a rate way faster than would be possible without pipelining l Ideally CPI = 1, number of Cycles Per Instruction l Program completion time on pipelined architecture is shorter than on non-pipelined architecture, achieved by having separate hardware modules progress in parallel on multiple instructions at the same time l Pipelined instructions are retired in original, sequential order, or semantically equivalent order l Stalls and hazards must be minimized: via branch prediction l HW resolves dependence conflicts (hazards) via interlocking whenever branch prediction fails 14

15 Causes for Dependences l A load into register r i in one instruction, followed by use of register r i : True Dependence, AKA Flow Dependence l Load into register r i in one instruction (definition), followed by use of any register (if HW fails to check register id; e.g. early HP PA); no longer an issue on contemporary processors l Definition of register r i in one instruction (other than a load), followed by use of register r i ; AKA True Dependence l Store into memory followed by a load from memory; unless the memory subsystem checks whether the load comes from the same address as the earlier store; if not, no need to wait for store completion l So done in PCI-X protocols (HW protocol: Peripheral Component Interconnect, Extended) 15

16 Causes for Dependences l Use of register r i in one instruction, followed by load into that same register r i : Anti Dependence l Load into register r i in one instruction, followed by load into register r i in a later instruction with use of register r i in between: Output Dependence l We'll learn that both of these latter are false dependences l False dependences yield design opportunities for the digital HW system designer! 16
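One such design opportunity (not named on this slide, but the standard technique) is register renaming: every new definition is given a fresh physical register, which makes anti and output dependences disappear. A minimal Python sketch, with a hypothetical register encoding:

```python
def rename(prog, n_logical):
    """prog: list of (dest, sources) over logical registers 0..n_logical-1.
    Returns the same program over physical registers, one fresh register
    per definition -- this removes anti and output dependences."""
    mapping = list(range(n_logical))   # logical -> current physical
    next_phys = n_logical
    out = []
    for dest, srcs in prog:
        srcs_p = [mapping[s] for s in srcs]   # read the current mappings
        mapping[dest] = next_phys             # fresh register per definition
        next_phys += 1
        out.append((mapping[dest], srcs_p))
    return out

# r1 = f(r0); r0 = g(r2): an anti-dependence on r0. After renaming, the
# second definition writes a fresh register, so the two may overlap.
print(rename([(1, [0]), (0, [2])], 3))  # -> [(3, [0]), (4, [2])]
```

True (flow) dependences survive renaming unchanged; only the false ones vanish.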

17 Basic Block l Basic Block (BB) is CS technical term l BB is sequence of 1 instruction or more, with one single entry point, and one single exit point l Entry point can be fall through from previous instruction, or explicit branch to that instruction l Exit Point can be explicit branch away from here, or the following instruction may be an entry point (i.e. may be target of some branch) 17

18 Basic Block: Find Dependences
-- result: is left operand after opcode, except for st
-- other operands, if any, are sources
-- Mxx addresses Memory at xx, implies indirection for ld
-- Parens () in (Mxx) render indirection explicit
-- 8(sp) means indirect through sp register, offset by 8
-- #4 stands for literal value 4, decimal
1 ld  r2, (M0)
2 add sp, r2, #12
3 st  r0, (M1)
4 ld  r3, -4(sp)
5 ld  r4, -8(sp)
6 add sp, sp, #4
7 st  r2, 0(sp)
8 ld  r5, (M2)
9 add r4, r0, #1
18

19 Basic Block: Find Dependences
1-2 load of a register followed by use of that register
3-4 load from memory at -4(sp) while write to memory in progress (M1)
3-5 load from memory at -8(sp) while write to memory in progress (M1)
2-4, 2-5 define register sp, followed by use of same register; distance sufficient to avoid stall on typical architectures
6-7 define register sp before use; forces sequential execution, reduces pipelining
7-8 store followed by load!
8-9 load into register r5 followed by use of any register (on early simple architectures, e.g. early PA)
19
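The register dependences listed above can be found mechanically. A minimal Python sketch, covering only the register flow (RAW) cases; the memory-ordering pairs 3-4, 3-5, and 7-8 would additionally need address disambiguation, which this sketch does not attempt:

```python
# The Basic Block from the previous slide, encoded as
# (line, opcode, dest, sources); memory operands are modeled as "mem".
prog = [
    (1, "ld",  "r2",  ["mem"]),
    (2, "add", "sp",  ["r2"]),
    (3, "st",  "mem", ["r0"]),
    (4, "ld",  "r3",  ["mem", "sp"]),
    (5, "ld",  "r4",  ["mem", "sp"]),
    (6, "add", "sp",  ["sp"]),
    (7, "st",  "mem", ["r2", "sp"]),
    (8, "ld",  "r5",  ["mem"]),
    (9, "add", "r4",  ["r0"]),
]

def flow_dependences(prog):
    """Find RAW (true/flow) dependences: a register defined by an earlier
    instruction and read by a later one, with no redefinition in between."""
    deps = []
    for i, (li, _, dest_i, _) in enumerate(prog):
        if dest_i == "mem":
            continue
        for lj, _, dest_j, srcs_j in prog[i + 1:]:
            if dest_i in srcs_j:
                deps.append((li, lj))
            if dest_j == dest_i:      # redefinition kills the dependence
                break
    return deps

print(flow_dependences(prog))
# -> [(1, 2), (1, 7), (2, 4), (2, 5), (2, 6), (6, 7)]
```

This recovers the slide's register pairs 1-2, 2-4, 2-5, 6-7, plus the longer-distance uses 1-7 and 2-6.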

20 Stalls and Hazards l Hardware interlock slows down execution due to unavailability of a dependent operand in some instruction l Benefit of slowing down is: correct result! Slowing down AKA interlock l Cost is sequential execution with delay! l Programmer can sometimes re-arrange instructions or insert delays at select places l Compiler can re-schedule instructions, or insert delays l Unless the programmer's or compiler's efforts are provably complete, HW interlock must still be provided 20

21 Stalls and Hazards l CDC 6000 and IBM360/91 already used automatic hardware interlock l Advisable to have compiler re-schedule the instruction sequence, since re-ordering may minimize the number of interlocks needing to occur l Clearly depends on target architecture 21

22 Stalls and Hazards l Not all HW modules are being used exactly once and not only for one single cycle l Some HW modules m i are used more than once in one instruction; e.g. the normalizer in floating-point operations is used repeatedly l Basic Block analysis is insufficient to detect all stalls or hazards; they may span separate Basic Blocks l Reminder Basic Block: sequence of 1 instruction or more with single entry, single exit point; i.e. no intervening branches or branch targets! 22

23 Reservation Tables & Collision Vectors 23

24 Reservation Tables time progresses left-to-right, HW modules indexed m1 through m6

     t1  t2  t3  t4  t5  t6  t7  t8
m6                       i1  i2  i3
m5                   i1  i2  i3
m4               i1  i2  i3
m3           i1  i2  i3
m2       i1  i2  i3
m1   i1  i2  i3

Table 1: Instructions i 1 to i 3 use 6 HW modules m i 24

25 Reservation Tables l Table 1, known as reservation table for pipelined HW, shows ideal schedule using hardware modules m 1 to m 6, required for execution of one instruction l Ideal: because each requires exactly 1 cycle per m i, and each HW module is used exactly once l Always 1 cycle is also unrealistic, yet simple for didactic purposes in class l Time to complete 3 instructions i 1, i 2, and i 3 is 8 cycles, while the time for any single instruction is 6 cycles, net time saving; better than 3 * 6 = 18 cycles! l Completion time for instruction during steady state is 1 cycle in pipelined architecture 25
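The 8-cycle figure follows the usual fill-plus-drain arithmetic: an ideal k-stage pipeline completes n instructions in k + (n - 1) cycles. A quick sketch:

```python
def pipeline_cycles(stages: int, instructions: int) -> int:
    """Total cycles on an ideal pipeline: `stages` cycles to fill (prime)
    for the first instruction, then one completion per cycle."""
    return stages + (instructions - 1)

print(pipeline_cycles(6, 3))   # -> 8, as in Table 1
print(pipeline_cycles(6, 1))   # -> 6: a single instruction is not sped up
# non-pipelined equivalent: 3 * 6 = 18 cycles
```

The gap between 8 and 18 cycles is exactly the steady-state benefit the slide describes; the first 5 cycles are the priming overhead.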

26 Reservation Tables l On a non-pipelined architecture any of these 3 instructions would NOT necessarily take these same 6 cycles of the pipelined machine; perhaps 4 or 5 l Also for fairness sake it is not usual that 3 identical instructions are arranged one after the other l That simplistic model is used here merely to explain Reservation Tables 26

27 Reservation Tables Key learning: pipelined architecture does NOT speed up execution of a single instruction, may even slow it down; but improves throughput of multiple instructions in a row: real benefit: during steady state

     t1  t2  t3  t4  t5  t6  t7  t8
m6                       i1  i2  i3
m5                   i1  i2  i3
m4               i1  i2  i3
m3           i1  i2  i3
m2       i1  i2  i3
m1   i1  i2  i3

Table 1 Repeated: Instructions i 1 to i 3 use 6 HW modules m i 27

28 Reservation Tables l Table 2 below shows a more realistic schedule of HW modules, required by a repeated instruction l In this schedule some modules are used back-to-back; for example m 3 is used 3 times in a row, and m 6 4 times; typical in FP divide l But these cycles are contiguous; and even that is not always a realistic constraint l Instead, a HW module m i may be used at various moments during the execution of a single instruction l The schedule in Table 2 attempts to exploit the greedy approach for instruction i 2, initiated as soon as possible after i 1 l Note: does not always minimize completion time! 28

29 Reservation Tables l We could have let instruction i 2 start at cycle t 4 and no additional delay would be caused for or by m 3 l Or instruction i 2 could start at t 3 with just one additional delay due to m 3 l However, in both cases m 6 would cause a delay later on anyway l To schedule i 2 we must consider these multi-use resources, m 3 and m 6, that are in use continuously l In case of a load, the actual module would wait many more cycles, until the data arrive, but would not progress until then 29

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                               i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                           i1          i2  d2              i3
m4                       i1          i2                  i3
m3           i1  i1  i1  i2  i2  i2          i3  i3  i3
m2       i1  i2  d2  d2                  i3
m1   i1  i2  d   d   d   d   d   d   i3

30 Reservation Tables

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19
m6                               i1  i1  i1  i1  i2  i2  i2  i2  i3  i3  i3  i3
m5                           i1          i2  d2              i3
m4                       i1          i2                  i3
m3           i1  i1  i1  i2  i2  i2          i3  i3  i3
m2       i1  i2  d2  d2                  i3
m1   i1  i2  d   d   d   d   d   d   i3

Table 2: Instructions i 1 to i 3 use 6 HW modules m j for 1..4 cycles 30

31 Reservation Tables l Instead of using the single resources m 3 and m 6 repeatedly and continuously, an architect can replicate them as many times as simultaneously needed in some HW operation l Replication costs more hardware, and does not speed up execution of one single instruction l For a single operation, all would still have to progress in sequence l But it avoids delaying the start of subsequent instructions that need the same HW module l See Table 3: shaded areas indicate the duplicated modules 31

32 Reservation Tables

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13
m6,4                                          i1  i2  i3
m6,3                                      i1  i2  i3
m6,2                                  i1  i2  i3
m6                                i1  i2  i3
m5                            i1  i2  i3
m4                        i1  i2  i3
m3,3                  i1  i2  i3
m3,2              i1  i2  i3
m3            i1  i2  i3
m2        i1  i2  i3
m1    i1  i2  i3

Table 3: Instructions i 1 to i 3 use replicated HW modules m 3 and m 6 32

33 Reservation Tables l These replicated circuits in Table 3 do not speed up the execution of any individual instruction l But by avoiding the delay for other instructions, a higher degree of parallelism is enabled, and multiple instructions can retire earlier l Even this is unrealistically simplistic l Some of the modules m i are used for more than one cycle, but not necessarily in sequence l Instead, a Reservation Table offers a more realistic representation l Use Reservation Table in Table 4 to figure out, how closely the same instruction can be scheduled back-to-back 33

34 Collision Vector CV l Collision vector identifies, how soon 2 instructions can be scheduled successively, one after the other: l Best case in general: the next identical instruction can be scheduled at the next cycle! l Worst case in general: next instruction must be scheduled n cycles after the start of the first, with first requiring n cycles to complete! l Goal for HW designer: find, how many instructions can be initiated between start and completion? l To analyze this for speed, we use the Reservation Table and Collision Vector (CV) 34

35 Collision Vector CV l Goal: Find CV by overlapping two identical Reservation Tables (e.g. plastic transparencies) within the window of the cycles of one operation l If, after shifting a second, duplicate transparency with i = 1..n-1 time steps, two resource-marks of a row land on the same field, we have a collision: both instructions claim a resource at the same time! l Collision means: the second instruction cannot yet be scheduled. So mark field i in the CV with a 1 l Otherwise mark field i with a 0, or leave blank l Do so n-1 times, and the CV is complete. But do check for all rows, i.e. for all HW modules m j 35

36 Collision Vector CV

     t1  t2  t3  t4  t5  t6  t7
m1   X
m2       X   X
m3               X   X
m4                   X       X
m5                       X

Table 4: Reservation Table for 7-step, 5-Module instruction
Table 5: Find Collision Vector for above instruction
Collision Vector has n-1 entries for some n-cycle instruction 36

37 Collision Vector CV l If a second instruction of the kind shown in Table 4 were initiated 1 cycle after the first, resource m 2 will cause a conflict l Because instruction 2 requires m 2 at cycles 3 and 4 l However, instruction 1 is already using m 2 at cycles 2 and 3 l At step 3 a resource conflict arises l Also resource m 3 would cause a conflict l The good news, however, is that this double-conflict causes no further entry in CV 37

38 Collision Vector CV l Similarly, a new instruction cannot be initiated 2 cycles after the start of the first l This is because a second instruction requires m 4 at cycles t 7 and t 9 l However, instruction 1 is already using m 4 at t 5 and t 7. At step t 7 there would be a conflict l At all other steps a second instruction may be initiated. See the completed CV in Table 6 below:

1 1 0 0 0 0

Table 6: Collision Vector for above 7-cycle, 5-module instruction 38
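The transparency-shifting procedure of slide 35 is easy to mechanize. A hedged Python sketch: the uses of m 2 (cycles 2, 3) and m 4 (cycles 5, 7) come from the text above, while the placements of m 1, m 3, and m 5 are one consistent reading of Table 4, not given explicitly:

```python
def collision_vector(table: dict, n: int) -> list:
    """Shift the reservation table against itself by 1..n-1 cycles;
    entry k is 1 if any module row is claimed twice at the same cycle."""
    cv = []
    for shift in range(1, n):
        clash = any(set(cycles) & {c + shift for c in cycles}
                    for cycles in table.values())
        cv.append(1 if clash else 0)
    return cv

# m2 at t2,t3 and m4 at t5,t7 per the text; m1, m3, m5 placed as one
# consistent reading of Table 4 (an assumption, for illustration)
table4 = {"m1": [1], "m2": [2, 3], "m3": [4, 5], "m4": [5, 7], "m5": [6]}
print(collision_vector(table4, 7))  # -> [1, 1, 0, 0, 0, 0]
```

Shifts 1 and 2 collide (via m 2/m 3 and m 4 respectively), all later shifts are free, matching Table 6.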

39 Reservation Table: Main Example l The next example is an abstract instruction of a hypothetical microprocessor, characterized by Reservation Table 7; it is 7 cycles long, using 4 HW modules m1 to m4 l Analysis:...

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

Table 7: Reservation Table 7 for 7-cycle, 4-module Main Example 39

40 Reservation Table: Main Example l The Collision Vector for the Main Example says: We can start a new instruction of the same kind at step t 6 or t 7 l Of course, we can always start a new instruction, identical or of another type, after the current one has completed; no resource will be in use then l The challenge is to start another one while the current is still executing, to maximize parallel execution

1 1 1 1 0 0

Table 8: Collision Vector for Main Example

l Show, that by adding delays we can sometimes speed up execution of pipelined ops! l That is processor architecture beauty! 40

41 Main Example Pipelined l For Main Example, initiate a second, pipelined instruction Y at step t 6, i.e. 5 cycles after start of X l Greedy Approach to pipeline X and Y as follows:

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1   X       X   X       Y       Y   Y       Z
m2       X               X   Y               Y   Z
m3                   X       X           Y       Y
m4               X                   Y

Table 9: Pipelining 2 Instructions of Main Example

l Observe two-cycle overlap, achievable speed-gain l Starting Y earlier (greedy approach) would create delays, but not retire Y any earlier 41

42 Main Example Pipelined l The 3 rd pipelined instruction Z can start at time step t 11, by which time the first X is retired, the second, named Y, is partly through l The fourth instruction can start at step t 16, etc.

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17
m1   X       X   X       Y       Y   Y       Z       Z   Z
m2       X               X   Y               Y   Z               Z
m3                   X       X           Y       Y           Z       Z
m4               X                   Y                   Z

Table 10: Pipelining 3 Instructions of Main Example 42

43 Main Example Pipelined l Though the Reservation Table, Table 7 for the Main Example, is sparsely populated, a high degree of pipelining is not possible l The maximum overlap is 2 cycles l Can one infer this low degree of pipelining from the Collision Vector alone? l During pipelining we achieve 5 cycles per instruction retirement in the steady state, cpi = 5 l That means 5 cycles per completion of an instruction, assuming the same instruction is executed over and over again, once the steady state is reached!

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

l We'll come back to the Main Example and analyze it after further study of Examples 2 and 3 43

44 Pipeline Example 2 l Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, 24 fields, 6 steps, density = 7/24 ≈ 0.29 l Main Example had 8 entries in 28 fields, density = 8/28 ≈ 0.29 l We'll attempt to pipeline as many identical Example 2 instructions as possible

     t1  t2  t3  t4  t5  t6
m1   X                   X
m2       X           X
m3           X
m4                   X   X

Table 11: Reservation Table for Example 2
Table 12: Collision Vector for Example 2 Figured out by Students 44

45 Pipeline Example 2, With CV l Reservation Table for Example 2 has 7 entries, i.e. 7 X-es, 24 fields, 6 steps, density = 7/24 ≈ 0.29 l Main Example had 8 entries in 28 fields, density = 8/28 ≈ 0.29 l We'll attempt to pipeline as many identical Example 2 instructions as possible

     t1  t2  t3  t4  t5  t6
m1   X                   X
m2       X           X
m3           X
m4                   X   X

Table 11: Reservation Table for Example 2

1 0 1 0 1

Table 12: Collision Vector for Example 2 45

46 Pipeline Example 2 l The Collision Vector suggests to initiate a new pipelined instruction at time t 3, t 5, t 7, etc. l That would allow 3 instructions X, Y, and Z simultaneously, overlapped, pipelined. By step t 7 the first instruction would already be retired

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
m1   X       Y       Z   X   A   Y   B   Z       A       B
m2       X       Y   X   Z   Y   A   Z   B   A       B
m3           X       Y       Z       A       B
m4                   X   X   Y   Y   Z   Z   A   A   B   B

Table 13: Schedule for Pipelining Example 2 46

47 Pipeline Example 2 l Example 2 is lucky to pipeline 3 identical instructions at the same time l Caution: The CV is not a direct indicator. The reader was mildly misled to make inferences that don't strictly follow l However, if all positions in the CV were marked 1, there would be no pipelining l For Example 2 the number of cycles per instruction retirement is an amazing cpi = 2 l Even though the operation density is slightly higher than in the Main Example, the pipelining overlap in Example 2 is significantly higher, which is counter-intuitive! l On to Example 3! 47
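The schedule of Table 13 can be reproduced by a greedy simulator that starts each new instruction at the earliest conflict-free cycle. A sketch; the module/cycle placements are one consistent reading of Table 11:

```python
def greedy_schedule(table, count):
    """Initiate `count` identical instructions; each starts at the earliest
    cycle at which no HW module is claimed twice in any cycle."""
    busy = {m: set() for m in table}          # module -> occupied cycles
    starts, t = [], 1
    for _ in range(count):
        while any(busy[m] & {t - 1 + c for c in table[m]} for m in table):
            t += 1
        for m in table:
            busy[m] |= {t - 1 + c for c in table[m]}
        starts.append(t)
        t += 1
    return starts

# Module usage cycles: one consistent reading of Table 11 (Example 2)
table11 = {"m1": [1, 6], "m2": [2, 5], "m3": [3], "m4": [5, 6]}
print(greedy_schedule(table11, 5))  # -> [1, 3, 5, 7, 9]: one start every 2 cycles, cpi = 2
```

For Example 2 the greedy choice happens to be optimal; the Main Example below shows that greedy initiation is not optimal in general.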

48 Pipeline Example 3 l Interesting to see Example 3, analyzing the Collision Vector, to see how much we can parallelize! l The Reservation Table has numerous resource fields filled, yet the Collision Vector is sparser than the one in Example 2

     t1  t2  t3  t4  t5  t6
m1   X       X       X
m2       X       X       X
m3   X       X       X
m4       X       X       X

Table 14: Reservation Table for Example 3

0 1 0 1 0

Table 15: Collision Vector for Example 3 48

49 Pipeline Example 3 l The Collision Vector (CV) suggests to start a new pipelined instruction 1, 3, or 5 cycles after initiation of the first l CV of Example 3 is less densely packed with 1s than Example 2, where we could overlap 3 identical instructions and get a rate of cpi = 2 l Goal now: find the best cpi rate for Example 3

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1   X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m2       X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1
m3   X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1  Y2
m4       X   Y   X   Y   X   Y   X1  Y2  X1  Y2  X1

Table 16: Schedule for Pipelining Example 3 49

50 Pipeline Example 3 l Example 2 earlier with Collision Vector allows a higher degree of pipelining l In Example 3, cpi = 3, every 6 cycles two instructions can retire l Contrast to cpi = 2 of Example 2 l Reason for the lower retirement rate is clear: l All 4 HW modules are used every other cycle by one of two instructions, thus one cannot overlap more than twice! l The non-pipelined cpi rate for Example 3 is cpi = 6, the pipelined rate is cpi = 3 50

51 Vertical Expansion for Example 3 l If we need a higher degree of pipelining for Example 3 with a fill-factor of 0.5, we must pay! Vertically with more hardware, or horizontally with more time for added delays l Let's analyze a vertically expanded Reservation Table now with 8 Modules; every hardware resource m 1 to m 4 replicated once; lower new density = 0.25

      t1  t2  t3  t4  t5  t6
m1    X               X
m2        X               X
m3            X
m4                X
m1,2          X
m2,2              X
m3,2  X               X
m4,2      X               X

Table 17: Reservation Table Example 3 with Replicated HW 51

52 Vertical Expansion for Example 3 l Let us pipeline multiple identical instructions for Reservation Table 17 as densely as possible l With twice the HW, can we overlap perhaps twice as much? The previous rate with half the hardware was cpi = 3. Ideal would be cpi = 1.5; a plausible schedule is shown in Table 18

      t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12
m1    X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m2        X   Y   Z   A   X   Y   Z   A   X   Y   Z
m3            X   Y   Z   A                   X   Y
m4                X   Y   Z   A                   X
m1,2          X   Y   Z   A                   X   Y
m2,2              X   Y   Z   A                   X
m3,2  X   Y   Z   A   X   Y   Z   A   X   Y   Z   A
m4,2      X   Y   Z   A   X   Y   Z   A   X   Y   Z

Table 18: Schedule for Pipelining Example 3 52

53 Vertical Expansion for Example 3 l Initiation and retirement rates are 4 instructions per 8 cycles, cpi = 2 l This is, as suspected, better than the rate of the original Example 3, not surprising with double the hardware modules l But this is not twice as good a retirement rate, despite twice the HW l Original rate was cpi = 3, the improved rate with double the hardware is cpi = 2 53

54 Horizontal Expansion, Main Example l Next case, a variation of the Main Example, shows an expansion of the Reservation Table horizontally; isn't this counter-intuitive? l I.e. delays are built-in; HW modules are kept constant l Only the 4 modules m 1 to m 4 from the Main Example are provided l Motivation of an architect: if a delay can speed up execution, by all means build it in: delays are cheap! l Common sense tells us delays tend to slow down; a real HW architect looks also at counter-intuitive situations 54

55 Horizontal Expansion, Main Example l After Examples 2 and 3, we expand the Main Example, repeated below, by adding delays, AKA Horizontal Expansion l If we insert delay cycles, clearly execution for a single instruction will slow down l However, if this yields a sufficient increase in the overall degree of pipelining, more parallelism, it may still be a win l Building circuits to delay an instruction is low cost l We analyze this variation next: 55

56 Horizontal Expansion, Main Example

     t1  t2  t3  t4  t5  t6  t7
m1   X       X   X
m2       X               X
m3                   X       X
m4               X

Table 19: Original Reservation Table for Main Example Inserting a Delay Cycle after t 3, will be new step t 4 56

57 Horizontal Expansion, Main Example l We'll insert delays; but where? A systematic way to compute the optimum position is not shown here l Instead, we'll suggest a sample position for a single delay and analyze the performance impact l Table 20 shows the delay inserted after t 3, new t 4

     t1  t2  t3  t4  t5  t6  t7  t8
m1   X       X       X
m2       X                   X
m3                       X       X
m4                   X

Table 20: Reservation Table for Main Example with 1 Delay, at t 4

0 1 0 1 1 0 0

Table 21: Collision Vector for Main Example with 1 Delay 57

58 Horizontal Expansion, Main Example l The Greedy Approach is to schedule instruction Y as soon as possible, when the CV has a 0 entry l This would lead us to initiate a second instruction Y at time step t 2, one cycle after instruction X. Is this optimal?

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16
m1   X   Y   X   Y   X   Y       Z   A   Z   A   Z   A
m2       X   Y               X   Y   Z   A               Z   A
m3                       X   Y   X   Y               Z   A   Z   A
m4                   X   Y                       Z   A

Table 22: Schedule for Main Example Pipelined Instructions X, Y, Z, A With Delay Slot, Using Greedy Approach 58

59 Horizontal Expansion, Main Example l Initiation and retirement rates are 2 instructions every 7 cycles, or cpi = 3.5; see the purple header at each retired instruction in Table 22 l This is already better than cpi = 5 for the original Main Example without the delay l Hence we have shown that adding delays can speed up throughput of pipelined instructions l But can we do better? l After all, we have only tried out the first sample of a Greedy Approach! l Careful: Greedy smacks of short-sightedness, something EEs have to avoid 59

60 Horizontal Expansion, Main Example l In this experiment we start the second instruction at cycle t 4, three cycles after the start of the first l Which cpi rate shall we get? See Table 23

     t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18
m1   X       X   Y   X   Y   Z   Y   Z   X   Z   X   Y   X   Y   Z   Y   Z
m2       X           Y       X   Z       Y   X       Z   Y       X   Z
m3                       X       X   Y       Y   Z       Z   X       X   Y
m4                   X           Y           Z           X           Y

Table 23: Another Schedule for Pipelined Main Example, with delay Initiation later than the first opportunity in Table 22 Result: better throughput! Message: Starting later can improve performance! 60

61 Horizontal Expansion, Main Example l Patient schedule of Table 23 completes one identical instruction every 3 cycles in the steady state l Purple cells indicate instruction moment of retirement l X retires at completion t 8, Y after t 11, and Z after t 14 l Then X again after t 17 l Now cpi = 3 with the Not-So-Greedy Approach l Key learning: To speed up pipelined execution, one can sometimes enhance throughput by adding delay circuits, or by replicating hardware, or postponing instruction initiation, or a combination l The greedy approach is not necessarily optimal l The collision vector only states, when one cannot initiate a new instruction (value 1); a 0 value is not a solid hint for initiating a new instruction 61
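The greedy-versus-patient comparison can be checked mechanically. A sketch; the module placements for the delayed Main Example are one consistent reading of Table 20:

```python
def conflict_free(table, starts):
    """True if no HW module is claimed twice in any cycle when identical
    instructions are initiated at the given start cycles."""
    for cycles in table.values():
        seen = set()
        for s in starts:
            uses = {s - 1 + c for c in cycles}
            if seen & uses:
                return False
            seen |= uses
    return True

# Delayed Main Example (8 cycles): one consistent reading of Table 20
table20 = {"m1": [1, 3, 5], "m2": [2, 7], "m3": [6, 8], "m4": [5]}

greedy  = [1, 2, 8, 9, 15, 16]    # pairs 7 cycles apart: cpi = 3.5
patient = [1, 4, 7, 10, 13, 16]   # one start every 3 cycles: cpi = 3
print(conflict_free(table20, greedy), conflict_free(table20, patient))
# -> True True
```

Both schedules are legal, yet the patient one initiates 6 instructions by t 16 versus the greedy schedule's 6 by t 16 only at the cost of long gaps; in steady state patient retires one instruction every 3 cycles instead of two every 7.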

62 IBM Measurements Agerwala and Cocke 1987; see [1] l Memory Bandwidth: l 1 word/cycle to fetch 1 instruction/cycle from I-cache l 40% of instructions are memory-accesses (load-store) l Those could all benefit from access to D-cache l Code Characteristics, dynamic: l ~25% of all instructions: loads l ~15% of all instructions: stores l ~40% of all instructions: ALU/RR l ~20% of all instructions: Branches 1/3 unconditional 1/3 conditional taken 1/3 conditional not taken 62

63 How Can Pipelining Work? l About 1 out of 4 or 5 instructions will be branches l Branches include all transfer of control instructions; these are: call, return, unconditional and conditional branch, abort, exception and similar machine instructions l If the processor pipeline is deeper than, say, 5 stages, there will almost always be a branch in the pipe, rendering several prefetched operations useless l Some processors (e.g. Intel Willamette, [6]) have over 20 stages. For this type of processor, regular pipelining would practically always cause a stall! 63

64 How Can Pipelining Work? l Remedy is branch prediction l If the processor knows dynamically from which address to fetch, instead of blindly assuming the subsequent code address pc+1, this would eliminate most pipeline flushes l Luckily, branch prediction in the 2010s has become > 97% accurate, only rarely causing the need to re-prime the pipe l Also, processors no longer are designed with the deep pipeline of the Willamette, of 20+ stages l Here we see interesting interactions of several computer architecture principles: pipelining and branch prediction, one helping the other to become exceedingly advantageous 64
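This interaction can be quantified with a simple steady-state model; the numbers below are illustrative only (20% branches from the IBM mix, 97% prediction accuracy, and an assumed 20-cycle refill for a deep pipe):

```python
def effective_cpi(base_cpi, branch_frac, accuracy, flush_penalty):
    """Steady-state CPI when each mispredicted branch flushes the pipe:
    base + (branch fraction) * (mispredict rate) * (refill cycles)."""
    return base_cpi + branch_frac * (1 - accuracy) * flush_penalty

# Illustrative: 20% branches, 97% accurate prediction, 20-cycle refill
print(round(effective_cpi(1.0, 0.20, 0.97, 20), 3))  # -> 1.12
```

With only 80% accuracy the same deep pipe would yield a cpi of 1 + 0.2 * 0.2 * 20 = 1.8, which is why deep pipelines demand exceedingly accurate predictors.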

65 Summary l Pipelining can speed up execution l Yet the speedup is not due to the faster clock rate l That fast clock rate manipulates significantly simpler sub-instructions and cannot be equated with original, i.e. non-pipelined clock l Counter-intuitively, pipelining may even benefit from inserting delays (at the right places) l May also benefit from initiating an instruction later than possible l And benefits, not surprisingly, from added HW resources l Branch prediction is a necessary architecture attribute to make pipelining work fast 65

66 Bibliography
1. Cocke and Schwartz, Programming Languages and their Compilers, unpublished, 1969, portal.acm.org/citation.cfm?id=
2. Harold Stone, High Performance Computer Architecture, AW, 1993
3. cpi rate: Cycles_per_instruction
4. Introduction to PCI: articles/computer-science/protocol/introduction-to-pci-protocol/
5. Wiki PCI page: Conventional_PCI
6. NetBurst_(microarchitecture)
7. PCI-X:
66

67 Some Definitions 67

68 Basic Block Definitions l Sequence of one instruction or more with a single entry point and a single exit point l Entry point may be the destination of a branch, a fall through from a conditional branch, or the program entry point; i.e. destination of an OS jump l Exit point may be an unconditional branch instruction, a call, a return, or a fall-through l Fall-through means: one instruction is a conditional flow of control change, and the subsequent instruction is executed by default, if the change in control flow does not take place l Or fall-through can mean: The successor of the exit point is a branch or call target 68

69 Definitions Collision Vector l Observation: An instruction requiring n cycles to completion may be initiated a second time n cycles after the first without possibility of conflict l For each of the n-1 cycles before that, a further instruction of identical type causes a resource conflict, if initiated l The Boolean vector of length n-1 that represents this fact stating whether or not re-issue is possible is referred to as collision vector l It can be derived from the Reservation Table 69

70 Definitions Cycles Per Instruction: cpi l cpi quantifies how long (how many cycles) it takes for a single instruction to execute l Generally, the number of execution cycles per instruction is > 1 on a CISC architecture l However, on a pipelined UP architecture, where a new instruction is initiated each cycle, it is conceivable to reach a cpi rate of 1, assuming no hazards l Note different meanings of cycle! l On a UP pipelined architecture the cpi rate cannot shrink below one l Yet on an MP or superscalar architecture, the cpi rate may be < 1 70

71 Definitions Dependence l If the logic of the underlying program imposes an order between two instructions, there exists a dependence (data or other dependence) between them l Generally, the order of execution cannot be permuted l It is conventional in Computer Engineering to call this dependence, not dependency 71

72 Definitions: Early Pipelined Computers/Processors 1. CDC 6000 series of the late 1960s 2. CDC Cyber series of the 1970s 3. IBM 360/91 series 4. Intel Pentium 4 and Xeon processor families of the late 1990s and early 2000s

73 Definitions: Flushing l When a hazard occurs due to a change in flow of control, the partially executed instructions after the hazard are discarded l This discarding is called flushing l Antonym: priming l Flushing is not needed in the case of a stall caused by dependences; waiting instead will resolve it

74 Definitions: Hazard l Instruction i+1 is pre-fetched under the assumption that it would be executed after instruction i l Yet after decoding i it becomes clear that i is a control-transfer operation l Hence the subsequently pre-fetched instructions i+1 and onward are wasted l This is called a hazard l A hazard causes part of the pipeline to be flushed, while a stall (caused by data dependence) also causes a delay, but a simple wait will resolve such a stall conflict
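The cost of those wasted prefetches can be sketched with the usual back-of-the-envelope model. All numbers here are invented for illustration, not measurements from the slides:

```python
# Invented-numbers sketch: how flushing on control hazards inflates cpi.
branch_freq = 0.2        # fraction of instructions that transfer control
taken_rate = 0.6         # fraction of those that actually redirect fetch
flush_penalty = 3        # prefetched instructions discarded per redirect

cpi = 1.0 + branch_freq * taken_rate * flush_penalty
print(round(cpi, 2))     # 1.36
```

Even a modest three-instruction flush per taken branch costs this hypothetical machine over a third of a cycle per instruction, which is why branch prediction matters so much on deep pipelines.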

75 Definitions: ILP l Instruction Level Parallelism: an architectural attribute allowing multiple instructions to be executed at the same time l Related: Superscalar

76 Definitions: Interlock l If HW detects a conflict during execution of instructions i and j, with i initiated earlier, such a conflict, called a stall, delays execution of j and perhaps subsequent instructions l Interlock is the architecture's way to respond to and resolve a stall, at the expense of degraded performance l Advantage: computation of the correct result! l Synonym: stall or wait

77 Definitions: IPC l Instructions Per Cycle: a measure of Instruction Level Parallelism. How many different instructions are being executed (not necessarily to completion) during one single cycle? l It is desirable to have an IPC rate > 1 l Ideally, given suitable parallelism, IPC >> 1 l On conventional, non-pipelined UP CISC architectures it is typical to have IPC << 1
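The three regimes named above (IPC << 1, IPC near 1, IPC > 1) can be illustrated with invented instruction and cycle counts; IPC is simply retired instructions divided by elapsed cycles:

```python
# Invented-numbers sketch: IPC = instructions retired / cycles elapsed.
measurements = {
    "non-pipelined CISC UP": (1000, 4000),   # multi-cycle instructions: IPC << 1
    "scalar pipelined UP":   (1000, 1100),   # stalls keep IPC just under 1
    "2-wide superscalar":    (1000, 650),    # multiple issues per cycle: IPC > 1
}
for name, (retired, cycles) in measurements.items():
    print(f"{name}: IPC = {retired / cycles:.2f}")
```

Note that IPC is the reciprocal of cpi, so the superscalar row above corresponds to a cpi below 1.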

78 Definitions: Pipelining l A mode of execution in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete l Highly pipelined Xeon processors, for example, have a greater than 20-stage pipeline
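The fill/drain arithmetic behind this definition is worth making explicit. As my own illustration of the standard model: a k-stage pipeline needs k cycles to prime, after which one instruction retires per cycle, so n instructions need k + (n - 1) cycles rather than the k * n an unpipelined machine of the same depth would need.

```python
# Sketch of the standard pipeline timing model: k cycles to prime,
# then one retirement per cycle.
def total_cycles(k_stages, n_instructions):
    return k_stages + (n_instructions - 1)

print(total_cycles(20, 1))      # 20: one instruction still pays the full depth
print(total_cycles(20, 1000))   # 1019: amortized cost approaches 1 cycle each
```

This is also why the priming and flushing overhead mentioned on the earlier slides only matters for short runs: as n grows, the k-cycle fill cost is amortized away.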

79 Definitions: Prefetch (Instruction Prefetch) l Bringing an instruction to the execution engine before it is reached by the instruction pointer (ip) is called instruction prefetch l Generally this is done because some other knowledge exists suggesting that the instruction will likely be executed soon l It is possible to have a branch in between, in which case the prefetch may have been wasted

80 Definitions: Priming l Filling the various modules of a pipelined processor (the stages) with different instructions, up to the point of retirement of the first instruction, is called priming l Antonym: flushing

81 Definitions: Register Definition l If an arithmetic or logical operation places its result into register r i, we say that r i is being defined l Synonym: writing a register l Antonym: register use

82 Definitions: Reservation Table l A table that shows which hardware resource i (AKA module m i) is being used at which cycle of a multi-cycle instruction l Typically, an X written in the Reservation Table matrix indicates use l An empty field indicates the corresponding resource is free during that cycle

83 Definitions: Retire l When all parts of an instruction have successfully migrated through all execution stages, that instruction is complete l Hence it can be discarded; this is called being retired l All its results have been posted

84 Definitions: Stall l If instruction i requires an operand o that is being computed by another instruction j, and j is not complete when i needs o, there exists dependence between i and j; the wait thus created is called a stall l A stall prevents the two instructions from being executed simultaneously, since instruction i must wait for the other to complete. See also: hazard, interlock l A stall can also be caused by a HW resource conflict: some earlier instruction i may use HW resource m while another instruction j needs m l Generally j has to wait until i frees m, causing a stall for j
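A data-dependence stall can be demonstrated with a toy issue scheduler. This is a hypothetical sketch, not the course's machine model: each instruction is a (destination, sources) pair, results become available a fixed number of cycles after issue, and at most one instruction issues per cycle.

```python
# Toy sketch (assumed model): an instruction stalls until all of its source
# operands have been computed by earlier instructions.
def schedule(instrs, latency=2):
    ready_at = {}                 # register -> cycle its value is available
    issue_cycles = []
    cycle = 0
    for dest, srcs in instrs:
        # interlock: wait until every source operand is ready
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        issue_cycles.append(cycle)
        ready_at[dest] = cycle + latency
        cycle += 1                # at most one issue per cycle
    return issue_cycles

prog = [("r1", ["r2", "r3"]),     # r1 defined here
        ("r4", ["r1", "r3"]),     # uses r1 -> RAW dependence, must stall
        ("r5", ["r2", "r3"])]     # independent, issues right after
print(schedule(prog))             # [0, 2, 3]
```

The second instruction would have issued in cycle 1, but waiting for r1 delays it to cycle 2; no flushing is needed, exactly as the Flushing slide states, because simply waiting resolves the conflict.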


More information

ECE/CS 250 Computer Architecture

ECE/CS 250 Computer Architecture ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck

More information

Contents. Chapter 3 Combinational Circuits Page 1 of 36

Contents. Chapter 3 Combinational Circuits Page 1 of 36 Chapter 3 Combinational Circuits Page of 36 Contents Combinational Circuits...2 3. Analysis of Combinational Circuits...3 3.. Using a Truth Table...3 3..2 Using a Boolean Function...6 3.2 Synthesis of

More information

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT

CSE. 1. In following code. addi. r1, skip1 xor in r2. r3, skip2. counter r4, top. taken): PC1: PC2: PC3: TTTTTT TTTTTT CSE 560 Practice Problem Set 4 Solution 1. In this question, you will examine several different schemes for branch prediction, using the following code sequence for a simple load store ISA with no branch

More information

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University

ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University ECEN 651: Microprogrammed Control of Digital Systems Department of Electrical and Computer Engineering Texas A&M University Prof. Mi Lu TA: Ehsan Rohani Laboratory Exercise #4 MIPS Assembly and Simulation

More information

Lecture 12: Energy and Power. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 12: Energy and Power. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 12: Energy and Power James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L12 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today a working understanding of

More information

CMPT-150-e1: Introduction to Computer Design Final Exam

CMPT-150-e1: Introduction to Computer Design Final Exam CMPT-150-e1: Introduction to Computer Design Final Exam April 13, 2007 First name(s): Surname: Student ID: Instructions: No aids are allowed in this exam. Make sure to fill in your details. Write your

More information

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates

ECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates ECE 250 / CPS 250 Computer Architecture Basics of Logic Design Boolean Algebra, Logic Gates Benjamin Lee Slides based on those from Andrew Hilton (Duke), Alvy Lebeck (Duke) Benjamin Lee (Duke), and Amir

More information

Lecture 12: Pipelined Implementations: Control Hazards and Resolutions

Lecture 12: Pipelined Implementations: Control Hazards and Resolutions 18-447 Lectre 12: Pipelined Implementations: Control Hazards and Resoltions S 09 L12-1 James C. Hoe Dept of ECE, CU arch 2, 2009 Annoncements: Spring break net week!! Project 2 de the week after spring

More information

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)

Issue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide) Out-of-order Pipeline Buffer of instructions Issue = Select + Wakeup Select N oldest, read instructions N=, xor N=, xor and sub Note: ma have execution resource constraints: i.e., load/store/fp Fetch Decode

More information

LABORATORY MANUAL MICROPROCESSOR AND MICROCONTROLLER

LABORATORY MANUAL MICROPROCESSOR AND MICROCONTROLLER LABORATORY MANUAL S u b j e c t : MICROPROCESSOR AND MICROCONTROLLER TE (E lectr onics) ( S e m V ) 1 I n d e x Serial No T i tl e P a g e N o M i c r o p r o c e s s o r 8 0 8 5 1 8 Bit Addition by Direct

More information

Pattern History Table. Global History Register. Pattern History Table. Branch History Pattern Pattern History Bits

Pattern History Table. Global History Register. Pattern History Table. Branch History Pattern Pattern History Bits An Enhanced Two-Level Adaptive Multiple Branch Prediction for Superscalar Processors Jong-bok Lee, Soo-Mook Moon and Wonyong Sung fjblee@mpeg,smoon@altair,wysung@dspg.snu.ac.kr School of Electrical Engineering,

More information

The Design Procedure. Output Equation Determination - Derive output equations from the state table

The Design Procedure. Output Equation Determination - Derive output equations from the state table The Design Procedure Specification Formulation - Obtain a state diagram or state table State Assignment - Assign binary codes to the states Flip-Flop Input Equation Determination - Select flipflop types

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Scalable Store-Load Forwarding via Store Queue Index Prediction

Scalable Store-Load Forwarding via Store Queue Index Prediction Scalable Store-Load Forwarding via Store Queue Index Prediction Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu addr addr addr (CAM) predictor

More information