CMP N 301 Computer Architecture. Appendix C

Outline
- Introduction
- Pipelining Hazards
- Pipelining Implementation
- Exception Handling
- Advanced Issues (Dynamic Scheduling, Out-of-Order Issue, Superscalar, etc.)

Pipelining: Introduction
- Pipelining is an implementation technique in which multiple instructions are overlapped in execution (the steps of instruction execution can run in parallel).
- Laundry analogy: you have 4 loads of clothes to wash (A, B, C, D).
- Steps (stages) required for each load: Wash, Dry, Fold, Store clothes into drawers.
- Each stage needs 30 minutes.
- We can't start the next step of a load until its previous step is finished.

Pipelining Example: Laundry
There are 2 approaches to this job:
- Sequential (non-pipelined): wait until the first load is put away before starting the next load.
- Pipelined (ASAP): as soon as the washer is empty, put in the next load while the first load goes into the dryer.

Pipelining Example: Laundry
Sequential laundry needs 8 hours for 4 loads.
[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D]

Pipelining Example: Laundry
Pipelined laundry: start work ASAP. Needs only 3.5 hours for 4 loads!
[Figure: timeline from 6 PM to 2 AM; task order A, B, C, D]
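The timeline figure can be approximated in text. The sketch below rebuilds the pipelined schedule from the numbers given in the slides (4 loads, 4 stages, 30 minutes per stage). The function name `pipelined_schedule` and the table layout are illustrative assumptions, not part of the course material.

```python
# Minimal sketch of the laundry pipeline diagram.
# Stage names, load names and the 30-minute slot come from the slides;
# the function name and the text layout are illustrative assumptions.
STAGES = ["Wash", "Dry", "Fold", "Store"]
LOADS = ["A", "B", "C", "D"]
SLOT_MINUTES = 30


def pipelined_schedule(loads, stages):
    """One row per load, one column per time slot; '-' means the load is idle."""
    n_slots = len(loads) + len(stages) - 1
    rows = []
    for i, _ in enumerate(loads):
        row = ["-"] * n_slots
        for s, stage in enumerate(stages):
            row[i + s] = stage  # load i enters stage s in slot i + s
        rows.append(row)
    return rows


schedule = pipelined_schedule(LOADS, STAGES)
for load, row in zip(LOADS, schedule):
    print(load, " ".join(f"{cell:<5}" for cell in row))

# Sequential: every load runs all stages alone. Pipelined: one column per slot.
sequential_hours = len(LOADS) * len(STAGES) * SLOT_MINUTES / 60
pipelined_hours = len(schedule[0]) * SLOT_MINUTES / 60
print(sequential_hours, pipelined_hours)  # 8.0 3.5
```

Each row is shifted one slot to the right of the row above it, which is why 4 loads need 4 + 4 - 1 = 7 slots instead of 16.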

CPU Pipelining
Review: the 5 stages of a MIPS instruction
1. Fetch the instruction from instruction memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address, depending on the instruction type
4. Access an operand from data memory
5. Write the result into a register
The clock cycle can be reduced to fit a single stage. Do you see an advantage of a load/store RISC architecture over CISC?
[Figure: a Load instruction over Cycles 1 to 5: Fetch, Reg/Dec, Exec, Mem, Wr]

CPU Pipelining
Registers should be added between stages to separate them and to stabilize the signals passed from one stage to the next.

Pipelining Example: Observations
- After the pipeline is filled, all stages operate concurrently.
- Pipelining
  - doesn't reduce the number of stages
  - doesn't help the latency of a single task
  - helps the throughput of the entire workload
- To pipeline a task, we need separate resources (tasks operating simultaneously use different resources).

Pipelining: Speedup
- Speedup due to pipelining depends on the number of stages in the pipeline.
- Ideal maximum speedup = number of stages (pipeline depth)
- Why the ideal speedup is never achieved:
  - Stages are usually not balanced, and the pipeline rate is limited by the slowest stage. If the dryer needs 45 min, every stage has to be given 45 min to accommodate it.
  - Pipelining overheads (such as inter-stage register delay)
  - Time to fill the pipeline and time to drain it
  - If one load depends on another, we have to wait (delay/stall for dependencies or hazards)
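To see how unbalanced stages and register overhead eat into the ideal speedup, here is a rough sketch. It looks only at steady state, so fill and drain time are ignored on purpose. The function name and the `register_overhead` parameter are hypothetical; the 30-minute and 45-minute stage times are the ones from the laundry example.

```python
def steady_state_speedup(stage_times, register_overhead=0.0):
    """Speedup of a full pipeline over sequential execution (fill/drain ignored).

    The pipeline cycle is set by the slowest stage plus any inter-stage
    register delay (register_overhead is an assumed, illustrative parameter).
    """
    cycle = max(stage_times) + register_overhead
    return sum(stage_times) / cycle


balanced = steady_state_speedup([30, 30, 30, 30])        # 120 / 30
slow_dryer = steady_state_speedup([30, 45, 30, 30])      # 135 / 45
with_overhead = steady_state_speedup([30, 30, 30, 30], register_overhead=5)

print(balanced, slow_dryer, round(with_overhead, 2))  # 4.0 3.0 3.43
```

Only the balanced, zero-overhead case reaches the pipeline depth of 4; the 45-minute dryer alone brings the limit down to 3.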

CPU Pipelining: Examples
Example (1): Textbook, page C-10
Example (2): Single-cycle, non-pipelined execution
- Total time for 3 instructions: 24 ns (8 ns each)
- Program execution order (in instructions): lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)
[Figure: time axis from 2 to 18 ns; each lw goes through Instruction fetch, Reg, ALU, Data access, Reg in 8 ns]

CPU Pipelining: Example
Single-cycle, pipelined execution
- Improves performance by increasing instruction throughput
- Total time for 3 instructions = 14 ns
- Each additional instruction adds 2 ns to the total execution time
- Stage time is limited by the slowest resource (2 ns)
- Assumptions: a write to a register occurs in the 1st half of the clock cycle; a read from a register occurs in the 2nd half
- Program execution order (in instructions): lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)
[Figure: time axis from 2 to 14 ns; each lw goes through Instruction fetch, Reg, ALU, Data access, Reg in 2 ns stages, and each lw starts 2 ns after the previous one]
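The two totals can be checked with a few lines of arithmetic. This sketch uses only the numbers from the two examples (8 ns per non-pipelined instruction, 5 stages, 2 ns per stage); the function names are illustrative.

```python
INSTRUCTION_TIME_NS = 8   # non-pipelined time per lw (from the example)
STAGE_TIME_NS = 2         # pipelined stage time, set by the slowest resource
N_STAGES = 5


def non_pipelined_ns(n_instructions):
    # Each instruction runs to completion before the next one starts.
    return n_instructions * INSTRUCTION_TIME_NS


def pipelined_ns(n_instructions):
    # The first instruction needs all 5 stages; each later one adds one stage time.
    return (N_STAGES + n_instructions - 1) * STAGE_TIME_NS


print(non_pipelined_ns(3), pipelined_ns(3))  # 24 14

# For a long instruction stream the fill time stops mattering and the
# ratio approaches 8 ns / 2 ns = 4, not the pipeline depth of 5.
ratio = non_pipelined_ns(1_000_000) / pipelined_ns(1_000_000)
```

With only 3 instructions the gain is 24/14, roughly 1.7, because filling the pipeline dominates; the benefit shows up on long instruction streams.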

Pipelining Hazards
Hazards: situations that prevent an instruction from being executed in its designated clock cycle.
Types of hazards:
1. Structural hazards: conflicts over resources
2. Data hazards: an instruction depends on the result of a previous instruction
3. Control hazards: an instruction changes the PC, as branches do
The simplest solution to hazards is to stall the pipeline (some instructions are allowed to proceed while others are delayed).
Speedup = pipeline depth / (1 + average stall cycles per instruction)
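A quick numeric check of the speedup formula, as a sketch. The depth of 5 matches the MIPS pipeline used in these slides; the stall rate of 0.25 cycles per instruction is a made-up value for illustration only.

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + average stall cycles per instruction)."""
    return depth / (1 + stall_cycles_per_instruction)


no_stalls = pipeline_speedup(5, 0)        # ideal case: speedup equals the depth
some_stalls = pipeline_speedup(5, 0.25)   # hypothetical: 1 stall every 4 instructions

print(no_stalls, some_stalls)  # 5.0 4.0
```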

Structural Hazards
- Requirement: no two instructions may be processed by the same module at the same time. A structural hazard occurs when two instructions need the same module in the same cycle.
- Solutions: stall the pipeline (insert bubbles) or duplicate the resource.
- Instruction fetch conflicts with data memory access
  - Use a Harvard architecture with two separate caches
- The register file is accessed by the instruction in decode (reading) and by the instruction in write-back (writing)
  - Ensure that writing is done in the first half of the clock cycle and reading in the second half

Solving Structural Hazards by Stalling
[Figure: pipeline diagram over clock cycles 1 to 7; instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4; a structural hazard is marked]

Solving Structural Hazards by Stalling
[Figure: pipeline diagram over clock cycles 1 to 7; instruction order Load, Instr 1, Instr 2, Stall, Instr 3; the stall row is filled with bubbles]
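The stall in the diagram can be reproduced with a small simulation. This is a sketch under stated assumptions: a single memory port shared by instruction fetch and the MEM stage, in-order issue, and MEM as the 4th of 5 stages. The function name `fetch_cycles` is hypothetical.

```python
def fetch_cycles(uses_data_memory):
    """Return the cycle in which each instruction is fetched.

    uses_data_memory[i] is True if instruction i accesses data memory in MEM.
    Assumption: one memory port, so a fetch cannot share a cycle with a MEM access.
    """
    port_busy = set()   # cycles already claimed by a MEM stage
    fetched = []
    cycle = 1
    for uses_memory in uses_data_memory:
        while cycle in port_busy:
            cycle += 1              # stall: a bubble enters the pipeline
        fetched.append(cycle)
        if uses_memory:
            port_busy.add(cycle + 3)  # MEM is 3 cycles after fetch
        cycle += 1
    return fetched


# Load, Instr 1, Instr 2, Instr 3, Instr 4 (only the load touches data memory)
print(fetch_cycles([True, False, False, False, False]))  # [1, 2, 3, 5, 6]
```

Instr 3 would have been fetched in cycle 4, the same cycle in which the load uses memory, so it slips to cycle 5. With separate instruction and data memories (`port_busy` always empty) the result would be `[1, 2, 3, 4, 5]`.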

Data Hazards
Pipelining might change the order in which operands are read and written relative to sequential execution.
[Figure: pipeline diagram over clock cycles, stages IF, ID/RF, EX, MEM, WB, for the instruction sequence:]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

Types of Data Hazards
- Read After Write (RAW): instruction J tries to read an operand before instruction I writes it (true data dependency)
    I: add r1,r2,r3
    J: sub r4,r1,r3
- Write After Read (WAR): instruction J writes an operand before instruction I reads it (anti-dependency)
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Write After Write (WAW): instruction J writes an operand before instruction I writes it (output dependency)
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
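The three definitions can be turned into a small checker. This is a sketch that assumes the three-register format used in the examples (`op dest,src1,src2`); the helper names `parse` and `dependences` are hypothetical. It reports the dependences between I and J in program order; whether one of them becomes an actual hazard depends on the pipeline.

```python
def parse(instruction):
    """Split 'op dest,src1,src2' into (dest, {sources}). Assumes that exact format."""
    _, operands = instruction.split(maxsplit=1)
    registers = [r.strip() for r in operands.split(",")]
    return registers[0], set(registers[1:])


def dependences(first, second):
    """Dependences of `second` (J) on `first` (I), with I earlier in program order."""
    dest_i, sources_i = parse(first)
    dest_j, sources_j = parse(second)
    found = []
    if dest_i in sources_j:
        found.append("RAW")   # J reads what I writes
    if dest_j in sources_i:
        found.append("WAR")   # J writes what I reads
    if dest_i == dest_j:
        found.append("WAW")   # J writes what I writes
    return found


print(dependences("add r1,r2,r3", "sub r4,r1,r3"))  # ['RAW']
print(dependences("sub r4,r1,r3", "add r1,r2,r3"))  # ['WAR']
print(dependences("sub r1,r4,r3", "add r1,r2,r3"))  # ['WAW']
```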

Data Hazards Solutions
- Writing in the first half of the cycle and reading in the second half mitigates the effect of data hazards.
- Solution (1): Stalling the pipeline: insert bubbles and stall the instruction until the data is written back
- Solution (2): Data forwarding (bypassing): forward the data to the requesting stage as soon as it is available
- Solution (3): Software solution: the compiler arranges instructions to avoid data dependencies

Solving Data Hazards by Forwarding
[Figure: pipeline diagram over clock cycles for the instruction sequence:]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11

HW Change for Forwarding
[Figure: datapath with forwarding paths; labels: NextPC, Registers, Immediate, ID/EX, mux, mux, EX/MEM, Data Memory, MEM/WR, mux]

Data Hazard Even with Forwarding
[Figure: pipeline diagram over clock cycles for the instruction sequence:]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9

Resolving the Load Data Hazard
The pipeline should be stalled until the data becomes available.
[Figure: pipeline diagram over clock cycles; a bubble is inserted into sub, and, and or after the load:]
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
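Why forwarding removes the stalls after `add` but not after `lw` comes down to when the value exists. The sketch below encodes that for the 5-stage pipeline (IF, ID, EX, MEM, WB), assuming the register file is written in the first half of a cycle and read in the second half. It treats one producer/consumer pair in isolation; the function name and parameters are hypothetical.

```python
def stall_cycles(producer_is_load, distance, forwarding):
    """Bubbles needed before a consumer that reads the producer's result.

    distance = 1 means the consumer is the very next instruction.
    Assumed timing, with the producer fetched in cycle p:
      - no forwarding: consumer ID (p + distance + 1) must not precede WB (p + 4)
      - forwarding, ALU result: ready after EX, needed in consumer EX
      - forwarding, load result: ready after MEM, needed in consumer EX
    """
    if not forwarding:
        required_gap = 3
    elif producer_is_load:
        required_gap = 2
    else:
        required_gap = 1
    return max(0, required_gap - distance)


# add r1,... followed immediately by sub ...,r1,...
print(stall_cycles(False, 1, forwarding=False))  # 2
print(stall_cycles(False, 1, forwarding=True))   # 0
# lw r1,... followed immediately by sub ...,r1,...
print(stall_cycles(True, 1, forwarding=True))    # 1
```

The single remaining bubble after the load is the one shown in the diagram: the loaded value only exists at the end of MEM, one cycle too late for the next instruction's EX stage.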

Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf
    SW  d,Rd
Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
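To compare the two orderings, one can count the load-use stalls in each. This sketch assumes a 5-stage pipeline with forwarding, where a load followed immediately by an instruction that reads the loaded register costs one bubble. As a simplification it treats every register read by the next instruction the same way, including the value stored by SW. The tuple encoding `(op, dest, sources)` and the function name are illustrative.

```python
# Each instruction is (opcode, destination register or None, source registers).
SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]


def load_use_stalls(program):
    """Count bubbles caused by a load whose result is read by the next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest in current[2]:
            stalls += 1
    return stalls


print(load_use_stalls(SLOW), load_use_stalls(FAST))  # 2 0
```

In the slow code, ADD directly follows the load of Rc and SUB directly follows the load of Rf. The fast code moves an independent instruction into each of those slots, so both stalls disappear without adding any instructions.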