ENEE350 Lecture Notes-Weeks 14 and 15


1 Pipelining & Amdahl's Law (ENEE350 Lecture Notes, Weeks 14 and 15). Pipelining is a method of processing in which a problem is divided into a number of subproblems, and the solutions of the subproblems for different instances of the problem are overlapped in time.

2 Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n
[Figure: a chain of four adders computes each a[i]; delay (D) registers on the inputs skew b[i], c[i], d[i], e[i], f[i] in time so that a new a[i] emerges every D time units once the pipeline is full.]
Each adder has delay D. Computation time = 4D + (n-1)D = nD + 3D.
Speed-up = 4nD / (3D + nD) -> 4 for large n.
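As a quick check of the timing formula above, here is a minimal Python sketch (the function names are my own) comparing the serial and pipelined completion times:

    def serial_time(n, D):
        """Four dependent additions per element, no overlap between elements."""
        return 4 * n * D

    def pipelined_time(n, D):
        """Fill the 4-stage adder pipeline, then one result per adder delay."""
        return 4 * D + (n - 1) * D

    for n in (1, 10, 1000):
        print(n, serial_time(n, 1.0) / pipelined_time(n, 1.0))   # approaches 4 as n grows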

3 We can describe the computation process in an n-segment pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying (draining) the pipeline.
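The three phases can be seen in a small simulation. The sketch below is my own illustration (a generic k-stage pipeline with made-up stage functions), not code from the course:

    def run_pipeline(stages, inputs):
        """Simulate a k-stage pipeline clock by clock: fill, steady state, drain."""
        k = len(stages)
        latches = [None] * k                  # value held at the output of each stage
        results = []
        for x in list(inputs) + [None] * k:   # the trailing None inputs drain the pipe
            if latches[-1] is not None:       # the last stage's output is a finished result
                results.append(latches[-1])
            for i in range(k - 1, 0, -1):     # advance values one stage, back to front
                latches[i] = stages[i](latches[i - 1]) if latches[i - 1] is not None else None
            latches[0] = stages[0](x) if x is not None else None
        return results

    # Example: four identical "adder" stages, each adding 1 to its input.
    print(run_pipeline([lambda v: v + 1] * 4, range(5)))   # [4, 5, 6, 7, 8]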

4 Example: Pipelined Ripple Adder
[Figure: a chain of full adders FA0, FA1, ..., FA(n-1) separated by delay (D) registers and driven by a common clock; the operand bit pairs u[i, j], v[i, j] of successive words enter in a time-skewed fashion, and the sums u + v of successive operand words emerge one per clock once the pipeline is full.]

5 Instruction pipelines. Goals: (i) to increase the throughput (number of instructions/sec) in executing programs; (ii) to reduce the execution time (clock cycles/instruction, etc.).

    clock   fetch   decode   execute
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2
    4               I4       I3

6 A 5-stage (MIPS) pipeline

    clock   fetch   decode   execute   memory   write back
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2        I1
    4       I5      I4       I3        I2       I1

7 Speed-up of pipelined execution of instructions over sequential execution:

    S(5) = T_u / T_p = (CPI_u × N_u / f_u) / (CPI_p × N_p / f_p)

N_u: the number of instructions executed by the serial system
N_p: the number of instructions executed by the pipelined system
CPI_u: number of clock cycles per instruction for the serial system
CPI_p: number of clock cycles per instruction for the pipelined system
f_u: clock frequency of the serial system
f_p: clock frequency of the pipelined system
Assuming that the serial and pipelined systems both operate at the same clock rate and execute the same number of instructions:

    S(5) = CPI_u / CPI_p

8 Example. Suppose that the instruction mix of programs executed on the serial and pipelined machines is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively. Then, under ideal conditions (no stalls due to hazards),

    S(5) = CPI_u / CPI_p = (0.4 × 4 + 0.2 × 2 + 0.4 × 4) / 1 = 3.6

If the clock period needs to be increased (i.e., the clock slowed down) for the pipelined implementation, then the speed-up will have to be scaled down accordingly using the formula on the previous slide.
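A two-line check of the weighted-average CPI (my own sketch, assuming the ideal pipelined CPI of 1 used above):

    # Weighted-average CPI of the unpipelined machine for the instruction mix above.
    mix = {"alu": (0.40, 4), "branch": (0.20, 2), "memory": (0.40, 4)}   # (fraction, cycles)
    cpi_u = sum(frac * cycles for frac, cycles in mix.values())
    cpi_p = 1.0                                   # ideal pipeline, no stalls
    print(cpi_u, cpi_u / cpi_p)                   # 3.6 and the resulting speed-up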

9 Instruction Pipelines MIPS (Hennessy & Patterson)

10 MIPS Pipeline
Register operations: IF ID EX WB
Register/memory operations: IF ID EX ME WB

11 Hazards
1. Structural hazards
2. Data hazards
3. Control hazards

12 Structural Hazards: they arise when limited resources are scheduled to operate concurrently on different streams during the same clock period. Example: a memory conflict (data fetch + instruction fetch) or a datapath conflict (arithmetic operation + PC update).

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

13 Fix: duplicate hardware (too expensive), or stall the pipeline (serialize the operation, too slow). The diagram below shows the stall case for the datapath conflict: no instruction can be fetched during a clock in which an earlier instruction occupies the EX stage.

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2            I2   I1
    3                 I2   I1
    4       I3             I2   I1
    5       I4   I3             I2
    6            I4   I3
    7                 I4   I3
    8       I5             I4   I3
    9       I6   I5             I4

14 Speed-up = T_serial / T_pipeline = 5n·t_s / (2n·t_s + 2t_s) for odd n, and 5n·t_s / (2n·t_s + 3t_s) for even n, which tends to 5/2 as the number of instructions, n, tends to infinity. Thus, we lose half the throughput due to stalls.
Note: the pipeline execution time can be computed using the recurrences

    T_1 = 4
    T_i = T_{i-1} + 1 for even i
    T_i = T_{i-1} + 3 for odd i

so T_1 = 4, T_2 = 4 + 1 = 5, T_3 = 5 + 3 = 8, T_4 = 8 + 1 = 9, T_5 = 9 + 3 = 12, T_6 = 12 + 1 = 13, ..., and T_n = 2n + 2 for odd n, with T_{n+1} = 2n + 2 + 1 = 2n + 3.
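A short check of the recurrence (my own sketch; it uses T_n + 1 as the total cycle count, since clock indices start at 0):

    # Finishing clock (index) of instruction n under the structural-hazard stalls above.
    def finish_clock(n):
        T = 4                                  # T_1 = 4: I1 occupies clocks 0..4
        for i in range(2, n + 1):
            T += 1 if i % 2 == 0 else 3        # +1 for even i, +3 for odd i
        return T

    for n in (1, 2, 3, 4, 5, 6, 1001):
        print(n, finish_clock(n), 5 * n / (finish_clock(n) + 1))   # speed-up tends to 5/2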

15 Data Hazards. They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) hazard (data dependency)
Write After Read (WAR) hazard (data anti-dependency)
Write After Write (WAW) hazard (output dependency)

16 RAW Hazards. They occur when reads are early and writes are late.

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

I2 reads R1 at clock 3, but I1 does not write R1 back until its WB stage at clock 4, so I2 uses a stale value.

17 RAW Hazards (cont'd). They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

With forwarding, the result computed by I1 in EX (clock 2) is passed directly from the pipeline latch to I2's EX stage at clock 3, rather than waiting for I1's write-back at clock 4.

18 WAR Hazards. They occur when writes are early and reads are late. Here each instruction performs two operations and passes through EX, ME, and WB twice:

    I1: R2 = R2 + R3; R9 = R3 + R4
    I2: R3 = R7 + R5; R6 = R2 + R8

    Clock   IF   ID   EX   ME   WB           EX          ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2 (Write)   I1 (Read)
    6       I7   I6   I5   I4   I3           I2          I1

At clock 5, I2 writes R3 in its first WB while the older instruction I1 is only now reading R3 in its second EX, so I1 may pick up the value written by I2.

19 Branch Prediction in Pipelined Instruction Sequencing. One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When the pipeline controller encounters a conditional branch instruction, it must choose which of two instruction streams to continue fetching. If the branch condition is met, execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction. As there are other instructions moving behind a conditional branch instruction, the system must be able to flush the pipeline in case the branch is mispredicted.

20 Example: suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

          LDI R0 = 20;
          JCD R0 < 10, add;
          SUB R0,R1;
          JMP D,halt;
    add:  ADD R0,R1;
    halt: HLT;

If the branch condition R0 < 10 holds, then the SUB instruction will have been incorrectly fetched during the second clock cycle, and we will have to execute another fetch cycle to fetch the ADD instruction.

21 Classification of branch prediction algorithms
Static branch prediction: the branch decision policy does not change over time; we use a fixed branching policy.
Dynamic branch prediction: the branch decision policy does change over time; we use a branching policy that varies over time.

22 Static Branch Prediction Algorithms
1. Don't predict (stall the pipeline)
2. Never take the branch
3. Always take the branch
4. Delayed branch

23 (1) Stall the pipeline by 1 clock cycle: this allows us to determine the target of the branch instruction.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB
    SUB          --   IF   ID   EX   ME   WB
    ADD          --   IF   ID   EX   ME   WB

The pipeline stalls for one cycle while the branch is decided, and then proceeds with one of the two instructions (SUB or ADD).

24 Pipeline execution speed (stall case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + branch penalty × branch frequency
                        = 1 + branch frequency          (the branch penalty is 1 cycle)

In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards.
Pros: straightforward to implement.
Cons: the time overhead is high when the instruction mix includes a high percentage of branch instructions.

25 (2) Never take the branch. The instruction in the pipeline is flushed if it is determined, after the ID stage, that the branch should have been taken.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB
    SUB          IF   ID   EX   ME   WB        (fall-through: execute this if the branch fails)
    IOR               IF   ID   EX   ME   WB
    XOR               IF   ID   EX   ME   WB   (branch target: fetched only if the branch is taken)

The SUB instruction is always fetched; then either it is decoded and executed next, or it is flushed and XOR (the branch target) is fetched and executed.

26 Pipeline execution speed (never-take-the-branch case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + branch penalty × branch frequency × misprediction rate
                        = 1 + branch frequency × misprediction rate

Pros: if the prediction is highly accurate, the pipeline can operate close to its full throughput.
Cons: the implementation is not as straightforward, and it requires flushing if decoding the branch address takes more than 1 clock cycle.

27 (3) Always take the branch. The instruction in the pipeline is flushed if it is determined, after the ID stage, that the branch should not have been taken.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB            (the target address is computed during ID)
    XOR          --   IF   ID   EX   ME   WB  (branch target: always fetched once its address is known)
    (The fall-through instructions SUB, IOR are fetched only if the branch turns out not to be taken.)

The XOR instruction is always fetched and then either it is decoded and executed next, or it is flushed and SUB is fetched and executed. An extra clock cycle is needed to set the PC back to PC + 1 during the EX cycle (because it was altered during the ID step to point to the XOR instruction) in case SUB must be fetched.

28 Pipeline execution speed (always-take-the-branch case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + (penalty when predicted correctly) × branch frequency × prediction rate
                            + (penalty when mispredicted) × branch frequency × misprediction rate
                        = 1 + branch frequency × prediction rate + 2 × branch frequency × misprediction rate

Pros: no clear advantage, other than being better suited to executing typical loops without the compiler's intervention (but this can generally be overcome; see the next slide).
Cons: the implementation is not as straightforward, and it has a higher misprediction penalty and an overall expected CPI that is worse than the stall method.
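The three static policies can be compared numerically. The sketch below is my own, using the penalties stated on these slides (1 idle cycle for the stall policy, 1 flush cycle on a misprediction for never-take, 1 or 2 cycles for always-take); the branch frequency and misprediction rate are illustrative values:

    def cpi_stall(bf):
        return 1 + bf                           # every branch costs one stall cycle

    def cpi_never_take(bf, mr):
        return 1 + bf * mr                      # pay only when the branch is actually taken

    def cpi_always_take(bf, mr):
        return 1 + bf * (1 - mr) + 2 * bf * mr  # 1 cycle if correct, 2 if mispredicted

    bf, mr = 0.20, 0.30                         # illustrative branch frequency / misprediction rate
    print(cpi_stall(bf), cpi_never_take(bf, mr), cpi_always_take(bf, mr))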

29 Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

Branch-always will not work well without the compiler's help:

          CLR R0;
    loop: JCD R0 >= 10, exit;
          LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JMP D,loop;
    exit:

Branch-always will work well with the compiler's help (the test is moved to the bottom of the loop, so the branch is almost always taken):

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

30 (4) Delayed branch: insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: the pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: it is not always possible to find a delay-slot instruction, in which case a NOP may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes compilers work harder.

31 Which instruction should be placed into the delayed branch slot?
4.1 Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:

          ADD R1,R2;
          JCD R2 > 10, exit;

can be rescheduled as

          JCD R2 > 10, exit;
          ADD R1,R2;       (delay slot)

32 4.2 Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:

          ADD R1,R2;
          JCD R2 > 10, sub;
          JMP D, add;
          ...
    sub:  SUB R4,R5;
    add:  ADI R3,5;

can be rescheduled as

          ADD R1,R2;
          JCD R2 > 10, sub;
          ADI R3,5;        (delay slot)
          ...
    sub:  SUB R4,R5;

33 4.3 Choose an instruction from the anti-target (fall-through) of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:

          ADD R1,R2;
          JCD R2 > 10, exit;
          ADD R3,R2;
    exit: SUB R4,R5;   // ADD R4,R3;

can be rescheduled as

          ADD R1,R2;
          JCD R2 > 10, exit;
          ADD R3,R2;       (delay slot: schedule for execution if it does not alter the program flow or output)
    exit: SUB R4,R5;

34 Dynamic Branch Prediction. Dynamic branch prediction relies on the history of how branch conditions were resolved in the past. The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of lower-order bits of the address of the branch instruction in the program space. The assumption is that the values in the lower address field are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program that remains within a block of 256 locations, 8 bits should suffice.
[Figure: a block of program memory from address x to x + 256 containing several JCD instructions, all indexed into the prediction buffer by their low-order address bits.]

35 Branch instructions in the instruction cache include a branch prediction field that is used to predict whether the branch should be taken.

    Memory location   Program              Branch prediction field
    x                 Branch instruction   0 (branch was not taken)
    x+4
    x+8               Branch instruction   0 (branch was not taken)
    x+12
    x+16
    x+20              Branch instruction   1 (branch was taken)

36 Branch prediction: in the simplest case, the field is a 1-bit tag:
    0 <=> the branch was not taken last time (state A)
    1 <=> the branch was taken last time (state B)
[State diagram: in state A, a taken branch moves the predictor to B and a not-taken branch keeps it in A; in state B, a not-taken branch moves it to A and a taken branch keeps it in B.]
While in state A, predict the branch as not taken; while in state B, predict the branch as taken.

37 This works relatively well: it accurately predicts the branches in a loop in all but two of the iterations.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

[Same 1-bit state diagram as on the previous slide.]
Assuming that we begin in state A, prediction fails when R0 = 1 (the branch is predicted not taken when it is actually taken) and when R0 = 10 (predicted taken when it is actually not taken).
Assuming that we begin in state B, prediction fails only when R0 = 10 (predicted taken when it is actually not taken).
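A small simulation of this 1-bit scheme (my own sketch; the outcome sequence, taken for R0 = 1..9 and not taken at R0 = 10, is read off the loop above):

    # 1-bit predictor: remember only the last outcome of the branch.
    outcomes = [True] * 9 + [False]              # taken for R0 = 1..9, not taken at R0 = 10

    def run_1bit(outcomes, state):               # state False = A (predict not taken), True = B
        misses = 0
        for taken in outcomes:
            if state != taken:                   # the prediction is simply the stored state
                misses += 1
            state = taken                        # update to the most recent outcome
        return misses

    print(run_1bit(outcomes, state=False))       # 2 mispredictions starting in state A
    print(run_1bit(outcomes, state=True))        # 1 misprediction starting in state B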

38 We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 >= 10, exit;
          JMP D,loop;
    exit:

[Same 1-bit state diagram.]
Assuming that we begin in state B, prediction fails when R0 = 1 (the branch is predicted taken when it is actually not taken) and when R0 = 10 (predicted not taken when it is actually taken).

39 What is worse is that we can make this branch prediction algorithm fail every time it makes a prediction:

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D,loop;
    neg:  LDI R0,-1;
          JMP D,loop;

[Same 1-bit state diagram.]
Assuming that we begin in state A, prediction fails when
    R0 = 1   (predicted not taken when it is actually taken),
    R0 = -1  (predicted taken when it is actually not taken),
    R0 = 1   (predicted not taken when it is actually taken),
    R0 = -1  (predicted taken when it is actually not taken),
and so on.

40 2-bit prediction (a more reluctant flip in the decision).
[State diagram: four states A1 (strongly not taken), A2 (weakly not taken), B2 (weakly taken), B1 (strongly taken); each taken outcome moves the predictor one state toward B1, and each not-taken outcome moves it one state toward A1.]
While in states A1 and A2, predict the branch as not taken; while in states B1 and B2, predict the branch as taken.

41
          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

[Same 2-bit state diagram as on the previous slide.]
Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (the branch is predicted not taken when it is actually taken) and when R0 = 10 (predicted taken when it is actually not taken).
Assuming that we begin in state B1, prediction fails only when R0 = 10 (predicted taken when it is actually not taken).

42 2-bit predictors are more resilient to branch inversions (the prediction is reversed only after it has missed twice):

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D,loop;
    neg:  LDI R0,-1;
          JMP D,loop;

[Same 2-bit state diagram.]
Assuming that we begin in state B1, prediction
    succeeds when R0 = 1   (predicted taken, actually taken),
    fails when R0 = -1     (predicted taken, actually not taken),
    succeeds when R0 = 1   (predicted taken, actually taken),
    fails when R0 = -1     (predicted taken, actually not taken),
and so on.
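The 2-bit behavior can be checked with a saturating counter, which is my own rendering of the four-state diagram (counter values 0 and 1 correspond to A1 and A2, values 2 and 3 to B2 and B1):

    def run_2bit(outcomes, state):
        """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
        misses = 0
        for taken in outcomes:
            if (state >= 2) != taken:
                misses += 1
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        return misses

    loop_branch = [True] * 9 + [False]     # JCD R0 < 10, loop for R0 = 1..10
    inverting = [True, False] * 5          # the alternating branch of slides 39 and 42

    print(run_2bit(loop_branch, 0), run_2bit(loop_branch, 3))   # 3 and 1 mispredictions
    print(run_2bit(inverting, 3))                               # misses only the not-taken outcomes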

43 Amdahl's Law (fixed-load speed-up). Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors (assuming a linear work function), p > 1. Then

    T(p) >= q T(1) + (1 - q) T(1) / p

    S(p) = T(1) / T(p) <= 1 / (q + (1 - q)/p) -> 1/q as p -> infinity

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially. The execution time of the work on p processors can be reduced to q T(1) under the best of circumstances, and the speed-up cannot exceed 1/q.
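A one-function rendering of the bound (my own sketch), evaluated with the serial fraction q = 0.75 that appears in the 4-processor bus example on the next slides:

    def amdahl_speedup(q, p):
        """Upper bound on speed-up when a fraction q of the work is serial."""
        return 1.0 / (q + (1.0 - q) / p)

    for p in (4, 8, 16, 1_000_000):
        print(p, round(amdahl_speedup(0.75, p), 3))   # 1.231, 1.28, 1.306, -> 1/q = 1.333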

44 Example. A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus, as shown below:
[Figure: four processors connected to a random access memory through a single shared bus.]

45 The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds.
So q = 30/(30 + 10) = 0.75, and

    S(4) = T(1)/T(4) = 1 / (0.75 + 0.25/4) = 4/3.25 = 1.23

46 Now suppose that the number of processors is doubled. Then

    S(8) = T(1)/T(8) = 1 / (0.75 + 0.25/8) = 8/6.25 = 1.28

Suppose that the number of processors is doubled again. Then

    S(16) = T(1)/T(16) = 1 / (0.75 + 0.25/16) = 16/12.25 ≈ 1.31

47 What is the limit?

    S(infinity) = lim_{p -> infinity} T(1)/T(p) = lim_{p -> infinity} 1 / (0.75 + 0.25/p) = 1/0.75 = 1.33

48 Alternate Forms of Amdahl's Law

    S = T(1) / (T_unenhanced + T_enhanced) = T(1) / (T(1)(q + (1 - q)/s)) = 1 / (q + (1 - q)/s) -> 1/q as s -> infinity

where s is the speed-up of the part of the computation that can be enhanced and q is the fraction that cannot be enhanced.

49 Example: suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you can expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 and s = 2, we have

    S = 1 / (0.8 + 0.2/2) = 1.11

Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!

50 Generalized Amdahl's Law. In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

    S(p_1, p_2, ..., p_k) = T(1) / T(p_1, p_2, ..., p_k)
                         <= T(1) / (q_1 T(1)/p_1 + q_2 T(1)/p_2 + ... + q_k T(1)/p_k)
                          = 1 / (q_1/p_1 + q_2/p_2 + ... + q_k/p_k)

where q_1 + q_2 + ... + q_k = 1. When k = 2, q_1 = q, q_2 = 1 - q, p_1 = 1, p_2 = p, this formula reduces to Amdahl's Law.
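A direct transcription of this formula (my own sketch); the two calls below reproduce the shared-bus example above and the floating-point upgrade example of slide 52:

    def generalized_amdahl(q, p):
        """Fractions q[i] of the work, each run with speed-up (or processor count) p[i]."""
        assert abs(sum(q) - 1.0) < 1e-9, "the fractions must sum to 1"
        return 1.0 / sum(qi / pi for qi, pi in zip(q, p))

    print(generalized_amdahl([0.75, 0.25], [1, 4]))        # 1.23: the 4-processor bus example
    print(generalized_amdahl([0.3, 0.2, 0.5], [1, 2, 1]))  # 1.11: doubling the floating-point unit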

51 Remark: the generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement (s_i) applied to different parts of a system:

    S_e(s_1, s_2, ..., s_k) = T(1) / T(s_1, s_2, ..., s_k)
                            = T(1) / (q_1 T(1)/s_1 + q_2 T(1)/s_2 + ... + q_k T(1)/s_k)
                            = 1 / (q_1/s_1 + q_2/s_2 + ... + q_k/s_k)

where q_1 + q_2 + ... + q_k = 1.

52 Example: suppose that your computer executes a program that has the following profile of execution: (a) 30% integer operations, (b) 20% floating-point operations, (c) 50% memory reference instructions. How much speed-up can you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

    S_e = 1 / (0.3/1 + 0.2/2 + 0.5/1) = 1/0.9 = 1.11

53 Example: suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require (a) 40% integer operations and (b) 60% floating-point operations. If every dollar spent on the integer unit after the first $50 decreases its execution time by 2%, and every dollar spent on the floating-point unit after the first $100 decreases its execution time by 1%, how would you spend the $500?

54 Example (continued):

    S = T(1) / (T_i(x_1) + T_f(x_2)),  where x_1 + x_2 = 350  (the $500 minus the $50 + $100 fixed costs)

    T_i(x_1) = (1 - 0.02) T_i(x_1 - 1)  =>  T_i(x_1) = 0.98^x_1 T_i(0)
    T_f(x_2) = (1 - 0.01) T_f(x_2 - 1)  =>  T_f(x_2) = 0.99^x_2 T_f(0)
    T_i(0) = 0.4 T(1),  T_f(0) = 0.6 T(1)

Substituting these into the generalized Amdahl speed-up expression gives:

    S = T(1) / (0.98^x_1 * 0.4 T(1) + 0.99^x_2 * 0.6 T(1)) = 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^x_2)

55 Example (continued): so we maximize 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^x_2) subject to x_1 + x_2 = 350, or equivalently maximize 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^(350 - x_1)) subject to 0 <= x_1 <= 350.

56 Example (continued): computing the values in the neighborhood of x_1 = 120 reveals that the speed-up is maximized when x_1 = 126. From Mathematica:

    Table[1/(0.4 * 0.98^x + 0.6 * 0.99^(350 - x)), {x, 120, 128, 1}]

The nine tabulated values all lie around 10.5 to 10.6 (the last, at x = 128, is 10.574), peaking at x = 126.
Note: it is possible to obtain a higher speed-up with all of the money invested in one of the units if the fixed cost of one of the units becomes sufficiently large.
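The same sweep in Python (my own sketch, assumed equivalent to the Mathematica call above): spend x dollars beyond the fixed costs on the integer unit and 350 - x on the floating-point unit, and pick the split that maximizes the speed-up.

    def speedup(x):
        """Speed-up when x of the 350 discretionary dollars go to the integer unit."""
        return 1.0 / (0.4 * 0.98 ** x + 0.6 * 0.99 ** (350 - x))

    best = max(range(351), key=speedup)
    print(best, round(speedup(best), 3))   # best split is x = 126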

57 Addendum: if the changes in performance due to upgrades are specified in terms of speed rather than time, we can use the following formulation. With t = L/s,

    dt/dx = (dt/ds)(ds/dx) = -(L/s^2)(ds/dx),  so  Delta t = -(L/s)(Delta s / s) = -t (Delta s / s)

Hence

    Delta t = T(x) - T(x-1) = -T(x-1) (Delta s / s),  i.e.  T(x) = (1 - Delta s / s) T(x-1)

where Delta s / s denotes the percentage change in speed.
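A quick numeric check of this linearization (my own illustration with made-up numbers): a small fractional increase in speed shrinks t = L/s by roughly the same fraction.

    L, s = 100.0, 2.0
    ds_over_s = 0.01                                   # a 1% speed increase
    exact = (L / (s * (1 + ds_over_s))) / (L / s)      # exact ratio of new time to old time
    approx = 1 - ds_over_s                             # the (1 - Delta s / s) factor used above
    print(exact, approx)                               # 0.990099... vs 0.99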
