ENEE350 Lecture Notes-Weeks 14 and 15


1 Pipelining & Amdahl's Law (ENEE350 Lecture Notes, Weeks 14 and 15). Pipelining is a method of processing in which a problem is divided into a number of subproblems, and the solutions of the subproblems for different instances of the problem are overlapped in time.

2 Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3, ..., n
[Figure: a chain of four adders computes each a[i]; delay (D) registers on the inputs skew b[i], c[i], d[i], e[i], f[i] in time so that a new a[i] emerges every D time units once the pipeline is full.]
Each adder has delay D. Computation time = 4D + (n-1)D = nD + 3D.
Speed-up = 4nD / (3D + nD) -> 4 for large n.
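As a quick check of the timing formula above, here is a minimal Python sketch (the function names are my own) comparing the serial and pipelined completion times:

    def serial_time(n, D):
        """Four dependent additions per element, no overlap between elements."""
        return 4 * n * D

    def pipelined_time(n, D):
        """Fill the 4-stage adder pipeline, then one result per adder delay."""
        return 4 * D + (n - 1) * D

    for n in (1, 10, 1000):
        print(n, serial_time(n, 1.0) / pipelined_time(n, 1.0))   # approaches 4 as n grows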

3 We can describe the computation process in an n-segment pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying (draining) the pipeline.
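The three phases can be seen in a small simulation. The sketch below is my own illustration (a generic k-stage pipeline with made-up stage functions), not code from the course:

    def run_pipeline(stages, inputs):
        """Simulate a k-stage pipeline clock by clock: fill, steady state, drain."""
        k = len(stages)
        latches = [None] * k                  # value held at the output of each stage
        results = []
        for x in list(inputs) + [None] * k:   # the trailing None inputs drain the pipe
            if latches[-1] is not None:       # the last stage's output is a finished result
                results.append(latches[-1])
            for i in range(k - 1, 0, -1):     # advance values one stage, back to front
                latches[i] = stages[i](latches[i - 1]) if latches[i - 1] is not None else None
            latches[0] = stages[0](x) if x is not None else None
        return results

    # Example: four identical "adder" stages, each adding 1 to its input.
    print(run_pipeline([lambda v: v + 1] * 4, range(5)))   # [4, 5, 6, 7, 8]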

4 Example: Pipelined Ripple Adder
[Figure: a chain of full adders FA0, FA1, ..., FA(n-1) separated by delay (D) registers and driven by a common clock; the operand bit pairs u[i, j], v[i, j] of successive words enter in a time-skewed fashion, and the sums u + v of successive operand words emerge one per clock once the pipeline is full.]

5 Instruction pipelines. Goals: (i) to increase the throughput (number of instructions/sec) in executing programs; (ii) to reduce the execution time (clock cycles/instruction, etc.).

    clock   fetch   decode   execute
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2
    4               I4       I3

6 A 5-stage (MIPS) pipeline

    clock   fetch   decode   execute   memory   write back
    0       I1
    1       I2      I1
    2       I3      I2       I1
    3       I4      I3       I2        I1
    4       I5      I4       I3        I2       I1

7 Speed-up of pipelined execution of instructions over sequential execution:

    S(5) = T_u / T_p = (CPI_u × N_u / f_u) / (CPI_p × N_p / f_p)

N_u: the number of instructions executed by the serial system
N_p: the number of instructions executed by the pipelined system
CPI_u: number of clock cycles per instruction for the serial system
CPI_p: number of clock cycles per instruction for the pipelined system
f_u: clock frequency of the serial system
f_p: clock frequency of the pipelined system
Assuming that the serial and pipelined systems both operate at the same clock rate and execute the same number of instructions:

    S(5) = CPI_u / CPI_p

8 Example. Suppose that the instruction mix of programs executed on the serial and pipelined machines is 40% ALU, 20% branching, and 40% memory, with 4, 2, and 4 cycles per instruction in the three classes, respectively. Then, under ideal conditions (no stalls due to hazards),

    S(5) = CPI_u / CPI_p = (0.4 × 4 + 0.2 × 2 + 0.4 × 4) / 1 = 3.6

If the clock period needs to be increased (i.e., the clock slowed down) for the pipelined implementation, then the speed-up will have to be scaled down accordingly using the formula on the previous slide.
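A two-line check of the weighted-average CPI (my own sketch, assuming the ideal pipelined CPI of 1 used above):

    # Weighted-average CPI of the unpipelined machine for the instruction mix above.
    mix = {"alu": (0.40, 4), "branch": (0.20, 2), "memory": (0.40, 4)}   # (fraction, cycles)
    cpi_u = sum(frac * cycles for frac, cycles in mix.values())
    cpi_p = 1.0                                   # ideal pipeline, no stalls
    print(cpi_u, cpi_u / cpi_p)                   # 3.6 and the resulting speed-up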

9 Instruction Pipelines MIPS (Hennessy & Patterson)

10 MIPS Pipeline
Register operations: IF ID EX WB
Register/memory operations: IF ID EX ME WB

11 Hazards
1. Structural hazards
2. Data hazards
3. Control hazards

12 Structural Hazards: they arise when limited resources are scheduled to operate concurrently on different streams during the same clock period. Example: a memory conflict (data fetch + instruction fetch) or a datapath conflict (arithmetic operation + PC update).

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

13 Fix: duplicate hardware (too expensive), or stall the pipeline (serialize the operation, too slow). The diagram below shows the stall case for the datapath conflict: no instruction can be fetched during a clock in which an earlier instruction occupies the EX stage.

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2            I2   I1
    3                 I2   I1
    4       I3             I2   I1
    5       I4   I3             I2
    6            I4   I3
    7                 I4   I3
    8       I5             I4   I3
    9       I6   I5             I4

14 Speed-up = T_serial / T_pipeline = 5n·t_s / (2n·t_s + 2t_s) for odd n, and 5n·t_s / (2n·t_s + 3t_s) for even n, which tends to 5/2 as the number of instructions, n, tends to infinity. Thus, we lose half the throughput due to stalls.
Note: the pipeline execution time can be computed using the recurrences

    T_1 = 4
    T_i = T_{i-1} + 1 for even i
    T_i = T_{i-1} + 3 for odd i

so T_1 = 4, T_2 = 4 + 1 = 5, T_3 = 5 + 3 = 8, T_4 = 8 + 1 = 9, T_5 = 9 + 3 = 12, T_6 = 12 + 1 = 13, ..., and T_n = 2n + 2 for odd n, with T_{n+1} = 2n + 2 + 1 = 2n + 3.
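A short check of the recurrence (my own sketch; it uses T_n + 1 as the total cycle count, since clock indices start at 0):

    # Finishing clock (index) of instruction n under the structural-hazard stalls above.
    def finish_clock(n):
        T = 4                                  # T_1 = 4: I1 occupies clocks 0..4
        for i in range(2, n + 1):
            T += 1 if i % 2 == 0 else 3        # +1 for even i, +3 for odd i
        return T

    for n in (1, 2, 3, 4, 5, 6, 1001):
        print(n, finish_clock(n), 5 * n / (finish_clock(n) + 1))   # speed-up tends to 5/2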

15 Data Hazards. They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) hazard (data dependency)
Write After Read (WAR) hazard (data anti-dependency)
Write After Write (WAW) hazard (output dependency)

16 RAW Hazards. They occur when reads are early and writes are late.

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

I2 reads R1 at clock 3, but I1 does not write R1 back until its WB stage at clock 4, so I2 uses a stale value.

17 RAW Hazards (cont'd). They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:

    I1: R1 = R1 + R2
    I2: R3 = R1 + R2

    Clock   IF   ID   EX   ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2
    6       I7   I6   I5   I4   I3

With forwarding, the result computed by I1 in EX (clock 2) is passed directly from the pipeline latch to I2's EX stage at clock 3, rather than waiting for I1's write-back at clock 4.

18 WAR Hazards. They occur when writes are early and reads are late. Here each instruction performs two operations and passes through EX, ME, and WB twice:

    I1: R2 = R2 + R3; R9 = R3 + R4
    I2: R3 = R7 + R5; R6 = R2 + R8

    Clock   IF   ID   EX   ME   WB           EX          ME   WB
    0       I1
    1       I2   I1
    2       I3   I2   I1
    3       I4   I3   I2   I1
    4       I5   I4   I3   I2   I1
    5       I6   I5   I4   I3   I2 (Write)   I1 (Read)
    6       I7   I6   I5   I4   I3           I2          I1

At clock 5, I2 writes R3 in its first WB while the older instruction I1 is only now reading R3 in its second EX, so I1 may pick up the value written by I2.

19 Branch Prediction in Pipelined Instruction Sequencing. One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When the pipeline controller encounters a conditional branch instruction, it must choose which of two instruction streams to continue fetching. If the branch condition is met, execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows the conditional branch instruction. As there are other instructions moving behind a conditional branch instruction, the system must be able to flush the pipeline in case the branch is mispredicted.

20 Example: suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

          LDI R0 = 20;
          JCD R0 < 10, add;
          SUB R0,R1;
          JMP D,halt;
    add:  ADD R0,R1;
    halt: HLT;

If the branch condition R0 < 10 holds, then the SUB instruction will have been incorrectly fetched during the second clock cycle, and we will have to execute another fetch cycle to fetch the ADD instruction.

21 Classification of branch prediction algorithms
Static branch prediction: the branch decision policy does not change over time; we use a fixed branching policy.
Dynamic branch prediction: the branch decision policy does change over time; we use a branching policy that varies over time.

22 Static Branch Prediction Algorithms
1. Don't predict (stall the pipeline)
2. Never take the branch
3. Always take the branch
4. Delayed branch

23 (1) Stall the pipeline by 1 clock cycle: this allows us to determine the target of the branch instruction.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB
    SUB          --   IF   ID   EX   ME   WB
    ADD          --   IF   ID   EX   ME   WB

The pipeline stalls for one cycle while the branch is decided, and then proceeds with one of the two instructions (SUB or ADD).

24 Pipeline execution speed (stall case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + branch penalty × branch frequency
                        = 1 + branch frequency          (the branch penalty is 1 cycle)

In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards.
Pros: straightforward to implement.
Cons: the time overhead is high when the instruction mix includes a high percentage of branch instructions.

25 (2) Never take the branch. The instruction in the pipeline is flushed if it is determined, after the ID stage, that the branch should have been taken.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB
    SUB          IF   ID   EX   ME   WB        (fall-through: execute this if the branch fails)
    IOR               IF   ID   EX   ME   WB
    XOR               IF   ID   EX   ME   WB   (branch target: fetched only if the branch is taken)

The SUB instruction is always fetched; then either it is decoded and executed next, or it is flushed and XOR (the branch target) is fetched and executed.

26 Pipeline execution speed (never-take-the-branch case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + branch penalty × branch frequency × misprediction rate
                        = 1 + branch frequency × misprediction rate

Pros: if the prediction is highly accurate, the pipeline can operate close to its full throughput.
Cons: the implementation is not as straightforward, and it requires flushing if decoding the branch address takes more than 1 clock cycle.

27 (3) Always take the branch. The instruction in the pipeline is flushed if it is determined, after the ID stage, that the branch should not have been taken.

    cycle:  0    1    2    3    4    5    6
    JCD     IF   ID   EX   ME   WB            (the target address is computed during ID)
    XOR          --   IF   ID   EX   ME   WB  (branch target: always fetched once its address is known)
    (The fall-through instructions SUB, IOR are fetched only if the branch turns out not to be taken.)

The XOR instruction is always fetched and then either it is decoded and executed next, or it is flushed and SUB is fetched and executed. An extra clock cycle is needed to set the PC back to PC + 1 during the EX cycle (because it was altered during the ID step to point to the XOR instruction) in case SUB must be fetched.

28 Pipeline execution speed (always-take-the-branch case). Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as

    CPI of the pipeline = CPI of the ideal pipeline + number of idle cycles per instruction
                        = 1 + (penalty when predicted correctly) × branch frequency × prediction rate
                            + (penalty when mispredicted) × branch frequency × misprediction rate
                        = 1 + branch frequency × prediction rate + 2 × branch frequency × misprediction rate

Pros: no clear advantage, other than being better suited to executing typical loops without the compiler's intervention (but this can generally be overcome; see the next slide).
Cons: the implementation is not as straightforward, and it has a higher misprediction penalty and an overall expected CPI that is worse than the stall method.
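The three static policies can be compared numerically. The sketch below is my own, using the penalties stated on these slides (1 idle cycle for the stall policy, 1 flush cycle on a misprediction for never-take, 1 or 2 cycles for always-take); the branch frequency and misprediction rate are illustrative values:

    def cpi_stall(bf):
        return 1 + bf                           # every branch costs one stall cycle

    def cpi_never_take(bf, mr):
        return 1 + bf * mr                      # pay only when the branch is actually taken

    def cpi_always_take(bf, mr):
        return 1 + bf * (1 - mr) + 2 * bf * mr  # 1 cycle if correct, 2 if mispredicted

    bf, mr = 0.20, 0.30                         # illustrative branch frequency / misprediction rate
    print(cpi_stall(bf), cpi_never_take(bf, mr), cpi_always_take(bf, mr))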

29 Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;

Branch-always will not work well without the compiler's help:

          CLR R0;
    loop: JCD R0 >= 10, exit;
          LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JMP D,loop;
    exit:

Branch-always will work well with the compiler's help (the test is moved to the bottom of the loop, so the branch is almost always taken):

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

30 (4) Delayed branch: insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: the pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: it is not always possible to find a delay-slot instruction, in which case a NOP may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes compilers work harder.

31 Which instruction should be placed into the delayed branch slot?
4.1 Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:

          ADD R1,R2;
          JCD R2 > 10, exit;

can be rescheduled as

          JCD R2 > 10, exit;
          ADD R1,R2;       (delay slot)

32 4.2 Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.
Example:

          ADD R1,R2;
          JCD R2 > 10, sub;
          JMP D, add;
          ...
    sub:  SUB R4,R5;
    add:  ADI R3,5;

can be rescheduled as

          ADD R1,R2;
          JCD R2 > 10, sub;
          ADI R3,5;        (delay slot)
          ...
    sub:  SUB R4,R5;

33 4.3 Choose an instruction from the anti-target (fall-through) of the branch, but make sure that the moved instruction is executable when the branch is taken.
Example:

          ADD R1,R2;
          JCD R2 > 10, exit;
          ADD R3,R2;
    exit: SUB R4,R5;   // ADD R4,R3;

can be rescheduled as

          ADD R1,R2;
          JCD R2 > 10, exit;
          ADD R3,R2;       (delay slot: schedule for execution if it does not alter the program flow or output)
    exit: SUB R4,R5;

34 Dynamic Branch Prediction. Dynamic branch prediction relies on the history of how branch conditions were resolved in the past. The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of lower-order bits of the address of the branch instruction in the program space. The assumption is that the values in the lower address field are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program that remains within a block of 256 locations, 8 bits should suffice.
[Figure: a block of program memory from address x to x + 256 containing several JCD instructions, all indexed into the prediction buffer by their low-order address bits.]

35 Branch instructions in the instruction cache include a branch prediction field that is used to predict whether the branch should be taken.

    Memory location   Program              Branch prediction field
    x                 Branch instruction   0 (branch was not taken)
    x+4
    x+8               Branch instruction   0 (branch was not taken)
    x+12
    x+16
    x+20              Branch instruction   1 (branch was taken)

36 Branch prediction: in the simplest case, the field is a 1-bit tag:
    0 <=> the branch was not taken last time (state A)
    1 <=> the branch was taken last time (state B)
[State diagram: in state A, a taken branch moves the predictor to B and a not-taken branch keeps it in A; in state B, a not-taken branch moves it to A and a taken branch keeps it in B.]
While in state A, predict the branch as not taken; while in state B, predict the branch as taken.

37 This works relatively well: it accurately predicts the branches in a loop in all but two of the iterations.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

[Same 1-bit state diagram as on the previous slide.]
Assuming that we begin in state A, prediction fails when R0 = 1 (the branch is predicted not taken when it is actually taken) and when R0 = 10 (predicted taken when it is actually not taken).
Assuming that we begin in state B, prediction fails only when R0 = 10 (predicted taken when it is actually not taken).
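A small simulation of this 1-bit scheme (my own sketch; the outcome sequence, taken for R0 = 1..9 and not taken at R0 = 10, is read off the loop above):

    # 1-bit predictor: remember only the last outcome of the branch.
    outcomes = [True] * 9 + [False]              # taken for R0 = 1..9, not taken at R0 = 10

    def run_1bit(outcomes, state):               # state False = A (predict not taken), True = B
        misses = 0
        for taken in outcomes:
            if state != taken:                   # the prediction is simply the stored state
                misses += 1
            state = taken                        # update to the most recent outcome
        return misses

    print(run_1bit(outcomes, state=False))       # 2 mispredictions starting in state A
    print(run_1bit(outcomes, state=True))        # 1 misprediction starting in state B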

38 We can modify the loop to make the branch prediction algorithm fail twice when we begin in state B as well.

          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 >= 10, exit;
          JMP D,loop;
    exit:

[Same 1-bit state diagram.]
Assuming that we begin in state B, prediction fails when R0 = 1 (the branch is predicted taken when it is actually not taken) and when R0 = 10 (predicted not taken when it is actually taken).

39 What is worse is that we can make this branch prediction algorithm fail every time it makes a prediction:

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D,loop;
    neg:  LDI R0,-1;
          JMP D,loop;

[Same 1-bit state diagram.]
Assuming that we begin in state A, prediction fails when
    R0 = 1   (predicted not taken when it is actually taken),
    R0 = -1  (predicted taken when it is actually not taken),
    R0 = 1   (predicted not taken when it is actually taken),
    R0 = -1  (predicted taken when it is actually not taken),
and so on.

40 2-bit prediction (a more reluctant flip in the decision).
[State diagram: four states A1 (strongly not taken), A2 (weakly not taken), B2 (weakly taken), B1 (strongly taken); each taken outcome moves the predictor one state toward B1, and each not-taken outcome moves it one state toward A1.]
While in states A1 and A2, predict the branch as not taken; while in states B1 and B2, predict the branch as taken.

41
          CLR R0;
    loop: LDD R1,R0;
          ADD R1,1;
          ST+ R1,R0;
          JCD R0 < 10, loop;

[Same 2-bit state diagram as on the previous slide.]
Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (the branch is predicted not taken when it is actually taken) and when R0 = 10 (predicted taken when it is actually not taken).
Assuming that we begin in state B1, prediction fails only when R0 = 10 (predicted taken when it is actually not taken).

42 2-bit predictors are more resilient to branch inversions (the prediction is reversed only after it has missed twice):

          LDI R0,1;
    loop: JCD R0 > 0, neg;
          LDI R0,1;
          JMP D,loop;
    neg:  LDI R0,-1;
          JMP D,loop;

[Same 2-bit state diagram.]
Assuming that we begin in state B1, prediction
    succeeds when R0 = 1   (predicted taken, actually taken),
    fails when R0 = -1     (predicted taken, actually not taken),
    succeeds when R0 = 1   (predicted taken, actually taken),
    fails when R0 = -1     (predicted taken, actually not taken),
and so on.
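The 2-bit behavior can be checked with a saturating counter, which is my own rendering of the four-state diagram (counter values 0 and 1 correspond to A1 and A2, values 2 and 3 to B2 and B1):

    def run_2bit(outcomes, state):
        """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken."""
        misses = 0
        for taken in outcomes:
            if (state >= 2) != taken:
                misses += 1
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        return misses

    loop_branch = [True] * 9 + [False]     # JCD R0 < 10, loop for R0 = 1..10
    inverting = [True, False] * 5          # the alternating branch of slides 39 and 42

    print(run_2bit(loop_branch, 0), run_2bit(loop_branch, 3))   # 3 and 1 mispredictions
    print(run_2bit(inverting, 3))                               # misses only the not-taken outcomes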

43 Amdahl's Law (fixed-load speed-up). Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors (assuming a linear work function), p > 1. Then

    T(p) >= q T(1) + (1 - q) T(1) / p

    S(p) = T(1) / T(p) <= 1 / (q + (1 - q)/p) -> 1/q as p -> infinity

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially. The execution time of the work on p processors can be reduced to q T(1) under the best of circumstances, and the speed-up cannot exceed 1/q.
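A one-function rendering of the bound (my own sketch), evaluated with the serial fraction q = 0.75 that appears in the 4-processor bus example on the next slides:

    def amdahl_speedup(q, p):
        """Upper bound on speed-up when a fraction q of the work is serial."""
        return 1.0 / (q + (1.0 - q) / p)

    for p in (4, 8, 16, 1_000_000):
        print(p, round(amdahl_speedup(0.75, p), 3))   # 1.231, 1.28, 1.306, -> 1/q = 1.333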

44 Example. A 4-processor computer executes instructions that are fetched from a random access memory over a shared bus, as shown below:
[Figure: four processors connected to a random access memory through a single shared bus.]

45 The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds.
So q = 30/(30 + 10) = 0.75, and

    S(4) = T(1)/T(4) = 1 / (0.75 + 0.25/4) = 4/3.25 = 1.23

46 Now suppose that the number of processors is doubled. Then

    S(8) = T(1)/T(8) = 1 / (0.75 + 0.25/8) = 8/6.25 = 1.28

Suppose that the number of processors is doubled again. Then

    S(16) = T(1)/T(16) = 1 / (0.75 + 0.25/16) = 16/12.25 ≈ 1.31

47 What is the limit?

    S(infinity) = lim_{p -> infinity} T(1)/T(p) = lim_{p -> infinity} 1 / (0.75 + 0.25/p) = 1/0.75 = 1.33

48 Alternate Forms of Amdahl's Law

    S = T(1) / (T_unenhanced + T_enhanced) = T(1) / (T(1)(q + (1 - q)/s)) = 1 / (q + (1 - q)/s) -> 1/q as s -> infinity

where s is the speed-up of the part of the computation that can be enhanced and q is the fraction that cannot be enhanced.

49 Example: suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you can expect in executing a typical program, assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, and (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 and s = 2, we have

    S = 1 / (0.8 + 0.2/2) = 1.11

Very disappointing, as you are likely to have paid quite a bit of money for the upgrade!

50 Generalized Amdahl's Law. In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:

    S(p_1, p_2, ..., p_k) = T(1) / T(p_1, p_2, ..., p_k)
                         <= T(1) / (q_1 T(1)/p_1 + q_2 T(1)/p_2 + ... + q_k T(1)/p_k)
                          = 1 / (q_1/p_1 + q_2/p_2 + ... + q_k/p_k)

where q_1 + q_2 + ... + q_k = 1. When k = 2, q_1 = q, q_2 = 1 - q, p_1 = 1, p_2 = p, this formula reduces to Amdahl's Law.
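A direct transcription of this formula (my own sketch); the two calls below reproduce the shared-bus example above and the floating-point upgrade example of slide 52:

    def generalized_amdahl(q, p):
        """Fractions q[i] of the work, each run with speed-up (or processor count) p[i]."""
        assert abs(sum(q) - 1.0) < 1e-9, "the fractions must sum to 1"
        return 1.0 / sum(qi / pi for qi, pi in zip(q, p))

    print(generalized_amdahl([0.75, 0.25], [1, 4]))        # 1.23: the 4-processor bus example
    print(generalized_amdahl([0.3, 0.2, 0.5], [1, 2, 1]))  # 1.11: doubling the floating-point unit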

51 Remark: the generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement (s_i) applied to different parts of a system:

    S_e(s_1, s_2, ..., s_k) = T(1) / T(s_1, s_2, ..., s_k)
                            = T(1) / (q_1 T(1)/s_1 + q_2 T(1)/s_2 + ... + q_k T(1)/s_k)
                            = 1 / (q_1/s_1 + q_2/s_2 + ... + q_k/s_k)

where q_1 + q_2 + ... + q_k = 1.

52 Example: suppose that your computer executes a program that has the following profile of execution: (a) 30% integer operations, (b) 20% floating-point operations, (c) 50% memory reference instructions. How much speed-up can you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

    S_e = 1 / (0.3/1 + 0.2/2 + 0.5/1) = 1/0.9 = 1.11

53 Example: suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require (a) 40% integer operations and (b) 60% floating-point operations. If every dollar spent on the integer unit after the first $50 decreases its execution time by 2%, and every dollar spent on the floating-point unit after the first $100 decreases its execution time by 1%, how would you spend the $500?

54 Example (continued):

    S = T(1) / (T_i(x_1) + T_f(x_2)),  where x_1 + x_2 = 350  (the $500 minus the $50 + $100 fixed costs)

    T_i(x_1) = (1 - 0.02) T_i(x_1 - 1)  =>  T_i(x_1) = 0.98^x_1 T_i(0)
    T_f(x_2) = (1 - 0.01) T_f(x_2 - 1)  =>  T_f(x_2) = 0.99^x_2 T_f(0)
    T_i(0) = 0.4 T(1),  T_f(0) = 0.6 T(1)

Substituting these into the generalized Amdahl speed-up expression gives:

    S = T(1) / (0.98^x_1 * 0.4 T(1) + 0.99^x_2 * 0.6 T(1)) = 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^x_2)

55 Example (continued): so we maximize 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^x_2) subject to x_1 + x_2 = 350, or equivalently maximize 1 / (0.4 * 0.98^x_1 + 0.6 * 0.99^(350 - x_1)) subject to 0 <= x_1 <= 350.

56 Example (continued): computing the values in the neighborhood of x_1 = 120 reveals that the speed-up is maximized when x_1 = 126. From Mathematica:

    Table[1/(0.4 * 0.98^x + 0.6 * 0.99^(350 - x)), {x, 120, 128, 1}]

The nine tabulated values all lie around 10.5 to 10.6 (the last, at x = 128, is 10.574), peaking at x = 126.
Note: it is possible to obtain a higher speed-up with all of the money invested in one of the units if the fixed cost of one of the units becomes sufficiently large.
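The same sweep in Python (my own sketch, assumed equivalent to the Mathematica call above): spend x dollars beyond the fixed costs on the integer unit and 350 - x on the floating-point unit, and pick the split that maximizes the speed-up.

    def speedup(x):
        """Speed-up when x of the 350 discretionary dollars go to the integer unit."""
        return 1.0 / (0.4 * 0.98 ** x + 0.6 * 0.99 ** (350 - x))

    best = max(range(351), key=speedup)
    print(best, round(speedup(best), 3))   # best split is x = 126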

57 Addendum: if the changes in performance due to upgrades are specified in terms of speed rather than time, we can use the following formulation. With t = L/s,

    dt/dx = (dt/ds)(ds/dx) = -(L/s^2)(ds/dx),  so  Delta t = -(L/s)(Delta s / s) = -t (Delta s / s)

Hence

    Delta t = T(x) - T(x-1) = -T(x-1) (Delta s / s),  i.e.  T(x) = (1 - Delta s / s) T(x-1)

where Delta s / s denotes the percentage change in speed.
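A quick numeric check of this linearization (my own illustration with made-up numbers): a small fractional increase in speed shrinks t = L/s by roughly the same fraction.

    L, s = 100.0, 2.0
    ds_over_s = 0.01                                   # a 1% speed increase
    exact = (L / (s * (1 + ds_over_s))) / (L / s)      # exact ratio of new time to old time
    approx = 1 - ds_over_s                             # the (1 - Delta s / s) factor used above
    print(exact, approx)                               # 0.990099... vs 0.99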
