Origami: Folding Warps for Energy Efficient GPUs


1 Origami: Folding Warps for Energy Efficient GPUs. Mohammad Abdel-Majeed*, Daniel Wong, Justin Huang and Murali Annavaram*. *University of Southern California; University of California, Riverside; Stanford University

2 Outline: GPU overview; Motivation and related work; Warp Folding; Origami Scheduler; Evaluation

3 GPGPU Overview (GTX480) [Diagram of one SM: Instruction Cache, Fetch and decode, Warp Scheduler (2-level), Register File 128KB, Operands, Execution Units (INT Unit, FP Unit, SFUs, LD/ST units), Result Queue, 64KB shared Memory/L1 cache.]

5 GPGPU Power Break-Down [Chart categories: DRAM, EXE, RF, L, MC, Pipeline, Constant, NOC, Other.] EXE: 20.1%. (GPUWattch, ISCA)

7 GPU Scaling Trend [Table; columns: Fermi GTX 480, Kepler GTX 680, Maxwell GTX 980. Rows: Cores (SMs); Execution Units; RF size: 128KB/SM, 256KB/SM, 256KB/SM; #transistors: 3 billion, 3.5 billion, 5.2 billion.]

12 Technology Scaling: As technology scales, leakage power will increase; it accounts for 50% of the execution units' power. Power gating can be used to reduce the leakage power, but it needs long idle periods to be effective. (Warped Gates, MICRO)

13 Power Gating Challenges in GPGPUs [Chart: INT unit idle period length distribution for hotspot, assuming a 5-cycle idle detect and a 14-cycle breakeven time (BET). Regions: Lost Opportunity, Energy Loss or Neutral, Energy Savings.] Need to increase idle period length.
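
The three regions on this chart can be sketched as a simple classification of idle-period lengths. This is a minimal illustration, not the authors' model: the thresholds assume that a unit is gated only after the idle-detect window has elapsed and that gating pays off only once the unit has then stayed gated for longer than the breakeven time. The sample lengths are made up.

```python
from collections import Counter

IDLE_DETECT = 5   # cycles, from the slide
BET = 14          # breakeven time in cycles, from the slide


def classify_idle_period(length, idle_detect=IDLE_DETECT, bet=BET):
    """Classify one idle period (in cycles) into the three chart regions.

    Assumed thresholds (not stated on the slide): gating starts after
    `idle_detect` idle cycles and breaks even after `bet` gated cycles.
    """
    if length < idle_detect:
        return "lost opportunity"        # too short to ever trigger gating
    if length <= idle_detect + bet:
        return "energy loss or neutral"  # gated, but woken at or before break-even
    return "energy savings"              # stayed gated past the break-even point


def summarize(idle_lengths):
    """Count how many idle periods fall into each region."""
    return Counter(classify_idle_period(n) for n in idle_lengths)


# Hypothetical idle-period lengths, for illustration only.
sample = [1, 2, 4, 5, 12, 19, 20, 40]
print(summarize(sample))
```

Under these assumptions only periods longer than 19 cycles save energy, which is why the slide concludes that idle periods need to be lengthened.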

19 Warp Scheduler Effect on Power Gating [Animation: ready warps carrying a mix of INT and FP instructions are issued to the INT and FP units; a timeline marks each unit as Busy or Idle.] Idle periods are interrupted by instructions that are greedily scheduled. Need to coalesce warp issues by resource type.
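
A toy model can show why coalescing issues by resource type helps. The sketch below assumes one issue per cycle and ignores dependencies and latencies; the instruction stream and the helper names `idle_runs` and `group_by_type` are hypothetical, and the stable sort is only a stand-in for a type-aware scheduler.

```python
from itertools import groupby


def idle_runs(issue_order, unit):
    """Lengths of consecutive cycles in which `unit` receives no instruction."""
    return [
        len(list(run))
        for busy, run in groupby(issue_order, key=lambda t: t == unit)
        if not busy
    ]


def group_by_type(issue_order):
    """Reorder a window of ready instructions so same-type issues are adjacent.

    The stable sort keeps the original order within each type (INT first).
    """
    return sorted(issue_order, key=lambda t: t != "INT")


# Hypothetical one-issue-per-cycle stream of ready instructions.
greedy = ["INT", "FP", "INT", "INT", "FP", "INT", "FP", "FP"]
grouped = group_by_type(greedy)

print(idle_runs(greedy, "FP"))   # idle periods seen by the FP unit, greedy order
print(idle_runs(grouped, "FP"))  # idle periods seen by the FP unit, grouped order
```

In the greedy order the FP unit sees several one- or two-cycle gaps; after grouping, the same idle cycles form a single longer period.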

22 Related Work/Warped-Gates*: Schedule instructions based on their type. Force power-gated units to stay in the power-gated state for at least the breakeven time. [Chart: frequency (0% to 30%) versus idle period length; region labels 54.3%, 0.0%, 45.7%.] *Warped-Gates, MICRO

24 Fine grain idleness: Temporal idleness. Infrequent issues to the same pipeline, finely interspersed, leading to limited power gating opportunities. [Animation: the scheduler issues a mix of INT and FP instructions to SP0; timeline labels Cycle X, Cycle X+1 (Bubble), Cycle X, Cycle X+3 (Bubble) for the INT and FP pipelines.]

33 Fine grain idleness: Spatial idleness. Lanes have different activity: branch divergence, insufficient parallelism.

34 Warp Folding: Improve the power gating potential by coalescing the pipeline bubbles.

35 Warp Folding [Animation: the Scheduler, the Ready Warps Queue, the Issued Warps with their Active Masks, and a Bubble; an issued warp is shown split into Sub_Warp0 and Sub_Warp1.]
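
The animation's active-mask values did not survive transcription, so the following is only a sketch of what folding could look like on a bit mask. It assumes a 32-thread warp, that Sub_Warp0 holds the lower 16 lanes and Sub_Warp1 the upper 16, and that both halves are issued back to back on the lower half of the SIMT lanes. The function names are hypothetical.

```python
WARP_SIZE = 32


def fold_warp(active_mask, warp_size=WARP_SIZE):
    """Split a warp's active mask into two half-width sub-warp masks.

    Assumption for illustration: Sub_Warp0 is the lower half of the lanes,
    Sub_Warp1 the upper half, and both are issued on the lower half of the
    SIMT lanes in consecutive cycles.
    """
    half = warp_size // 2
    low_bits = (1 << half) - 1
    sub_warp0 = active_mask & low_bits
    sub_warp1 = (active_mask >> half) & low_bits
    return sub_warp0, sub_warp1


def lanes_used(masks):
    """Union of the physical lanes touched while issuing the given masks."""
    used = 0
    for mask in masks:
        used |= mask
    return used


full = 0xFFFFFFFF  # all 32 threads active
sub0, sub1 = fold_warp(full)
print(f"{sub0:#06x} {sub1:#06x}")           # the two 16-lane issues
print(f"{lanes_used([sub0, sub1]):#010x}")  # lanes that were ever busy
```

In this model the folded warp takes two issue slots instead of one, filling what would have been a bubble, while the upper 16 lanes stay idle for both cycles.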

45 Folding Granularity [Animation: Scheduler, Ready Warps Queue, Issued Warps, Active Mask, Sub_Warp0, Sub_Warp1, shown for two folding granularities.] First granularity: Simple; - High wiring overhead; - Delay. Second granularity: Simple; + Low wiring overhead; + Small delay; + Support for lane shuffling.

57 Warp Folding Pipeline [Diagram: the baseline SP Pipe and Result Collector, then the folding pipeline with Shifting Logic, SP Pipe, Re-Shifting Logic, Selective Write and Result Collector.]

60 Example [Animation over the folding pipeline: SP Pipe, Shifting Logic, Re-Shifting Logic, Result Collector.]

70 Origami scheduler: Improve the power gating potential by coalescing warps based on thread utilization and instruction type.

71 Origami scheduler: Group the warps based on their active masks. One group holds the warps whose active mask has fewer than 32 active threads; the other group holds the warps with all 32 threads active. [Animation: lane occupancy for cycle x through cycle x+5, divided into a "less than 32" group and an "equal to 32" group.] Active masks are not aligned!
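
The grouping rule can be sketched in a few lines. The representation of ready warps as `(warp_id, active_mask)` pairs and the function names are assumptions made for this example; the slides do not describe the hardware structure.

```python
WARP_SIZE = 32


def active_threads(mask):
    """Number of set bits in an active mask."""
    return bin(mask).count("1")


def split_by_utilization(ready_warps, warp_size=WARP_SIZE):
    """Partition (warp_id, active_mask) pairs into the two groups on the slide.

    Returns (partial, full): warp ids with fewer than `warp_size` active
    threads, and warp ids with all threads active.
    """
    partial, full = [], []
    for warp_id, mask in ready_warps:
        if active_threads(mask) == warp_size:
            full.append(warp_id)
        else:
            partial.append(warp_id)
    return partial, full


# Hypothetical ready queue.
ready = [(0, 0xFFFFFFFF), (1, 0x0000FFFF), (2, 0xFFFFFFFF), (3, 0x00000001)]
print(split_by_utilization(ready))
```

Grouping alone does not line up the idle lanes of the partially active warps, because their masks can have active threads in different positions.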

77 Lane Shifting: Shift the threads to the lower-order SIMT lanes. Done at the cluster level to reduce overhead. [Figure: lane occupancy for cycle x through cycle x+5 in the "less than 32" and "equal to 32" groups.]
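
One possible reading of cluster-level shifting is that whole clusters of lanes, rather than individual threads, are moved toward the low-order lanes. The sketch below follows that reading. The cluster size of 4 lanes, the packing order and both function names are assumptions for illustration; the slides do not specify them.

```python
def shift_clusters(active_mask, warp_size=32, cluster_size=4):
    """Pack non-empty lane clusters into the lowest-order cluster slots.

    Returns the shifted mask and, for each destination slot, the index of
    the source cluster, so that results can be moved back afterwards.
    `cluster_size=4` is a hypothetical value chosen for the example.
    """
    cluster_bits = (1 << cluster_size) - 1
    shifted, sources = 0, []
    for src in range(warp_size // cluster_size):
        bits = (active_mask >> (src * cluster_size)) & cluster_bits
        if bits:
            shifted |= bits << (len(sources) * cluster_size)
            sources.append(src)
    return shifted, sources


def unshift_clusters(shifted_mask, sources, cluster_size=4):
    """Undo shift_clusters: move each cluster back to its source position."""
    cluster_bits = (1 << cluster_size) - 1
    original = 0
    for dst, src in enumerate(sources):
        bits = (shifted_mask >> (dst * cluster_size)) & cluster_bits
        original |= bits << (src * cluster_size)
    return original


mask = 0xF0300A00  # active threads scattered over clusters 2, 5 and 7
shifted, sources = shift_clusters(mask)
print(f"{shifted:#010x}", sources)
print(f"{unshift_clusters(shifted, sources):#010x}")
```

Moving clusters instead of single threads needs only a cluster-to-cluster mapping, which is one way the overhead mentioned on the slide could be kept small.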

79 Origami Scheduling: Runtime warp folding algorithm: folds warps long enough to guarantee savings, $N_{phase} = N_{pipelineflush} + N_{idledetect} + N_{breakeventime}$. Adaptive folding: aggressive folding for warps with lower instruction count; conservative folding for warps with higher instruction count; change folding frequency based on application utilization. See paper for more detail!
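
As a worked instance of the phase-length formula, the sketch below plugs in the idle-detect and breakeven values from the evaluation setup (5 and 14 cycles). The pipeline-flush length is not given on the slides, so the value 4 is purely hypothetical, and the `should_fold` rule is an assumed reading of "folds warps long enough to guarantee savings".

```python
IDLE_DETECT = 5   # cycles, from the evaluation parameters
BREAKEVEN = 14    # cycles, from the evaluation parameters


def min_fold_phase(pipeline_flush, idle_detect=IDLE_DETECT, breakeven=BREAKEVEN):
    """N_phase = N_pipelineflush + N_idledetect + N_breakeventime, in cycles."""
    return pipeline_flush + idle_detect + breakeven


def should_fold(expected_fold_cycles, pipeline_flush):
    """Fold only if the folded phase is expected to last at least N_phase cycles."""
    return expected_fold_cycles >= min_fold_phase(pipeline_flush)


PIPELINE_FLUSH = 4  # hypothetical value; the slides do not state it
print(min_fold_phase(PIPELINE_FLUSH))
print(should_fold(20, PIPELINE_FLUSH), should_fold(30, PIPELINE_FLUSH))
```

With these numbers a folding phase shorter than 23 cycles would not cover the flush, the idle-detect window and the breakeven time, so it would not be started.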

80 EVALUATION

81 Evaluation Methodology: GPGPU-Sim v3.0.2, Nvidia GTX480. GPUWattch and McPAT for energy and area estimation. Benchmarks from ISPASS, Rodinia and Parboil. Power gating parameters: wakeup delay 3 cycles; breakeven time 14 cycles; idle detect 5 cycles.

82 Folding Ratio: Folding frequency is application dependent.

85 Energy Savings/INT: Eliminates negative energy savings. The Origami scheduler is able to amplify the folding benefits. Origami is able to save 49%.

87 Energy Savings/FP: Eliminates negative energy savings. The Origami scheduler is able to amplify the folding benefits. Origami is able to save 46%.

89 Performance: Origami significantly reduces the performance overhead compared with Warped-Gates. The Origami scheduler has a positive impact on performance for some workloads.

91 Conclusion: Execution unit energy efficiency is critical. Take advantage of spatial and temporal idleness to improve the power gating potential. Warp folding: adaptively fold warps to coalesce bubbles. Origami scheduler: schedule warps based on thread activity and instruction type. Able to save 49% (INT) and 46% (FP) of the execution units' leakage energy. Negligible performance overhead.

92 Questions?

93 THANK YOU!


More information

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions

Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions In the Proceedings of the International Conference on Dependable Systems and Networks(DSN 07), June 2007. Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions Xiaodong Li,

More information

Name: Answers. Mean: 83, Standard Deviation: 12 Q1 Q2 Q3 Q4 Q5 Q6 Total. ESE370 Fall 2015

Name: Answers. Mean: 83, Standard Deviation: 12 Q1 Q2 Q3 Q4 Q5 Q6 Total. ESE370 Fall 2015 University of Pennsylvania Department of Electrical and System Engineering Circuit-Level Modeling, Design, and Optimization for Digital Systems ESE370, Fall 2015 Final Tuesday, December 15 Problem weightings

More information

Synthesizing Asynchronous Burst-Mode Machines without the Fundamental-Mode Timing Assumption

Synthesizing Asynchronous Burst-Mode Machines without the Fundamental-Mode Timing Assumption Synthesizing synchronous urst-mode Machines without the Fundamental-Mode Timing ssumption Gennette Gill Montek Singh Univ. of North Carolina Chapel Hill, NC, US Contribution Synthesize robust asynchronous

More information

Digital Systems. Validation, verification. R. Pacalet January 4, 2018

Digital Systems. Validation, verification. R. Pacalet January 4, 2018 Digital Systems Validation, verification R. Pacalet January 4, 2018 2/98 Simulation Extra design tasks Reference model Simulation environment A simulation cannot be exhaustive Can discover a bug Cannot

More information

Digital Integrated Circuits A Design Perspective. Semiconductor. Memories. Memories

Digital Integrated Circuits A Design Perspective. Semiconductor. Memories. Memories Digital Integrated Circuits A Design Perspective Semiconductor Chapter Overview Memory Classification Memory Architectures The Memory Core Periphery Reliability Case Studies Semiconductor Memory Classification

More information

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013 Lecture 3 Real-Time Scheduling Daniel Kästner AbsInt GmbH 203 Model-based Software Development 2 SCADE Suite Application Model in SCADE (data flow + SSM) System Model (tasks, interrupts, buses, ) SymTA/S

More information

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.

Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. 1 2 Introduction Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. Defines the precise instants when the circuit is allowed to change

More information

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ CS-206 Concurrency Lecture 13 Wrap Up Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ Created by Nooshin Mirzadeh, Georgios Psaropoulos and Babak Falsafi EPFL Copyright 2015 EPFL CS-206 Spring

More information

Worst-Case Execution Time Analysis. LS 12, TU Dortmund

Worst-Case Execution Time Analysis. LS 12, TU Dortmund Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper

More information

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

ESE 570: Digital Integrated Circuits and VLSI Fundamentals ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 19: March 29, 2018 Memory Overview, Memory Core Cells Today! Charge Leakage/Charge Sharing " Domino Logic Design Considerations! Logic Comparisons!

More information

Instruction Set Extensions for Reed-Solomon Encoding and Decoding

Instruction Set Extensions for Reed-Solomon Encoding and Decoding Instruction Set Extensions for Reed-Solomon Encoding and Decoding Suman Mamidi and Michael J Schulte Dept of ECE University of Wisconsin-Madison {mamidi, schulte}@caewiscedu http://mesaecewiscedu Daniel

More information

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power

BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing:

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

Simple Instruction-Pipelining (cont.) Pipelining Jumps

Simple Instruction-Pipelining (cont.) Pipelining Jumps 6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps

More information

arxiv: v1 [hep-lat] 7 Oct 2010

arxiv: v1 [hep-lat] 7 Oct 2010 arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA

More information

Technical Report GIT-CERCS. Thermal Field Management for Many-core Processors

Technical Report GIT-CERCS. Thermal Field Management for Many-core Processors Technical Report GIT-CERCS Thermal Field Management for Many-core Processors Minki Cho, Nikhil Sathe, Sudhakar Yalamanchili and Saibal Mukhopadhyay School of Electrical and Computer Engineering Georgia

More information

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte

A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER Jesus Garcia and Michael J. Schulte Lehigh University Department of Computer Science and Engineering Bethlehem, PA 15 ABSTRACT Galois field arithmetic

More information

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes

CMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 21: Shifters, Decoders, Muxes [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN

More information

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

! Charge Leakage/Charge Sharing.  Domino Logic Design Considerations. ! Logic Comparisons. ! Memory.  Classification.  ROM Memories. ESE 57: Digital Integrated Circuits and VLSI Fundamentals Lec 9: March 9, 8 Memory Overview, Memory Core Cells Today! Charge Leakage/ " Domino Logic Design Considerations! Logic Comparisons! Memory " Classification

More information

Making OpenVX Really Real Time

Making OpenVX Really Real Time Making OpenVX Really Real Time Ming Yang, Tanya Amert, Kecheng Yang,, Nathan Otterness, James H. Anderson, F. Donelson Smith, and Shige Wang 3 University of North Carolina at Chapel Hill Texas State University

More information

3. (2) What is the difference between fixed and hybrid instructions?

3. (2) What is the difference between fixed and hybrid instructions? 1. (2 pts) What is a "balanced" pipeline? 2. (2 pts) What are the two main ways to define performance? 3. (2) What is the difference between fixed and hybrid instructions? 4. (2 pts) Clock rates have grown

More information

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control

EXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard

More information

Fall 2008 CSE Qualifying Exam. September 13, 2008

Fall 2008 CSE Qualifying Exam. September 13, 2008 Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for

More information

CMSC 313 Lecture 16 Announcement: no office hours today. Good-bye Assembly Language Programming Overview of second half on Digital Logic DigSim Demo

CMSC 313 Lecture 16 Announcement: no office hours today. Good-bye Assembly Language Programming Overview of second half on Digital Logic DigSim Demo CMSC 33 Lecture 6 nnouncement: no office hours today. Good-bye ssembly Language Programming Overview of second half on Digital Logic DigSim Demo UMC, CMSC33, Richard Chang Good-bye ssembly

More information

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors 20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library

S XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,

More information

MONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA

MONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA MONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA TAKEAWAYS Why Monte Carlo methods are fundamentally different than deterministic methods Inherent Parallelism

More information

Dynamic Voltage and Frequency Scaling Under a Precise Energy Model Considering Variable and Fixed Components of the System Power Dissipation

Dynamic Voltage and Frequency Scaling Under a Precise Energy Model Considering Variable and Fixed Components of the System Power Dissipation Dynamic Voltage and Frequency Scaling Under a Precise Energy Model Csidering Variable and Fixed Compents of the System Power Dissipati Kihwan Choi W-bok Lee Ramakrishna Soma Massoud Pedram University of

More information

Computer Architecture 10. Fast Adders

Computer Architecture 10. Fast Adders Computer Architecture 10 Fast s Ma d e wi t h Op e n Of f i c e. o r g 1 Carry Problem Addition is primary mechanism in implementing arithmetic operations Slow addition directly affects the total performance

More information

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?

1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 2. (2 )What are the two main ways to define performance? 3. (2 )What

More information

Semiconductor Memories. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Paolo Spirito

Semiconductor Memories. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Paolo Spirito Semiconductor Memories Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Paolo Spirito Memory Classification Memory Classification Read-Write Memory Non-Volatile Read-Write Memory Read-Only Memory Random

More information

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Raj Parihar Advisor: Prof. Michael C. Huang March 22, 2013 Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism

More information

Lecture 3, Performance

Lecture 3, Performance Repeating some definitions: Lecture 3, Performance CPI MHz MIPS MOPS Clocks Per Instruction megahertz, millions of cycles per second Millions of Instructions Per Second = MHz / CPI Millions of Operations

More information

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of

Today. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message

More information

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS

Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Berk Hess, Szilárd Páll KTH Royal Institute of Technology GTC 2012 GROMACS: fast, scalable, free Classical molecular dynamics package

More information

GPU Computing with Applications in Digital Logic

GPU Computing with Applications in Digital Logic Tampere International Center for Signal Processing. TICSP series # 62 J. Astola, M. Kameyama, M. Lukac & R. S. Stanković (eds.) GPU Computing with Applications in Digital Logic Tampere International Center

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Amdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then:

Amdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then: Amdahl's Law Useful for evaluating the impact of a change. (A general observation.) Insight: Improving a feature cannot improve performance beyond the use of the feature Suppose we introduce a particular

More information

EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining

EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining Slide 1 EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining Slide 2 Topics Clocking Clock Parameters Latch Types Requirements for reliable clocking Pipelining Optimal pipelining

More information

Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults

Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults Mark Wilkening, Vilas Sridharan, Si Li, Fritz Previlon, Sudhanva Gurumurthi and David R. Kaeli ECE Department, Northeastern

More information

FastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration

FastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration FastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration Darshan Gandhi, Andreas Gerstlauer, Lizy John Electrical and Computer Engineering The University of Texas at Austin Email:

More information

Nikolaj Leischner, Vitaly Osipov, Peter Sanders

Nikolaj Leischner, Vitaly Osipov, Peter Sanders Nikolaj Leichner, Vitaly Oipov, Peter Sander Intitut für Theoretiche Informatik - Algorithmik II Oipov: 1 KITVitaly Univerität de Lande Baden-Württemerg und nationale Groforchungzentrum in der Helmholtz-Gemeinchaft

More information

[2] Predicting the direction of a branch is not enough. What else is necessary?

[2] Predicting the direction of a branch is not enough. What else is necessary? [2] When we talk about the number of operands in an instruction (a 1-operand or a 2-operand instruction, for example), what do we mean? [2] What are the two main ways to define performance? [2] Predicting

More information