Origami: Folding Warps for Energy Efficient GPUs
|
|
- Noel Hubbard
- 5 years ago
- Views:
Transcription
1 Origami: Folding Warps for Energy Efficient GPUs Mohammad Abdel-Majeed*, Daniel Wong, Justin Huang and Murali Annavaram* * University of Southern alifornia University of alifornia, Riverside Stanford University
2 Outline GPU overview Motivation and related work Warp Folding Origami Scheduler Evaluation 2
3 GPGPU Overview (GTX480) SM Instruction ache Fetch and decode Warp Scheduler (2-level) SFU SFU LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST INT Unit Operands FP Unit Result Queue Register File 128KB SFU LD/ST LD/ST LD/ST LD/ST Execution Units 64KB shared Memory/L1 cache SFU SFU LD/ST LD/ST LD/ST LD/ST LD/ST 193
4 GPGPU Overview (GTX480) SM Instruction ache Fetch and decode Warp Scheduler (2-level) SFU SFU LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST LD/ST INT Unit Operands FP Unit Result Queue Register File 128KB SFU LD/ST LD/ST LD/ST LD/ST Execution Units 64KB shared Memory/L1 cache SFU SFU LD/ST LD/ST LD/ST LD/ST LD/ST 193
5 GPGPU Power Break-Down DRAM EXE RF L M Pipeline onstant NO Other GPUWattch, ISA
6 GPGPU Power Break-Down DRAM EXE RF L M EXE 20.1% Pipeline onstant NO Other GPUWattch, ISA
7 GPU Scaling Trend GPU Fermi GTX 480 Kepler GTX 680 Maxwell GTX 980 ores (SMs) Execution Units RF size 128KB/SM 256KB/SM 256KB/SM #transistors 3 billion 3.5 billion 5.2 billion 5
8 GPU Scaling Trend GPU Fermi GTX 480 Kepler GTX 680 Maxwell GTX 980 ores (SMs) Execution Units RF size 128KB/SM 256KB/SM 256KB/SM #transistors 3 billion 3.5 billion 5.2 billion 5
9 GPU Scaling Trend GPU Fermi GTX 480 Kepler GTX 680 Maxwell GTX 980 ores (SMs) Execution Units RF size 128KB/SM 256KB/SM 256KB/SM #transistors 3 billion 3.5 billion 5.2 billion 5
10 GPU Scaling Trend GPU Fermi GTX 480 Kepler GTX 680 Maxwell GTX 980 ores (SMs) Execution Units RF size 128KB/SM 256KB/SM 256KB/SM #transistors 3 billion 3.5 billion 5.2 billion 5
11 GPU Scaling Trend GPU Fermi GTX 480 Kepler GTX 680 Maxwell GTX 980 ores (SMs) Execution Units RF size 128KB/SM 256KB/SM 256KB/SM #transistors 3 billion 3.5 billion 5.2 billion 5
12 Technology Scaling As technology scales leakage power will increase Accounts for 50% of the execution units power Power Gating can be used to reduce the leakage power Need long idle periods to be effective Warped Gates, MIRO
13 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET 7
14 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET Lost Opportunity 7
15 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET Lost Opportunity Energy Loss or Neutral 7
16 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET Lost Opportunity Energy Loss or Neutral Energy Savings 7
17 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET Lost Opportunity Energy Loss or Neutral Energy Savings 7
18 Power Gating hallenges in GPGPUs Int. Unit idle period length distribution for hotspot Assume 5 idle detect, 14 BET Lost Opportunity Energy Loss or Neutral Energy Savings Need to increase idle period length 7
19 Warp Scheduler Effect on Power Gating INT FP INT INT FP INTO Ready Warps INT FP 8 8
20 Warp Scheduler Effect on Power Gating INT FP Ready Warps 8 Busy (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) Idle 8
21 Warp Scheduler Effect on Power Gating Ready Warps Idle periods interrupted by instructions that are greedily scheduled (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) (INT) INT (INT) (INT) (INT) (INT) (INT) FP Need to coalesce warp issues (INT) (INT) (INT) (INT) by resource type Busy Idle 8 8
22 Related Work/Warped-Gates* Schedule instructions based on their type Force power gated units to stay in power gating state for at least the breakeven time 30% 54.3% 0.0% 45.7% Frequency 23% 15% 8% 0% Idle period length *Warped-Gates, MIRO
23 Related Work/Warped-Gates* Schedule instructions based on their type Force power gated units to stay in power gating state for at least the breakeven time 30% 54.3% 0.0% 45.7% Frequency 23% 15% 8% 0% Idle period length *Warped-Gates, MIRO
24 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities 10
25 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler INT INT INT INT FP FP FP FP 10
26 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler INT INT INT INT FP FP FP FP 10
27 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT INT FP FP FP FP 10
28 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT FP FP FP ycle X INT FP 10
29 Temporal idleness Fine grain idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT FP FP FP ycle X ycle X+1 Bubble INT FP 10
30 Fine grain idleness Temporal idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT FP FP FP ycle X ycle X+1 Bubble ycle X INT FP 10
31 Fine grain idleness Temporal idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT FP FP FP ycle X ycle X+1 Bubble ycle X ycle X+3 Bubble INT FP 10
32 Fine grain idleness Temporal idleness Infrequent issues to the same pipeline Finely interspersed leading to limited power gating opportunities Scheduler SP0 INT INT INT FP FP FP ycle X ycle X+1 Bubble ycle X ycle X+3 Bubble INT FP 10
33 Spatial Idleness Fine grain idleness Lanes have different activity Branch divergence Insufficient parallelism 11
34 Warp Folding Improve the power gating potential by coalescing the pipeline bubbles 12
35 Warp Folding Scheduler Ready Warps Queue Active Mask: Active Mask: Issued Warps Bubble 39
36 Warp Folding Scheduler Ready Warps Queue Active Mask: Active Mask: Issued Warps Bubble Sub_Warp0: Sub_Warp1: 39
37 Warp Folding Scheduler Ready Warps Queue Active Mask: Active Mask: Issued Warps Bubble Sub_Warp0: Sub_Warp1:
38 Warp Folding Scheduler Ready Warps Queue Issued Warps Active Mask: Active Mask: Bubble Sub_Warp0: Sub_Warp1:
39 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1:
40 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1: Sub_Warp0: Sub_Warp1:
41 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1: Sub_Warp0: Sub_Warp1:
42 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1:
43 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1:
44 Warp Folding Scheduler Ready Warps Queue Issued Warps Sub_Warp0: Sub_Warp1:
45 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask:
46 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: 14
47 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1:
48 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1:
49 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: Simple - High wiring overhead - Delay 14
50 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1:
51 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: 14
52 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: 14
53 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1:
54 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1:
55 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: Simple + Low wiring overhead + Small delay +Support for lane shuffling 14
56 Folding Granularity Scheduler Ready Warps Queue Issued Warps Active Mask: Sub_Warp0: Sub_Warp1: Simple + Low wiring overhead + Small delay +Support for lane shuffling 14
57 Warp Folding Pipeline SP Pipe Result ollector 15
58 Warp Folding Pipeline SP Pipe Result ollector 15
59 Warp Folding Pipeline SP Pipe Result ollector SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Selective Write Result ollector 15
60 Example 16
61 Example
62 Example
63 Example
64 Example
65 Example SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector 16
66 Example SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector 16
67 Example SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector 16
68 Example SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector 16
69 Example SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector SP Pipe Shifting Logic Shifting Logic Re-Shifting Logic Re-Shifting Logic Result ollector 16
70 Origami scheduler Improve the power gating potential by coalescing warps based on: Threads utilization Instruction type 17
71 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads 18
72 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5:
73 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 18
74 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 18
75 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 18
76 Origami scheduler Group the threads based on their active mask One group will have the active mask with less than 32 threads The other group will have the active masks with 32 active threads Active masks are not aligned!!! Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 18
77 Lane Shifting Shift the threads to the lower order SIMT lanes Done at the cluster level to reduce overhead Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 19
78 Lane Shifting Shift the threads to the lower order SIMT lanes Done at the cluster level to reduce overhead Lane#: ycle x: ycle x+1: ycle x+2: ycle x+3: ycle x+4: ycle x+5: Less than 32 group Equal to 32 group 19
79 Origami Scheduling Runtime Warp Folding Algorithm Folds warps long enough to guarantee savings Nphase = Npipelineflush + Nidledetect + Nbreakeventime Adaptive Folding Aggressive folding for warps with lower instruction count onservative folding for warps with higher instruction count hange folding frequency based on application utilization See paper for more detail! 20
80 EVALUATION 21
81 GPGPU-Sim v3.0.2 Nvidia GTX480 Evaluation Methodology GPUWattch and McPAT for energy and area estimation Benchmarks from ISPASS, Rodinia and Parboil Power gating parameters Wakeup delay 3 cycles Breakeven time 14 cycles Idle detect 5 cycles 22
82 Folding Ratio Folding frequency is application dependent 23
83 Folding Ratio Folding frequency is application dependent 23
84 Folding Ratio Folding frequency is application dependent 23
85 Energy Savings/INT Eliminates negative energy savings Origami scheduler able to amplify folding benefits Origami is able to save 49% 24
86 Energy Savings/INT Eliminates negative energy savings Origami scheduler able to amplify folding benefits Origami is able to save 49% 24
87 Energy Savings/FP Eliminates negative energy savings Origami scheduler able to amplify folding benefits Origami is able to save 46% 25
88 Energy Savings/FP Eliminates negative energy savings Origami scheduler able to amplify folding benefits Origami is able to save 46% 25
89 Performance Origami is able to reduce the performance overhead significantly over Warped-Gates Origami scheduler has positive impact on performance for some workloads 26
90 Performance Origami is able to reduce the performance overhead significantly over Warped-Gates Origami scheduler has positive impact on performance for some workloads 26
91 onclusion Execution units energy efficiency is critical Take advantage of the spatial and temporal idleness to Improve the power gating potential Warp folding Adaptively fold warp to coalesce bubbles Origmai scheduler Scheduler warps based on the threads activity and type. Able to save 49% and 46% of the execution units leakage energy Negligible performance overhead 27
92 Questions? Origami: Folding Warps for Energy Efficient GPUs Mohammad Abdel-Majeed*, Daniel Wong, Justin Huang and Murali Annavaram* * University of Southern alifornia University of alifornia, Riverside Stanford University
93 THANK YOU! 29
Evaluating Overheads of Multi-bit Soft Error Protection Techniques at Hardware Level Sponsored by SRC and Freescale under SRC task number 2042
Evaluating Overheads of Multi-bit Soft Error Protection Techniques at Hardware Level Sponsored by SR and Freescale under SR task number 2042 Lukasz G. Szafaryn, Kevin Skadron Department of omputer Science
More informationVector Lane Threading
Vector Lane Threading S. Rivoire, R. Schultz, T. Okuda, C. Kozyrakis Computer Systems Laboratory Stanford University Motivation Vector processors excel at data-level parallelism (DLP) What happens to program
More informationCSCI-564 Advanced Computer Architecture
CSCI-564 Advanced Computer Architecture Lecture 8: Handling Exceptions and Interrupts / Superscalar Bo Wu Colorado School of Mines Branch Delay Slots (expose control hazard to software) Change the ISA
More informationCMP N 301 Computer Architecture. Appendix C
CMP N 301 Computer Architecture Appendix C Outline Introduction Pipelining Hazards Pipelining Implementation Exception Handling Advanced Issues (Dynamic Scheduling, Out of order Issue, Superscalar, etc)
More informationOptimized LU-decomposition with Full Pivot for Small Batched Matrices S3069
Optimized LU-decomposition with Full Pivot for Small Batched Matrices S369 Ian Wainwright High Performance Consulting Sweden ian.wainwright@hpcsweden.se Based on work for GTC 212: 1x speed-up vs multi-threaded
More informationAn Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))
An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher
More informationParallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB
Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document
More informationScheduling for Reduced CPU Energy
Scheduling for Reduced CPU Energy M. Weiser, B. Welch, A. Demers and S. Shenker Appears in "Proceedings of the First Symposium on Operating Systems Design and Implementation," Usenix Association, November
More informationSimple Instruction-Pipelining. Pipelined Harvard Datapath
6.823, L8--1 Simple ruction-pipelining Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. I fetch decode & eg-fetch execute memory Clock period
More informationPerformance, Power & Energy
Recall: Goal of this class Performance, Power & Energy ELE8106/ELE6102 Performance Reconfiguration Power/ Energy Spring 2010 Hayden Kwok-Hay So H. So, Sp10 Lecture 3 - ELE8106/6102 2 What is good performance?
More informationMicroarchitectural Techniques for Power Gating of Execution Units
2.2 Microarchitectural Techniques for Power Gating of Execution Units Zhigang Hu, Alper Buyuktosunoglu, Viji Srinivasan, Victor Zyuban, Hans Jacobson, Pradip Bose IBM T. J. Watson Research Center ABSTRACT
More informationHardware Acceleration of DNNs
Lecture 12: Hardware Acceleration of DNNs Visual omputing Systems Stanford S348V, Winter 2018 Hardware acceleration for DNNs Huawei Kirin NPU Google TPU: Apple Neural Engine Intel Lake rest Deep Learning
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationSISD SIMD. Flynn s Classification 8/8/2016. CS528 Parallel Architecture Classification & Single Core Architecture C P M
8/8/26 S528 arallel Architecture lassification & Single ore Architecture arallel Architecture lassification A Sahu Dept of SE, IIT Guwahati A Sahu Flynn s lassification SISD Architecture ategories M SISD
More informationSimple Instruction-Pipelining. Pipelined Harvard Datapath
6.823, L8--1 Simple ruction-pipelining Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Pipelined Harvard path 6.823, L8--2. fetch decode & eg-fetch execute
More informationPERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.
More informationECE 3401 Lecture 23. Pipeline Design. State Table for 2-Cycle Instructions. Control Unit. ISA: Instruction Specifications (for reference)
ECE 3401 Lecture 23 Pipeline Design Control State Register Combinational Control Logic New/ Modified Control Word ISA: Instruction Specifications (for reference) P C P C + 1 I N F I R M [ P C ] E X 0 PC
More informationCRYPTOGRAPHIC COMPUTING
CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,
More informationTHE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University
THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY Daniel Sanchez and Christos Kozyrakis Stanford University MICRO-43, December 6 th 21 Executive Summary 2 Mitigating the memory wall requires large, highly
More informationReal-time signal detection for pulsars and radio transients using GPUs
Real-time signal detection for pulsars and radio transients using GPUs W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 15 th July 2013 1 Background of GPUs Why use GPUs? Influence
More informationNCU EE -- DSP VLSI Design. Tsung-Han Tsai 1
NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using
More informationMicroprocessor Power Analysis by Labeled Simulation
Microprocessor Power Analysis by Labeled Simulation Cheng-Ta Hsieh, Kevin Chen and Massoud Pedram University of Southern California Dept. of EE-Systems Los Angeles CA 989 Outline! Introduction! Problem
More informationEnhancing the Sniper Simulator with Thermal Measurement
Proceedings of the 18th International Conference on System Theory, Control and Computing, Sinaia, Romania, October 17-19, 214 Enhancing the Sniper Simulator with Thermal Measurement Adrian Florea, Claudiu
More informationSolving RODEs on GPU clusters
HIGH TEA @ SCIENCE Solving RODEs on GPU clusters Christoph Riesinger Technische Universität München March 4, 206 HIGH TEA @ SCIENCE, March 4, 206 Motivation - Parallel Computing HIGH TEA @ SCIENCE, March
More informationDMP. Deterministic Shared Memory Multiprocessing. Presenter: Wu, Weiyi Yale University
DMP Deterministic Shared Memory Multiprocessing 1 Presenter: Wu, Weiyi Yale University Outline What is determinism? How to make execution deterministic? What s the overhead of determinism? 2 What Is Determinism?
More informationCMPEN 411 VLSI Digital Circuits Spring Lecture 19: Adder Design
CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 19: Adder Design [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN 411 L19
More informationPerformance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So
Performance, Power & Energy ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So Recall: Goal of this class Performance Reconfiguration Power/ Energy H. So, Sp10 Lecture 3 - ELEC8106/6102 2 PERFORMANCE EVALUATION
More informationSimultaneous Branch and Warp Interweaving for Sustained GPU Performance
Simultaneous Branch and Warp Interweaving for Sustained GPU Performance Nicolas Brunie Kalray and ENS Lyon nicolas.brunie@kalray.eu Sylvain Collange Univ. Federal de Minas Gerais sylvain.collange@dcc.ufmg.br
More informationNon-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund
Non-Preemptive and Limited Preemptive Scheduling LS 12, TU Dortmund 09 May 2017 (LS 12, TU Dortmund) 1 / 31 Outline Non-Preemptive Scheduling A General View Exact Schedulability Test Pessimistic Schedulability
More informationDepartment of Electrical and Computer Engineering The University of Texas at Austin
Department of Electrical and Computer Engineering The University of Texas at Austin EE 360N, Fall 2004 Yale Patt, Instructor Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs Exam 1, October 6, 2004 Name:
More informationAdministrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application
Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory
More informationReducing NVM Writes with Optimized Shadow Paging
Reducing NVM Writes with Optimized Shadow Paging Yuanjiang Ni, Jishen Zhao, Daniel Bittman, Ethan L. Miller Center for Research in Storage Systems University of California, Santa Cruz Emerging Technology
More informationQuantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm)
Quantum Computer Simulation Using CUDA (Quantum Fourier Transform Algorithm) Alexander Smith & Khashayar Khavari Department of Electrical and Computer Engineering University of Toronto April 15, 2009 Alexander
More informationJulian Merten. GPU Computing and Alternative Architecture
Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationCMU Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining
CMU 18-447 Introduction to Computer Architecture, Spring 2015 HW 2: ISA Tradeoffs, Microprogramming and Pipelining Instructor: Prof Onur Mutlu TAs: Rachata Ausavarungnirun, Kevin Chang, Albert Cho, Jeremie
More informationIssue = Select + Wakeup. Out-of-order Pipeline. Issue. Issue = Select + Wakeup. OOO execution (2-wide) OOO execution (2-wide)
Out-of-order Pipeline Buffer of instructions Issue = Select + Wakeup Select N oldest, read instructions N=, xor N=, xor and sub Note: ma have execution resource constraints: i.e., load/store/fp Fetch Decode
More informationEnergy-efficient Mapping of Big Data Workflows under Deadline Constraints
Energy-efficient Mapping of Big Data Workflows under Deadline Constraints Presenter: Tong Shu Authors: Tong Shu and Prof. Chase Q. Wu Big Data Center Department of Computer Science New Jersey Institute
More informationLRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation
LRADNN: High-Throughput and Energy- Efficient Deep Neural Network Accelerator using Low Rank Approximation Jingyang Zhu 1, Zhiliang Qian 2, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationFaster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs
Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University 2 Outline
More informationMicro-architecture Pipelining Optimization with Throughput- Aware Floorplanning
Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning Yuchun Ma* Zhuoyuan Li* Jason Cong Xianlong Hong Glenn Reinman Sheqin Dong* Qiang Zhou *Department of Computer Science &
More informationCEC 450 Real-Time Systems
E 450 Real-ime Systems Lecture 4 Rate Monotonic heory Part September 7, 08 Sam Siewert Quiz Results 93% Average, 75 low, 00 high Goal is to learn what you re not learning yet Motivation to keep up with
More informationCMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues
CMPEN 411 VLSI Digital Circuits Spring 2012 Lecture 17: Dynamic Sequential Circuits And Timing Issues [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan,
More informationScalable Store-Load Forwarding via Store Queue Index Prediction
Scalable Store-Load Forwarding via Store Queue Index Prediction Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu addr addr addr (CAM) predictor
More informationESE 570: Digital Integrated Circuits and VLSI Fundamentals
ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 21: April 4, 2017 Memory Overview, Memory Core Cells Penn ESE 570 Spring 2017 Khanna Today! Memory " Classification " ROM Memories " RAM Memory
More informationThis Unit: Scheduling (Static + Dynamic) CIS 501 Computer Architecture. Readings. Review Example
This Unit: Scheduling (Static + Dnamic) CIS 50 Computer Architecture Unit 8: Static and Dnamic Scheduling Application OS Compiler Firmware CPU I/O Memor Digital Circuits Gates & Transistors! Previousl:!
More informationA Novel Software Solution for Localized Thermal Problems
A Novel Software Solution for Localized Thermal Problems Sung Woo Chung 1,* and Kevin Skadron 2 1 Division of Computer and Communication Engineering, Korea University, Seoul 136-713, Korea swchung@korea.ac.kr
More informationFrom Physics to Logic
From Physics to Logic This course aims to introduce you to the layers of abstraction of modern computer systems. We won t spend much time below the level of bits, bytes, words, and functional units, but
More informationVoltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors
Voltage and Frequency Control With Adaptive Reaction Time in Multiple-Clock-Domain Processors Qiang Wu Philo Juang Margaret Martonosi Douglas W. Clark Depts. of Computer Science and Electrical Engineering
More informationCHAPTER 5 - PROCESS SCHEDULING
CHAPTER 5 - PROCESS SCHEDULING OBJECTIVES To introduce CPU scheduling, which is the basis for multiprogrammed operating systems To describe various CPU-scheduling algorithms To discuss evaluation criteria
More informationComputer Architecture
Computer Architecture QtSpim, a Mips simulator S. Coudert and R. Pacalet January 4, 2018..................... Memory Mapping 0xFFFF000C 0xFFFF0008 0xFFFF0004 0xffff0000 0x90000000 0x80000000 0x7ffff4d4
More informationChapter Overview. Memory Classification. Memory Architectures. The Memory Core. Periphery. Reliability. Memory
SRAM Design Chapter Overview Classification Architectures The Core Periphery Reliability Semiconductor Classification RWM NVRWM ROM Random Access Non-Random Access EPROM E 2 PROM Mask-Programmed Programmable
More informationArchitecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions
In the Proceedings of the International Conference on Dependable Systems and Networks(DSN 07), June 2007. Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions Xiaodong Li,
More informationName: Answers. Mean: 83, Standard Deviation: 12 Q1 Q2 Q3 Q4 Q5 Q6 Total. ESE370 Fall 2015
University of Pennsylvania Department of Electrical and System Engineering Circuit-Level Modeling, Design, and Optimization for Digital Systems ESE370, Fall 2015 Final Tuesday, December 15 Problem weightings
More informationSynthesizing Asynchronous Burst-Mode Machines without the Fundamental-Mode Timing Assumption
Synthesizing synchronous urst-mode Machines without the Fundamental-Mode Timing ssumption Gennette Gill Montek Singh Univ. of North Carolina Chapel Hill, NC, US Contribution Synthesize robust asynchronous
More informationDigital Systems. Validation, verification. R. Pacalet January 4, 2018
Digital Systems Validation, verification R. Pacalet January 4, 2018 2/98 Simulation Extra design tasks Reference model Simulation environment A simulation cannot be exhaustive Can discover a bug Cannot
More informationDigital Integrated Circuits A Design Perspective. Semiconductor. Memories. Memories
Digital Integrated Circuits A Design Perspective Semiconductor Chapter Overview Memory Classification Memory Architectures The Memory Core Periphery Reliability Case Studies Semiconductor Memory Classification
More informationLecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013
Lecture 3 Real-Time Scheduling Daniel Kästner AbsInt GmbH 203 Model-based Software Development 2 SCADE Suite Application Model in SCADE (data flow + SSM) System Model (tasks, interrupts, buses, ) SymTA/S
More informationClock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements.
1 2 Introduction Clock signal in digital circuit is responsible for synchronizing the transfer to the data between processing elements. Defines the precise instants when the circuit is allowed to change
More informationCS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/
CS-206 Concurrency Lecture 13 Wrap Up Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ Created by Nooshin Mirzadeh, Georgios Psaropoulos and Babak Falsafi EPFL Copyright 2015 EPFL CS-206 Spring
More informationWorst-Case Execution Time Analysis. LS 12, TU Dortmund
Worst-Case Execution Time Analysis Prof. Dr. Jian-Jia Chen LS 12, TU Dortmund 02, 03 May 2016 Prof. Dr. Jian-Jia Chen (LS 12, TU Dortmund) 1 / 53 Most Essential Assumptions for Real-Time Systems Upper
More informationESE 570: Digital Integrated Circuits and VLSI Fundamentals
ESE 570: Digital Integrated Circuits and VLSI Fundamentals Lec 19: March 29, 2018 Memory Overview, Memory Core Cells Today! Charge Leakage/Charge Sharing " Domino Logic Design Considerations! Logic Comparisons!
More informationInstruction Set Extensions for Reed-Solomon Encoding and Decoding
Instruction Set Extensions for Reed-Solomon Encoding and Decoding Suman Mamidi and Michael J Schulte Dept of ECE University of Wisconsin-Madison {mamidi, schulte}@caewiscedu http://mesaecewiscedu Daniel
More informationBeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power
BeiHang Short Course, Part 7: HW Acceleration: It s about Performance, Energy and Power James C. Hoe Department of ECE Carnegie Mellon niversity Eric S. Chung, et al., Single chip Heterogeneous Computing:
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.
More informationAnalytical Modeling of Parallel Programs (Chapter 5) Alexandre David
Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on
More informationSimple Instruction-Pipelining (cont.) Pipelining Jumps
6.823, L9--1 Simple ruction-pipelining (cont.) + Interrupts Updated March 6, 2000 Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Src1 ( j / ~j ) Src2 ( / Ind) Pipelining Jumps
More informationarxiv: v1 [hep-lat] 7 Oct 2010
arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA
More informationTechnical Report GIT-CERCS. Thermal Field Management for Many-core Processors
Technical Report GIT-CERCS Thermal Field Management for Many-core Processors Minki Cho, Nikhil Sathe, Sudhakar Yalamanchili and Saibal Mukhopadhyay School of Electrical and Computer Engineering Georgia
More informationA COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER. Jesus Garcia and Michael J. Schulte
A COMBINED 16-BIT BINARY AND DUAL GALOIS FIELD MULTIPLIER Jesus Garcia and Michael J. Schulte Lehigh University Department of Computer Science and Engineering Bethlehem, PA 15 ABSTRACT Galois field arithmetic
More informationCMPEN 411 VLSI Digital Circuits Spring Lecture 21: Shifters, Decoders, Muxes
CMPEN 411 VLSI Digital Circuits Spring 2011 Lecture 21: Shifters, Decoders, Muxes [Adapted from Rabaey s Digital Integrated Circuits, Second Edition, 2003 J. Rabaey, A. Chandrakasan, B. Nikolic] Sp11 CMPEN
More information! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.
ESE 57: Digital Integrated Circuits and VLSI Fundamentals Lec 9: March 9, 8 Memory Overview, Memory Core Cells Today! Charge Leakage/ " Domino Logic Design Considerations! Logic Comparisons! Memory " Classification
More informationMaking OpenVX Really Real Time
Making OpenVX Really Real Time Ming Yang, Tanya Amert, Kecheng Yang,, Nathan Otterness, James H. Anderson, F. Donelson Smith, and Shige Wang 3 University of North Carolina at Chapel Hill Texas State University
More information3. (2) What is the difference between fixed and hybrid instructions?
1. (2 pts) What is a "balanced" pipeline? 2. (2 pts) What are the two main ways to define performance? 3. (2) What is the difference between fixed and hybrid instructions? 4. (2 pts) Clock rates have grown
More informationEXAMPLES 4/12/2018. The MIPS Pipeline. Hazard Summary. Show the pipeline diagram. Show the pipeline diagram. Pipeline Datapath and Control
The MIPS Pipeline CSCI206 - Computer Organization & Programming Pipeline Datapath and Control zybook: 11.6 Developed and maintained by the Bucknell University Computer Science Department - 2017 Hazard
More informationFall 2008 CSE Qualifying Exam. September 13, 2008
Fall 2008 CSE Qualifying Exam September 13, 2008 1 Architecture 1. (Quan, Fall 2008) Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for
More informationCMSC 313 Lecture 16 Announcement: no office hours today. Good-bye Assembly Language Programming Overview of second half on Digital Logic DigSim Demo
CMSC 33 Lecture 6 nnouncement: no office hours today. Good-bye ssembly Language Programming Overview of second half on Digital Logic DigSim Demo UMC, CMSC33, Richard Chang Good-bye ssembly
More informationSaving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors
20th Euromicro International Conference on Parallel, Distributed and Network-Based Special Session on Energy-aware Systems Saving Energy in the on Multi-Core Processors Pedro Alonso 1, Manuel F. Dolz 2,
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationS XMP LIBRARY INTERNALS. Niall Emmart University of Massachusetts. Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library
S6349 - XMP LIBRARY INTERNALS Niall Emmart University of Massachusetts Follow on to S6151 XMP: An NVIDIA CUDA Accelerated Big Integer Library High Performance Modular Exponentiation A^K mod P Where A,
More informationMONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA
MONTE CARLO NEUTRON TRANSPORT SIMULATING NUCLEAR REACTIONS ONE NEUTRON AT A TIME Tony Scudiero NVIDIA TAKEAWAYS Why Monte Carlo methods are fundamentally different than deterministic methods Inherent Parallelism
More informationDynamic Voltage and Frequency Scaling Under a Precise Energy Model Considering Variable and Fixed Components of the System Power Dissipation
Dynamic Voltage and Frequency Scaling Under a Precise Energy Model Csidering Variable and Fixed Compents of the System Power Dissipati Kihwan Choi W-bok Lee Ramakrishna Soma Massoud Pedram University of
More informationComputer Architecture 10. Fast Adders
Computer Architecture 10 Fast s Ma d e wi t h Op e n Of f i c e. o r g 1 Carry Problem Addition is primary mechanism in implementing arithmetic operations Slow addition directly affects the total performance
More information1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished?
1. (2 )Clock rates have grown by a factor of 1000 while power consumed has only grown by a factor of 30. How was this accomplished? 2. (2 )What are the two main ways to define performance? 3. (2 )What
More informationSemiconductor Memories. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Paolo Spirito
Semiconductor Memories Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Paolo Spirito Memory Classification Memory Classification Read-Write Memory Non-Volatile Read-Write Memory Read-Only Memory Random
More informationAccelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism Raj Parihar Advisor: Prof. Michael C. Huang March 22, 2013 Raj Parihar Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
More informationLecture 3, Performance
Repeating some definitions: Lecture 3, Performance CPI MHz MIPS MOPS Clocks Per Instruction megahertz, millions of cycles per second Millions of Instructions Per Second = MHz / CPI Millions of Operations
More informationToday. ESE532: System-on-a-Chip Architecture. Energy. Message. Preclass Challenge: Power. Energy Today s bottleneck What drives Efficiency of
ESE532: System-on-a-Chip Architecture Day 20: November 8, 2017 Energy Today Energy Today s bottleneck What drives Efficiency of Processors, FPGAs, accelerators How does parallelism impact energy? 1 2 Message
More informationEfficient Molecular Dynamics on Heterogeneous Architectures in GROMACS
Efficient Molecular Dynamics on Heterogeneous Architectures in GROMACS Berk Hess, Szilárd Páll KTH Royal Institute of Technology GTC 2012 GROMACS: fast, scalable, free Classical molecular dynamics package
More informationGPU Computing with Applications in Digital Logic
Tampere International Center for Signal Processing. TICSP series # 62 J. Astola, M. Kameyama, M. Lukac & R. S. Stanković (eds.) GPU Computing with Applications in Digital Logic Tampere International Center
More informationAnalytical Modeling of Parallel Systems
Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationAmdahl's Law. Execution time new = ((1 f) + f/s) Execution time. S. Then:
Amdahl's Law Useful for evaluating the impact of a change. (A general observation.) Insight: Improving a feature cannot improve performance beyond the use of the feature Suppose we introduce a particular
More informationEE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining
Slide 1 EE382 Processor Design Winter 1999 Chapter 2 Lectures Clocking and Pipelining Slide 2 Topics Clocking Clock Parameters Latch Types Requirements for reliable clocking Pipelining Optimal pipelining
More informationCalculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults
Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults Mark Wilkening, Vilas Sridharan, Si Li, Fritz Previlon, Sudhanva Gurumurthi and David R. Kaeli ECE Department, Northeastern
More informationFastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration
FastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration Darshan Gandhi, Andreas Gerstlauer, Lizy John Electrical and Computer Engineering The University of Texas at Austin Email:
More informationNikolaj Leischner, Vitaly Osipov, Peter Sanders
Nikolaj Leichner, Vitaly Oipov, Peter Sander Intitut für Theoretiche Informatik - Algorithmik II Oipov: 1 KITVitaly Univerität de Lande Baden-Württemerg und nationale Groforchungzentrum in der Helmholtz-Gemeinchaft
More information[2] Predicting the direction of a branch is not enough. What else is necessary?
[2] When we talk about the number of operands in an instruction (a 1-operand or a 2-operand instruction, for example), what do we mean? [2] What are the two main ways to define performance? [2] Predicting
More information