Origami: Folding Warps for Energy Efficient GPUs

1 Origami: Folding Warps for Energy Efficient GPUs
Mohammad Abdel-Majeed*, Daniel Wong, Justin Huang and Murali Annavaram*
* University of Southern California; University of California, Riverside; Stanford University

2 Outline
- GPU overview
- Motivation and related work
- Warp Folding
- Origami Scheduler
- Evaluation

3 GPGPU Overview (GTX480)
[Figure: SM block diagram showing the instruction cache, fetch and decode, the 2-level warp scheduler, the 128KB register file, operands, the execution units (INT units, FP units, SFUs, LD/ST units), the result queue, and the 64KB shared memory/L1 cache]

5 GPGPU Power Break-Down
- Components: DRAM, EXE, RF, L M, Pipeline, Constant, NoC, Other
- EXE: 20.1%
- Source: GPUWattch, ISCA

7 GPU Scaling Trend

GPU              Fermi GTX 480   Kepler GTX 680   Maxwell GTX 980
Cores (SMs)
Execution Units  512             1536             2048
RF size          128KB/SM        256KB/SM         256KB/SM
#transistors     3 billion       3.5 billion      5.2 billion

12 Technology Scaling
- As technology scales, leakage power will increase
- Leakage accounts for 50% of the execution units' power
- Power gating can be used to reduce leakage power
- Power gating needs long idle periods to be effective
- Source: Warped Gates, MICRO

13 Power Gating Challenges in GPGPUs
[Figure: Int. unit idle period length distribution for hotspot, assuming 5 cycles of idle detect and a 14-cycle BET (break-even time); regions labeled Lost Opportunity, Energy Loss or Neutral, and Energy Savings]
- Need to increase idle period length
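
The three regions on this slide can be sketched as a small classifier. This is only an illustration, assuming the 5-cycle idle detect and 14-cycle break-even time quoted on the slide; the function name and the exact handling of the boundary cycles are assumptions, not taken from the talk.

```python
# Values quoted on the slide.
IDLE_DETECT = 5   # idle cycles observed before the unit is power gated
BREAK_EVEN = 14   # gated cycles needed to recoup the gating overhead


def classify_idle_period(length, idle_detect=IDLE_DETECT, break_even=BREAK_EVEN):
    """Map an idle period length (in cycles) to one of the slide's regions.

    Boundary handling (<= versus <) is an assumption made for this sketch.
    """
    if length <= idle_detect:
        # The period ends before gating can start.
        return "lost opportunity"
    if length <= idle_detect + break_even:
        # The unit is gated, but not long enough to come out ahead.
        return "energy loss or neutral"
    return "energy savings"


for cycles in (3, 10, 19, 20):
    print(cycles, classify_idle_period(cycles))
```

Under these assumptions an idle period has to exceed 19 cycles before gating pays off, which is why the slide asks for longer idle periods.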

19 Warp Scheduler Effect on Power Gating
[Figure: ready warps issuing INT and FP instructions, with busy/idle timelines for the INT and FP units]
- Idle periods are interrupted by instructions that are greedily scheduled
- Need to coalesce warp issues by resource type
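
A toy model can make the effect of issue order visible. The sketch below assumes one instruction issued per cycle and a unit that is busy only in the cycle it receives an instruction; both simplifications, and the function name, are assumptions for illustration rather than the scheduler described in the talk.

```python
from itertools import groupby


def int_idle_periods(issue_order):
    """Lengths of consecutive cycles in which the INT unit receives nothing."""
    return [
        sum(1 for _ in run)
        for is_int, run in groupby(issue_order, key=lambda unit: unit == "INT")
        if not is_int
    ]


# Greedy issue that happens to alternate between instruction types.
interleaved = ["INT", "FP"] * 4
# Same instructions, coalesced by resource type (sorted() is stable).
grouped = sorted(interleaved, key=lambda unit: unit != "INT")

print(int_idle_periods(interleaved))  # many short idle periods
print(int_idle_periods(grouped))      # one longer idle period
```

The same eight instructions produce four one-cycle idle periods when interleaved and a single four-cycle period when grouped by type.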

22 Related Work: Warped-Gates*
- Schedule instructions based on their type
- Force power-gated units to stay in the power-gated state for at least the break-even time
[Figure: idle period length distribution (x-axis: idle period length; y-axis: frequency, 0% to 30%), annotated with 54.3%, 0.0%, and 45.7%]
*Warped-Gates, MICRO

24 Fine grain idleness: Temporal idleness
- Infrequent issues to the same pipeline
- Finely interspersed, leading to limited power gating opportunities
[Figure: scheduler issuing INT and FP warps to SP0 over cycles X through X+3, with bubbles appearing in the INT and FP pipelines]

33 Fine grain idleness: Spatial idleness
- Lanes have different activity
  - Branch divergence
  - Insufficient parallelism

34 Warp Folding
- Improve the power gating potential by coalescing the pipeline bubbles

35 Warp Folding
[Figure: animation showing the scheduler, the ready warps queue with each warp's active mask, the issued warps, a pipeline bubble, and a warp folded into Sub_Warp0 and Sub_Warp1]
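
The split into Sub_Warp0 and Sub_Warp1 can be sketched on the active mask alone. The sketch assumes a 32-thread warp folded into two 16-lane halves that issue back to back on the lower lanes; the half-warp granularity and the function name are assumptions for illustration, since the slide text does not give them.

```python
WARP_SIZE = 32
HALF = WARP_SIZE // 2           # assumed folding granularity: half a warp
HALF_MASK = (1 << HALF) - 1


def fold_warp(active_mask):
    """Split a 32-bit active mask into (Sub_Warp0, Sub_Warp1) 16-lane masks.

    Sub_Warp0 carries lanes 0-15 and Sub_Warp1 carries lanes 16-31. In this
    sketch both are meant to issue on the lower 16 lanes in consecutive
    cycles, which leaves the upper 16 lanes idle for the whole fold.
    """
    sub_warp0 = active_mask & HALF_MASK
    sub_warp1 = (active_mask >> HALF) & HALF_MASK
    return sub_warp0, sub_warp1


sub0, sub1 = fold_warp(0x00F0000F)
print(f"{sub0:016b} {sub1:016b}")
```

Folding trades one extra issue cycle for a longer idle stretch on the unused lanes, which is the bubble coalescing the previous slide refers to.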

45 Folding Granularity
[Figure: two folding designs, each showing the scheduler, ready warps queue, issued warps, an active mask, and the resulting Sub_Warp0 and Sub_Warp1]
- First design shown: simple, but with high wiring overhead (-) and delay (-)
- Second design shown: simple, with low wiring overhead (+), small delay (+), and support for lane shuffling (+)

57 Warp Folding Pipeline
[Figure: SP pipe and result collector, extended with shifting logic, re-shifting logic, and selective write]

60 Example
[Figure: animated walk-through of the SP pipe with shifting logic, re-shifting logic, and the result collector]

70 Origami scheduler
- Improve the power gating potential by coalescing warps based on:
  - Thread utilization
  - Instruction type

71 Origami scheduler
- Group the warps based on their active mask
  - One group holds the warps whose active mask has fewer than 32 active threads
  - The other group holds the warps whose active mask has all 32 threads active
- Active masks are not aligned!
[Figure: per-lane activity for cycles x through x+5, with warps split into a "less than 32" group and an "equal to 32" group]
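
The grouping rule on this slide amounts to a population count on each active mask. The sketch below assumes ready warps are available as (warp id, 32-bit active mask) pairs; that representation and the function names are hypothetical.

```python
WARP_SIZE = 32


def popcount(mask):
    """Number of active threads in a 32-bit active mask."""
    return bin(mask & 0xFFFFFFFF).count("1")


def group_by_utilization(ready_warps):
    """Split (warp_id, active_mask) pairs into the slide's two groups.

    Returns (less_than_32, equal_to_32) as lists of warp ids, preserving
    the order in which the warps appear in the ready queue.
    """
    less_than_32, equal_to_32 = [], []
    for warp_id, mask in ready_warps:
        if popcount(mask) == WARP_SIZE:
            equal_to_32.append(warp_id)
        else:
            less_than_32.append(warp_id)
    return less_than_32, equal_to_32


ready = [(0, 0xFFFFFFFF), (1, 0x0000FFFF), (2, 0xFFFFFFFF), (3, 0x00000001)]
print(group_by_utilization(ready))
```

Grouping alone does not line up the idle lanes across warps, which is the misalignment the slide points out.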

77 Lane Shifting
- Shift the threads to the lower-order SIMT lanes
- Done at the cluster level to reduce overhead
[Figure: per-lane activity for cycles x through x+5 after shifting, for the "less than 32" and "equal to 32" groups]
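
One way to read cluster-level shifting is that active threads are packed toward the lowest lanes inside each cluster, never across clusters. The sketch below follows that reading and tracks only the resulting active mask; the 4-lane cluster size and the function name are assumptions, since the slide does not state them.

```python
def shift_lanes(active_mask, cluster_size=4, warp_size=32):
    """Pack active threads into the low-order lanes of each cluster.

    cluster_size=4 is an assumed value for illustration. Only the shape of
    the active mask is modeled, not which thread lands on which lane.
    """
    cluster_bits = (1 << cluster_size) - 1
    shifted = 0
    for base in range(0, warp_size, cluster_size):
        active = bin((active_mask >> base) & cluster_bits).count("1")
        # 'active' threads occupy the lowest 'active' lanes of this cluster.
        shifted |= ((1 << active) - 1) << base
    return shifted


print(f"{shift_lanes(0b1010_0100):08b}")
```

After shifting, warps with the same per-cluster thread counts leave the same high-order lanes idle, so their idle lanes line up from cycle to cycle.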

79 Origami Scheduling
- Runtime warp folding algorithm
  - Folds warps long enough to guarantee savings
  - N_phase = N_pipelineflush + N_idledetect + N_breakeventime
- Adaptive folding
  - Aggressive folding for warps with lower instruction count
  - Conservative folding for warps with higher instruction count
  - Change folding frequency based on application utilization
- See paper for more detail!
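
The phase length formula is a plain sum and can be evaluated directly. The sketch below uses the idle detect (5 cycles) and break-even (14 cycles) values from the evaluation methodology slide as defaults; the pipeline flush length is not given in the slides, so the value 4 in the example is purely hypothetical.

```python
def folding_phase_length(pipeline_flush, idle_detect=5, break_even=14):
    """N_phase = N_pipelineflush + N_idledetect + N_breakeventime (in cycles)."""
    return pipeline_flush + idle_detect + break_even


# pipeline_flush=4 is an assumed value used only to show the arithmetic.
n_phase = folding_phase_length(pipeline_flush=4)
print(n_phase)
```

With these numbers a fold would need to last 4 + 5 + 14 = 23 cycles before the gated lanes are guaranteed to save energy.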

80 EVALUATION

81 Evaluation Methodology
- GPGPU-Sim v3.0.2, Nvidia GTX480
- GPUWattch and McPAT for energy and area estimation
- Benchmarks from ISPASS, Rodinia and Parboil
- Power gating parameters
  - Wakeup delay: 3 cycles
  - Breakeven time: 14 cycles
  - Idle detect: 5 cycles

82 Folding Ratio
- Folding frequency is application dependent

85 Energy Savings/INT
- Eliminates negative energy savings
- Origami scheduler is able to amplify folding benefits
- Origami is able to save 49% of the INT units' leakage energy

87 Energy Savings/FP
- Eliminates negative energy savings
- Origami scheduler is able to amplify folding benefits
- Origami is able to save 46% of the FP units' leakage energy

89 Performance
- Origami significantly reduces the performance overhead compared with Warped-Gates
- Origami scheduler has a positive impact on performance for some workloads

91 Conclusion
- Execution units' energy efficiency is critical
- Take advantage of the spatial and temporal idleness to improve the power gating potential
- Warp folding
  - Adaptively fold warps to coalesce bubbles
- Origami scheduler
  - Schedule warps based on thread activity and instruction type
- Able to save 49% (INT) and 46% (FP) of the execution units' leakage energy
- Negligible performance overhead

92 Questions?

93 THANK YOU!
