

16 Data Layout Assistant

Fortran program + partial data layout specifications
  -> Data Layout Assistant
  -> HPF program with total data layout specifications
  -> target HPF compiler
  -> object code for the target machine

Notes on the Data Layout Assistant:
- regular problems
- dynamic remapping allowed
- invoked only a few times
- not part of the compiler
- can use expensive techniques

42

      REAL c(N, N), a(N, N), b(N, N)
      ! READ (c, a, b)
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        DO j = 2, N
          DO i = 1, N
            c(i, j) = c(i, j) - c(i, j - 1) * a(i, j) / b(i, j - 1)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i, j - 1)
          ENDDO
        ENDDO
        DO i = 1, N
          c(i, N) = c(i, N) / b(i, N)
        ENDDO
        DO j = N - 1, 1, -1
          DO i = 2, N
            c(i, j) = ( c(i, j) - a(i, j + 1) * c(i, j + 1) ) / b(i, j)
          ENDDO
        ENDDO
        ! Downward and upward sweeps along columns
        DO j = 1, N
          DO i = 2, N
            c(i, j) = c(i, j) - c(i - 1, j) * a(i, j) / b(i - 1, j)
            b(i, j) = b(i, j) - a(i, j) * a(i, j) / b(i - 1, j)
          ENDDO
        ENDDO
        DO j = 1, N
          c(N, j) = c(N, j) / b(N, j)
        ENDDO
        DO j = 1, N
          DO i = N - 1, 1, -1
            c(i, j) = ( c(i, j) - a(i + 1, j) * c(i + 1, j) ) / b(i, j)
          ENDDO
        ENDDO
      ENDDO
      ! WRITE (c, b)

43

Static column-wise layout:

      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
        ! Downward and upward sweeps along columns
        ...
      ENDDO

Dynamic row and column-wise layout:

      REAL c(N, N), a(N, N), b(N, N)
!HPF$ TEMPLATE X(N, N)
!HPF$ DYNAMIC X
!HPF$ ALIGN c(i, j), a(i, j), b(i, j) WITH X(i, j)
!HPF$ DISTRIBUTE X(*, BLOCK)
      ...
      DO iter = 1, max
        ! Forward and backward sweeps along rows
        ...
!HPF$ REDISTRIBUTE X(BLOCK, *)
        ! Downward and upward sweeps along columns
        ...
!HPF$ REDISTRIBUTE X(*, BLOCK)
      ENDDO

45 The code of slide 42 partitioned into phases, shown next to its PCFG. Phases 2 to 7 lie inside the DO iter = 1, max loop.

Phase 1: READ (c, a, b); references c, a, b
Phase 2: forward sweep along rows; references c, a, b
Phase 3: c(i, N) = c(i, N) / b(i, N); references c, b
Phase 4: backward sweep along rows; references c, a, b
Phase 5: downward sweep along columns; references c, a, b
Phase 6: c(N, j) = c(N, j) / b(N, j); references c, b
Phase 7: upward sweep along columns; references c, a, b
Phase 8: WRITE (c, b); references c, b

51

      DO i = 1, n
        y(i, 1) = x(i, 1) + x(1, i)
      ENDDO

The CAG has one node per array dimension: $x_1$, $x_2$, $y_1$, $y_2$. A 0-1 variable such as $y_{12}$ is switched on when dimension 1 of y is placed in partition 2.

NODE constraints

Each node is in exactly one partition:
$y_{11} + y_{12} = 1$
$y_{21} + y_{22} = 1$
$x_{11} + x_{12} = 1$
$x_{21} + x_{22} = 1$

Two dimensions of the same array must not be in the same partition:
$y_{11} + y_{21} \le 1$
$y_{12} + y_{22} \le 1$
$x_{11} + x_{21} \le 1$
$x_{12} + x_{22} \le 1$

EDGE constraints

An edge is switched on IFF the source and sink are switched on. The IN-constraints bound the edge variables by the node variables of y, and the OUT-constraints bound them by the node variables of x. The individual inequalities are not legible in this copy.
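The NODE constraints on this slide can be checked by brute force. The following is a minimal sketch, not the 0-1 solver the slides refer to: choosing one partition per node stands in for the "exactly one partition" equalities, and the two edges in `EDGES` are an assumption read off the subscripts of `y(i, 1) = x(i, 1) + x(1, i)`. The function names are hypothetical.

```python
from itertools import product

# One CAG node per array dimension.
NODES = ["x1", "x2", "y1", "y2"]
# Dimensions of the same array must land in different partitions.
SAME_ARRAY = [("x1", "x2"), ("y1", "y2")]
PARTITIONS = (1, 2)

# Assumed alignment preferences: the loop index i subscripts y's first
# dimension, x's first dimension, and x's second dimension.
EDGES = [("x1", "y1"), ("x2", "y1")]


def feasible_assignments():
    """Enumerate node -> partition maps that satisfy the NODE constraints."""
    result = []
    # Picking exactly one partition per node encodes x_i1 + x_i2 = 1.
    for choice in product(PARTITIONS, repeat=len(NODES)):
        assign = dict(zip(NODES, choice))
        # x_1k + x_2k <= 1 for every partition k.
        if all(assign[u] != assign[v] for u, v in SAME_ARRAY):
            result.append(assign)
    return result


def satisfied_edges(assign, edges):
    """Return the edges whose source and sink share a partition."""
    return [(u, v) for u, v in edges if assign[u] == assign[v]]
```

Under these assumptions there are four feasible assignments, and each satisfies exactly one of the two edges. That is the alignment conflict the example is built to show.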

52 [Figure: candidate alignments of the dimensions a1, a2 and b1, b2, drawn as groupings of the four nodes, including { a1 b1 a2 b2 }, { a1 b2 a2 b1 }, { a1 a2 b2 b1 } and { a1 a2 b1 b2 }.]


66

      TEMPLATE PROG_TEMPLATE(N, N, N)
      ALIGN A(I, J, K) WITH PROG_TEMPLATE(I, J, K)

      do 10 k = 1, N
        do 10 j = 1, N
          do 10 i = j, N
   10       A(i, j, k) = ...

[Figure: array A drawn with axes I, J, K.]

67 PCFG and candidate layout search spaces

      TEMPLATE PROG(N, N)
      ALIGN c(I, J), a(I, J), b(I, J) WITH PROG(I, J)

Candidate layouts:

      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(BLOCK, *) ONTO PROCS

      PROCESSORS PROCS(8)
      DISTRIBUTE PROG(*, BLOCK) ONTO PROCS

      PROCESSORS PROCS(2, 4)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS

      PROCESSORS PROCS(4, 2)
      DISTRIBUTE PROG(BLOCK, BLOCK) ONTO PROCS

[Figure: the PCFG with phases 1 to 8 inside and around the DO iter = 1, max loop, each phase paired with its candidate layout search space P1 to P8.]
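The four candidate layouts on this slide are the four ways to arrange 8 processors in a grid of at most two dimensions. A minimal sketch of such an enumeration, assuming BLOCK distributions only and using a hypothetical `candidate_layouts` helper (the slides do not say how the search spaces are generated):

```python
def candidate_layouts(num_procs):
    """List BLOCK distributions of a 2-D template over num_procs processors.

    Each entry is (processor_grid, distribution_format), with one entry per
    factorization num_procs = p1 * p2.
    """
    layouts = []
    for p1 in range(1, num_procs + 1):
        if num_procs % p1:
            continue  # p1 does not divide the processor count
        p2 = num_procs // p1
        if p1 == 1 and p2 == 1:
            layouts.append(((1,), ("*", "*")))
        elif p2 == 1:
            # All processors along the first dimension.
            layouts.append(((p1,), ("BLOCK", "*")))
        elif p1 == 1:
            # All processors along the second dimension.
            layouts.append(((p2,), ("*", "BLOCK")))
        else:
            layouts.append(((p1, p2), ("BLOCK", "BLOCK")))
    return layouts
```

For `num_procs = 8` this yields the grids `(8,)` with `(*, BLOCK)`, `(2, 4)` and `(4, 2)` with `(BLOCK, BLOCK)`, and `(8,)` with `(BLOCK, *)`, matching the four candidates listed above.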

78 [Figure: the PCFG (phases 1 to 8, with DO iter = 1, max) next to its DLG. In the DLG each phase P1 to P8 has a row layout node and a column layout node for the arrays it references (c, a, b or c, b). Edges between the nodes of consecutive phases carry remapping costs in multiples of T; the labels include 3T, 3T (max-1), 3T max, 2T max, T max and 2T. Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b.]
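The DLG turns layout selection into choosing one node per phase so that node costs plus remapping edge costs are minimal. As an illustration only, the sketch below does this by dynamic programming over a straight-line sequence of phases. This is a simplification swapped in for the example: it ignores the iteration loop and is not the 0-1 formulation used on the later slides. All costs and names are hypothetical.

```python
def select_layouts(node_cost, remap_cost):
    """Pick one layout per phase minimizing execution plus remapping cost.

    node_cost[p][k]     -- cost of phase p under candidate layout k
    remap_cost(p, j, k) -- cost of switching from layout j after phase p
                           to layout k before phase p + 1
    Returns (total_cost, [chosen layout index per phase]).
    """
    # best[k] = (cheapest cost of the phases so far ending in layout k, path)
    best = [(cost, [k]) for k, cost in enumerate(node_cost[0])]
    for p in range(1, len(node_cost)):
        nxt = []
        for k, cost in enumerate(node_cost[p]):
            prev_cost, path = min(
                (best[j][0] + remap_cost(p - 1, j, k), best[j][1])
                for j in range(len(best))
            )
            nxt.append((prev_cost + cost, path + [k]))
        best = nxt
    return min(best)


# Hypothetical costs: two row-friendly phases, then two column-friendly ones.
ROW, COL = 0, 1
EXAMPLE_NODE_COST = [[1, 5], [1, 5], [5, 1], [5, 1]]


def uniform_remap(cost_per_switch):
    """Build a remap_cost function charging a flat fee per layout change."""
    return lambda p, j, k: 0 if j == k else cost_per_switch
```

With a cheap remapping fee, `select_layouts(EXAMPLE_NODE_COST, uniform_remap(3))` switches from row to column after the second phase. With a fee of 10 it keeps a single static layout, which mirrors the static versus dynamic trade-off of the slides.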

83, 84 [Figure: three truth tables over three Boolean variables, each with the rows F F F, F F T, F T F, F T T, T F F, T F T, T T F, T T T and the label p = { v, v, v }. Annotations: cost of layout is 0, cost of remapping is 0; cost of layout is 1, cost of remapping is 1.]

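86 [Figure: PCFG of a loop DO i = 1, N with phases 1 to 3, its loop structure DLG with entry candidate layouts, exit candidate layouts and edge labels N and N-1, and the resulting loop summary DLG.]

87 [Figure: PCFG of a branch with phases 1 to 4, where phases 2 and 3 lie between IF and FI, its branch structure DLG with entry candidate layouts (P1) and exit candidate layouts (P4), and the resulting branch summary DLG.]

88 [Figure: PCFG with an entry node, phases 1 to 3 with search spaces P1 to P3, and an exit node, next to the outermost DLG.]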

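90 [Figure: for each phase, the row layout and column layout candidates, each with its own 0-1 variable.]

One candidate layout is selected per phase:
P1 (c, a, b): $x_{11} + x_{12} = 1$
P2 (c, a, b): $x_{21} + x_{22} = 1$
P3 (c, b): $x_{31} + x_{32} = 1$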

92 [Figure: DLG fragment over phases P2 to P5 with node variables $x_{21}$, $x_{22}$, $x_{31}$, $x_{32}$, $x_{41}$, $x_{51}$, $x_{52}$. IN constraints and OUT constraints relate the remapping edge variables to the node variable $x_{41}$, in a compact formulation and in a disaggregated formulation. Legend: row layout; column layout; remapping of c, a, and b; remapping of a; remapping of c and b. The individual equations are not legible in this copy.]

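93 [Figure: DLG fragment over phases P2 (c, a, b), P3 (c, b) and P4 (c, a, b), with a "remapping of a" edge between the nodes $x_{22}$ and $x_{41}$. Legend: row layout; column layout; remapping of a.]

With $x$ the 0-1 variable of that edge, the legible constraint is
$x_{22} + x_{41} \ge 2x$
A second inequality over the same variables is only partly legible in this copy.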
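This slide ties a remapping edge variable to the variables of the two layout nodes it connects, here $x_{22}$ and $x_{41}$. A minimal sketch, assuming the pair of inequalities is the usual linearization $2x \le x_{22} + x_{41}$ and $x_{22} + x_{41} - x \le 1$ (the second is an assumption, since it is only partly legible in the source), checks that together they switch the edge on exactly when both nodes are on. The function names are hypothetical.

```python
from itertools import product


def edge_constraints_hold(x_src, x_dst, x_edge):
    """Check the two linear inequalities tying an edge to its end nodes."""
    # 2 * x_edge <= x_src + x_dst : edge on forces both nodes on.
    # x_src + x_dst - x_edge <= 1 : both nodes on forces the edge on.
    return 2 * x_edge <= x_src + x_dst and x_src + x_dst - x_edge <= 1


def forced_edge_values():
    """Map each 0-1 choice of (x_src, x_dst) to the edge values allowed."""
    table = {}
    for x_src, x_dst in product((0, 1), repeat=2):
        table[(x_src, x_dst)] = [
            e for e in (0, 1) if edge_constraints_hold(x_src, x_dst, e)
        ]
    return table
```

For every 0-1 choice of the two node variables exactly one edge value remains, and it equals their logical AND. The remapping cost of the edge is therefore charged only when both layouts are selected.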


104 [Figure: Training Set for SHIFT Patterns (8 Processors). x-axis: Message Size in Bytes (x 10^4). y-axis: Execution Time in Microseconds (x 10^4). Curves: high latency, unit; high latency, non unit; low latency, non unit; low latency, unit.]

105 [Figure: Training Set for Broadcast Pattern with Unit Stride. x-axis: Message Size in Bytes (x 10^4). y-axis: Execution Time in Microseconds (x 10^4). One curve per processor count: 16, 8, 4 and 2 procs.]

106 [Figure: measured time and estimated time, in seconds, for the row, column and transpose cases; double, 16 processors, 512 x 512.]


108 [Figure: two panels, Measured and Estimated. x-axis: Number of Processors. y-axis: Execution Time in Seconds. Curves: static row, static column, remapped.]

109 [Figure: two panels, Measured and Estimated. x-axis: Number of Processors. y-axis: Execution Time in Seconds. Curves: static 1. dimension, static 2. dimension, static 3. dimension, remapped.]

110 [Figure: three panels, Measured, Estimated (pre-determined branch probabilities) and Estimated (default branch probabilities). x-axis: Number of Processors. y-axis: Execution Time in Seconds. Curves: static row, static column, remapped.]


112 [Figure: two panels, Measured and Estimated. x-axis: Number of Processors. y-axis: Execution Time in Seconds. Curves: static row, static column.]

