Loop Interchange. Loop Transformations. Taxonomy. do I = 1, N do J = 1, N S 1 A(I,J) = A(I-1,J) + 1 enddo enddo. Loop unrolling.

Size: px

Start display at page:

Download "Loop Interchange. Loop Transformations. Taxonomy. do I = 1, N do J = 1, N S 1 A(I,J) = A(I-1,J) + 1 enddo enddo. Loop unrolling."

Miranda Gilbert
6 years ago
Views:

1 Advanced Topics Which Loops are Parallel? review Optimization for parallel machines and memory hierarchies Last Time Dependence analysis Today Loop transformations An example - McKinley, Carr, Tseng loop transformations to improve cache performance do = 1, N do J = 1, N S 1 A(,J) = A(,J-1) + 1 do = 1, N do J = 1, N S 2 A(,J) = A(-1,J-1) + 1 do = 1, N S 3 do J = 1, N B(,J) = B(-1,J+1) + 1 J A dependence D = (d 1,...,d k ) is carried at level i, if d i is the first nonzero element of the distance/direction vector. A loop l i is parallel, if a dependence D j carried at level i. Either distance vector direction vector D j d 1,...,d i 1 > 0 d 1,...,d i 1 = < OR d 1,...,d i = 0 d 1,...,d i = = CS 380C Lecture 24 1 Locality Analysis CS 380C Lecture 24 2 Locality Analysis

2 Loop Transformations Loop nterchange Taxonomy Loop unrolling Loop interchange Loop fusion do = 1, N do J = 1, N S 1 A(,J) = A(-1,J) + 1 enddo enddo Loop distribution (a.k.a. fission) Loop skewing Strip mine and interchange (a.k.a. tiling & blocking) Unroll-and-jam (a variety of tiling) do = 1, N do J = 1, N S 2 B(,J) = B(-1,J+1) + 1 enddo enddo J Loop reversal Loop interchange is safe iff it does not reverse the execution order of the source and sink of any dependence in the nest, i.e., if the distance vector would become negative. Enables parallelization of outer and/or inner loops Changes execution order of the statements Can improve reuse J CS 380C Lecture 24 3 Locality Analysis CS 380C Lecture 24 4 Locality Analysis

3 Loop Fusion Loop Distribution s 1 s 2 s 1 s 2 = loop fusion = do i = 2, n a(i) = b(i) do i = 2, n c(i) = b(i) a(i-1) do i = 2, n a(i) = b(i) c(i) = b(i) a(i-1) = loop distribution = s 1 s 2 s 2 s 1 do i = 2, n a(i) = b(i) c(i) = b(i) a(i+1) do i = 2, n c(i) = b(i) a(i+1) do i = 2, n a(i) = b(i) = loop distribution = Loop Fusion is safe iff no forward dependence between nests becomes a backward loop carried dependence. Would fusion be safe if s 2 referenced a(i+1)? Benefits Reuse Eliminates synchronization between parallel loops Reduced loop overhead Loop Distribution is safe iff statements involved in a cycle of dependences (recurrence) remain in the same loop, & if a dependence between two statements placed in different loops, it must be forward. Benefits Partial parallelization Reduces resource requirements Enables other transformations CS 380C Lecture 24 5 Locality Analysis CS 380C Lecture 24 6 Locality Analysis

4 Loop Skewing and nterchange Strip Mine and nterchange do = 1, 100 do J = 2, 100 A(,J) = A(-1,J) + A(,J-1) do = 1, 100 do J = +2, i+100 A(,J) = A(-1,J) + A(,J-1) = Strip Mine = do = 1, n do J = 1, n A(J,) = B(J) C() = nterchange = do = 1, n, tile do =, + tile -1 do J = 1, n A(J,) = B(J) C() do = 1, n, tile do J = 1, n do =, + tile -1 A(J,) = B(J) C() J J Loop skewing is always safe. The original dependences, (1,0) & (0,1) prevent interchange and parallelization. t changes the dependences to (1,1) & (0,1) and after interchange, (1,1) & (1,0), the inner loops is parallel. Enables inner loop parallelization J Strip Mining is always safe. With interchange it enables loop invariant reuse by changing the shape of the iteration space J CS 380C Lecture 24 7 Locality Analysis CS 380C Lecture 24 8 Locality Analysis

5 mproving Reuse with Loop Transformations Motivation: Enable portable programming without sacrificing performance optimization framework cache model compound loop transformation algorithm permutation distribution results transformation simulation performance fusion reversal Optimization Framework 1. improve order of memory accesses to exploit all levels of the memory hierarchy = cache line size 2. Tile to fit in cache, second level cache, TLB = size of cache(s), replacement policy, associativity, 3. register tiling via unroll-and-jam and scalar replacement = number and type of registers Assumptions cls - the cache line size in terms of data items Fortran column-major order interference occurs rarely for small numbers of inner loop iterations CS 380C Lecture 24 9 Locality Analysis CS 380C Lecture Locality Analysis

6 Reuse temporal locality spatial locality reuse of a specific location reuse of adjacent locations (cache lines) do j = 1, 100 do k = 1, 100 do i = 1, 100 c(i,j) = c(i,j) + a(i,k) To Determine Temporal and Spatial Reuse: b(k,j) Loop Cost of dmxpy From Linpackd Loop Cost in cache lines, cls = 4 do j = 1, n2 do i = 1, n1 y(i)= y(i) + x(j) m(i,j) reference group loop i loop j y(i) 1/4 n1 n2 1 n1 x(j) 1 n2 1/4 n2 n1 m(i,j) 1/4 n1 n2 n2 n1 total loop cost 1/2 n1 n2 + n2 5/4 n1 n2 + n1 for each loop l in a nest, consider l innermost group references with temporal locality reference groups compute the cost in cache lines accessed loop cost rank the loops based on their cost = memory order CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

7 Loop permutation key insight: f loop l promotes more reuse than loop k at the innermost position, then it will also promote more reuse at any outer position. memory order (a) is it legal? (b) if not find a nearbypermutation O(n 2 ) = avoids combinatorial search present in other algorithms NearbyPermutation nput: O = {i 1,i 2,...,i n }, the original loop ordering DV = set of original legal direction vectors for l n L = {i σ1,i σ2,...,i σn }, a permutation ofo Output: P a nearby permutation ofo Algorithm: P = /0 ; k = 0 ; m = n whilel /0 for j = 1,m l = l j L if direction vectors for {p 1,..., p k,l} are legal P = {p 1,..., p k,l} L =L {l} ; k = k+1 ; m = m 1 break for endif endfor endwhile CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

8 Matrix Multiply - execution times in seconds 60 Execution Times (in seconds) vs. Loop Organization Sun Sparc2 ntel i860 BM RS/6000 JK KJ JK JK KJ KJ - ORDER Loop Fusion Erlebacher Fortran 90 loops for AD ntegration DO = 2, N X(,1:N) = X(,1:N) - X(-1,1:N) A(,1:N)/B(-1,1:N) B(,1:N) = B(,1:N) - A(,1:N) A(,1:N)/B(-1,1:N) DO = 2, N simple translation to Fortran 77 DO K = 1, N X(,K) = X(,K) - X(-1,K) A(,K)/B(-1,K) DO K = 1, N B(,K) = B(,K) - A(,K) A(,K)/B(-1,K) DO K = 1, N loop fusion & interchange DO = 2, N X(,K) = X(,K) - X(-1,K) A(,K)/B(-1,K) B(,K) = B(,K) - A(,K) A(,K)/B(-1,K) Execution times in seconds. Memory Order Processor Original Distributed Fused Sun Sparc ntel i BM RS Sun Sparc2 ntel i860 BM RS/6000 JK KJ JK JK KJ KJ - ORDER CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

9 Loop Distribution Cholesky Factorization Algorithm Summary DO K = 1,N DO K = 1,N A(K,K) = SQRT(A(K,K)) A(K,K) = SQRT(A(K,K)) DO = K+1,N DO = K+1,N A(,K) = A(,K)/A(K,K) A(,K) = A(,K)/A(K,K) DO J = K+1, DO J = K,N A(,J) = A(,K)*A(J,K) DO =J+1,N A(,J) = A(,K)*A(J,K) 12 = loop distribution & triangular interchange = 10 8 Execution times (in seconds) for each nestl in a set of adjacent nests compute reference groups for each l i compute loop cost for each l i and sort permutation with reversal? fuse inner loops and permute? distribute and permute? 6 fuse nestsl? Sun Sparc2 ntel i860 BM RS/6000 KJ JK KJ KJ JK JK - ORDER CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

10 Number of Programs Results Achieving Memory Order for Loop Nests test suite Perfect Benchmarks SPEC Benchmarks NAS Benchmarks 4 additional programs 16 experiments on our ability to transform programs simulated hit rates for RS/6000 and i860 execution times on an RS/ Original Final <= 20 >=40 >= 60 >= 70 >=80 >= 90 Percentage of Loop Nests in Memory Order CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

11 Number of Programs Achieving Memory Order for nner Loops Simulated Cache Hit Rates cache1: RS/ K cache, 4-way, 128 byte cache line cache2: i860 8K cache, 2-way, 32 byte cache line for RS/ Original Final <= 20 >=40 >= 60 >= 70 >=80 >= 90 Percent of nner Loops in Memory Order = in 12 of 27 programs, optimized procedures started with 100% hit rates = average hit rate for original programs Memory Order Opt. Proc. Hit Rates Perfect Org Prm Fail Cache 1 Cache 2 Club % org final org final adm arc2d bdna dyfesm flo mdg mg3d ocean qcd spec track trfd CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

12 Performance Results in Seconds on RS6000 Program Original Transformed Speedup arc2d dyfesm flo dnasa7 (btrix) dnasa7 (emit) dnasa7 (gmtry) dnasa7(vpenta) applu appsp linpackd simple wave Summary Recap of Transformation Results 80 % of nests were permuted into memory order 85 % of inner loops were permuted into memory order loop permutation is the most effective optimization 229 candidates for fusion, resulting in 80 nests 23 nests were distributed, resulting in 52 nests Observations many programs started out with high hit ratios smaller cache sizes result in higher improvements in hit rates = regardless of the original target architecture, compiler optimizations improve locality for uniprocessors CS 380C Lecture Locality Analysis CS 380C Lecture Locality Analysis

13 Next Time Compiling for an EDGE Arcuitecture Read: D. Burger et al., Compiling for TRPS: Scaling to the End of Silicon with EDGE Architectures, EEE Computer, pp , July CS 380C Lecture Locality Analysis

Dependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo

Dependence Analysis. Dependence Examples. Last Time: Brief introduction to interprocedural analysis. do I = 2, 100 A(I) = A(I-1) + 1 enddo Dependence Analysis Dependence Examples Last Time: Brief introduction to interprocedural analysis Today: Optimization for parallel machines and memory hierarchies Dependence analysis Loop transformations