Cache Oblivious Stencil Computations

S. Hunold, J. L. Träff, F. Versaci
Lectures on High Performance Computing, 13 April 2015
F. Versaci (TU Wien)

References

Matteo Frigo and Volker Strumpen. Cache oblivious stencil computations. In: ICS. 2005, pp. 361-366.
Matteo Frigo and Volker Strumpen. The memory behavior of cache oblivious stencil computations. In: The Journal of Supercomputing 39.2 (2007), pp. 93-112.

Heat Diffusion: Problem definition

Let $\mathbf{x} = (x, y, z)$ be a point in space.
Let $t$ be the time.
Let $u(\mathbf{x}, t)$ be the temperature.
Let $\alpha$ be the thermal diffusivity.

Heat equation:

$$\frac{\partial u}{\partial t} = \alpha \nabla^2 u = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \right)$$

Intuition: the Laplacian is a local averaging operator. If the temperature around $\mathbf{x}$ is higher than in $\mathbf{x}$, then $u(\mathbf{x})$ will increase accordingly.

Heat Diffusion: Discretization of the unidimensional case

We now want to discretize the 1D case:

$$\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2}$$

Assuming time horizon $\tau$ and length $\lambda$, we define:

$$\Delta t := \frac{\tau}{T} \qquad \Delta x := \frac{\lambda}{N - 1}$$

We approximate the second derivative with the second order finite difference, finally obtaining

$$\frac{u(t + \Delta t, x) - u(t, x)}{\Delta t} = \alpha \, \frac{u(t, x + \Delta x) - 2u(t, x) + u(t, x - \Delta x)}{(\Delta x)^2}$$

We also assume some boundary conditions (e.g., fixed values or cyclic space).

Heat Diffusion: Discretization of the unidimensional case, naive algorithm

Consider a matrix $V$ such that $V(i, j) := u(i \Delta t, j \Delta x)$. Then

$$V(i + 1, j) = V(i, j) + \frac{\alpha \Delta t}{(\Delta x)^2} \bigl( V(i, j + 1) - 2V(i, j) + V(i, j - 1) \bigr)$$

We know the initial temperature: $V(0, j)$.
We want to compute the temperature at time $T$: $V(T, j)$.
We do not need to keep the whole $V$ in memory (in-place update).
We can just keep the last two rows, $t$ and $t + 1$, by accessing $V(i \bmod 2, j)$ instead of $V(i, j)$.

    for (int i = 0; i < T; ++i)          /* time steps */
        for (int j = 0; j < N; ++j)      /* space points */
            update(V, (i + 1) % 2, j);   /* write row (i+1) mod 2 from row i mod 2 */
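The naive two-row scheme above can be sketched in C as follows. This is only an illustration under stated assumptions: the name `heat_naive`, the flat array layout, the parameter `r` (standing for $\alpha \Delta t / (\Delta x)^2$) and the choice of fixed boundary values at both ends are not from the slides. The explicit scheme is known to need $r \le 1/2$ for stability.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive explicit solver for the 1D heat equation (illustrative sketch).
 * v holds two rows of n points each: row (i % 2) stores time step i.
 * r = alpha * dt / (dx * dx); the two end points are fixed boundary values.
 * Returns the index (0 or 1) of the row holding the result after `steps` steps.
 */
int heat_naive(double *v, size_t n, int steps, double r)
{
    assert(n >= 3 && steps >= 0);
    v[n] = v[0];                 /* copy the fixed boundaries into row 1 */
    v[2 * n - 1] = v[n - 1];
    for (int i = 0; i < steps; ++i) {
        const double *cur = v + (size_t)(i % 2) * n;    /* row for step i     */
        double *next = v + (size_t)((i + 1) % 2) * n;   /* row for step i + 1 */
        for (size_t j = 1; j + 1 < n; ++j)              /* interior points    */
            next[j] = cur[j] + r * (cur[j + 1] - 2.0 * cur[j] + cur[j - 1]);
    }
    return steps % 2;
}
```

A caller would allocate `2 * n` doubles, fill the first `n` with the initial temperature, and read the result from row `heat_naive(...)`.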

Iterative Methods for Solving Linear Systems

We want to solve a linear system $Ax = b$, with $A$ being an $n \times n$ matrix.
We may adopt a direct method (e.g., Gaussian elimination) and compute $x = A^{-1} b$.
This takes $O(n^3)$ time and $O(n^2)$ space (even if $A$ is sparse).

Iterative splitting methods

We write $A = M + N$, with $M$ invertible.
The equation can be rewritten as $x = M^{-1}(b - Nx)$.
Consider the related iteration

$$x^{(t+1)} = M^{-1}\bigl(b - N x^{(t)}\bigr)$$

We are interested in cases in which inverting $M$ is easy (i.e., $O(n^2)$) and/or $M^{-1}$ and $N$ are sparse whenever $A$ is.

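Gauss-Seidel Method: Decomposition

We decompose $A = (a_{ij})_{ij}$ as $A = L + U$:

$$L = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \qquad U = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}$$

$L$ is lower triangular (diagonal included) and $U$ is upper triangular.
The iteration is thus

$$x^{(t+1)} = L^{-1}\bigl(b - U x^{(t)}\bigr)$$

Note that

$$U x^{(t)} = \begin{pmatrix} \sum_{j=2}^{n} a_{1j} x_j^{(t)} \\ \sum_{j=3}^{n} a_{2j} x_j^{(t)} \\ \vdots \\ a_{n-1,n} x_n^{(t)} \\ 0 \end{pmatrix} = \left( \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} \right)_{i \in \{1, \ldots, n\}}$$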

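Gauss-Seidel Method: System to be solved

We want to solve (for $x^{(t+1)}$)

$$L x^{(t+1)} = b - U x^{(t)}$$

that is,

$$\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1^{(t+1)} \\ x_2^{(t+1)} \\ \vdots \\ x_n^{(t+1)} \end{pmatrix} = \begin{pmatrix} q_1^{(t)} \\ q_2^{(t)} \\ \vdots \\ q_n^{(t)} \end{pmatrix}$$

where

$$q_i^{(t)} := b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)}$$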

Gauss-Seidel Method: Update iteration

Since $L$ is triangular we can solve the system without inverting $L$, by means of forward substitution:

$$\sum_{j=1}^{i} a_{ij} x_j^{(t+1)} = q_i^{(t)} \quad \Longrightarrow \quad x_i^{(t+1)} = \frac{1}{a_{ii}} \left( q_i^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \right)$$

Finally, the complete update iteration is

$$x_i^{(t+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \right)$$

The method converges if and only if all the eigenvalues of the iteration matrix have absolute value less than 1:

$$\rho(-L^{-1} U) < 1$$
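The update iteration can be sketched in C as a single sweep over a dense matrix. The function name `gauss_seidel_sweep` and the row-major layout are assumptions made for this illustration, and the indices are 0-based rather than 1-based as in the formulas.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One Gauss-Seidel sweep for A x = b (illustrative sketch).
 * a is an n x n matrix in row-major order with nonzero diagonal.
 * x is overwritten: on entry it holds x^(t), on exit x^(t+1).
 */
void gauss_seidel_sweep(const double *a, const double *b, double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        for (size_t j = 0; j < n; ++j)
            if (j != i)
                s -= a[i * n + j] * x[j];  /* j < i: new value, j > i: old value */
        assert(a[i * n + i] != 0.0);
        x[i] = s / a[i * n + i];
    }
}
```

Calling the sweep repeatedly until the change in `x` is small gives the full iteration. No convergence check is included here.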

Gauss-Seidel Method: Band matrices

$$x_i^{(t+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \right)$$

When updating $x_i^{(t)} \to x_i^{(t+1)}$ we use $x_j^{(t)}$ for $j > i$ and $x_j^{(t+1)}$ for $j < i$.
This means that we can update the vector $x$ in place.
If the matrix is banded (e.g., tridiagonal), then the value of $x_i^{(t+1)}$ depends only on some neighbour values.
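For a tridiagonal matrix the sweep reduces to a three-point update, which is what makes it a stencil. The sketch below assumes the three diagonals are stored in separate arrays `lo`, `di` and `up`; both that storage scheme and the function name are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One in-place Gauss-Seidel sweep for a tridiagonal system (illustrative sketch).
 * Row i of the matrix is (lo[i], di[i], up[i]) placed at columns i-1, i, i+1;
 * lo[0] and up[n-1] are ignored.
 */
void gauss_seidel_tridiag_sweep(const double *lo, const double *di,
                                const double *up, const double *b,
                                double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double s = b[i];
        if (i > 0)
            s -= lo[i] * x[i - 1];   /* left neighbour: already updated */
        if (i + 1 < n)
            s -= up[i] * x[i + 1];   /* right neighbour: still the old value */
        assert(di[i] != 0.0);
        x[i] = s / di[i];
    }
}
```

Each new value reads only its two neighbours, so the sweep has the same access pattern as the 1D heat update, except that the left neighbour is already at the new time step.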

Stencil Computations: Definition

Given a multidimensional array $u^{(0)}$, we construct a sequence of arrays $u^{(t)}$ by updating

$$u^{(t+1)}(x) = \mathrm{kernel}\bigl( u^{(t)}(y) : y \in B(x, r) \bigr)$$

I.e., the new value of $u$ in $x$ is a function of the old values in some neighbourhood of $x$ (of radius $r$).
Typically, $u$ is updated in place and $r$ is small (e.g., $r = 1, \ldots, 3$).
Examples: PDEs, Jacobi and Gauss-Seidel methods, cellular automata, image processing.

Stencil Computations: Naive vs. cache-aware implementations

    for (t ∈ [0, T[)
        for (x ∈ X)
            update(u, t + dt, x);

Consider a two-level memory with cache size $Z$ and cache line size $B$.
Let $p$ be the size of the computed space.
For large $p$ the naive algorithm incurs $\Theta\left(\frac{p}{B}\right)$ misses (i.e., a miss every time a block is accessed).
Optimal cache-aware algorithms incur $\Theta\left(\frac{p}{B Z^{1/n}}\right)$ misses, $n$ being the number of space dimensions.
This can be done by time skewing, i.e., by cleverly exploring the $(n + 1)$-dimensional timespace.
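As a rough illustration of the gap between the two bounds, the ratio can be worked out for sample values. The values $Z = 2^{15}$ elements and $n = 3$ are hypothetical and not from the slides, and the constant factors hidden by $\Theta$ are ignored.

$$\frac{p / B}{p / (B Z^{1/n})} = Z^{1/n}, \qquad Z = 2^{15},\ n = 3 \ \Longrightarrow \ Z^{1/n} = 2^{5} = 32$$

Under these assumptions the optimal algorithm would incur on the order of 32 times fewer misses than the naive one. The advantage shrinks as the number of space dimensions grows.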

Cache Oblivious Algorithm: Unidimensional case

[Figure: trapezoid in the (x, t) plane between times t_0 and t_1, with base from x_0 to x_1, height Δt = t_1 - t_0 and width w]

Recursive algorithm to traverse a trapezoid.
Parameters: $t_0, t_1, x_0, x_1, \dot{x}_0, \dot{x}_1, ds$, with $\dot{x} = \frac{dx}{dt}$ and $ds$ = stencil slope.

Width: average of the parallel sides,

$$w = x_1 - x_0 + \frac{\Delta t \, (\dot{x}_1 - \dot{x}_0)}{2}$$

Volume: number of points in the trapezoid.

Cache Oblivious Algorithm: Unidimensional case, space cut

[Figure: space cut of a trapezoid into T_1 and T_2 in the (x, t) plane, with t_0, t_1, x_0, x_1 and x_m marked]

If wide enough ($w \ge 4 \, \Delta t \, ds$), then cut the space.
The cut is through the center (defined as the average of the vertices).
The slope of the cut is $-ds$ and the mid point $x_m$ is

$$x_m := \frac{x_0 + x_1}{2} + \frac{\Delta t \, (2 \, ds + \dot{x}_0 + \dot{x}_1)}{4}$$

Cache Oblivious Algorithm: Unidimensional case, time cut

[Figure: time cut of a trapezoid into a lower part T_1 and an upper part T_2 in the (x, t) plane, with t_0, s, t_1, x_0 and x_1 marked]

If the trapezoid is not wide enough, then it is cut horizontally.
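The space cut and time cut can be combined into one recursive traversal. The C sketch below is written in the spirit of the Frigo and Strumpen procedure, under several assumptions not stated in the slides: the 1D heat kernel with fixed boundary values, two-row storage, a time cut at half height, the width threshold quoted above, and the names `stencil_t`, `kernel`, `walk1` and `heat_oblivious`.

```c
#include <assert.h>
#include <stddef.h>

/* State shared by the recursion (illustrative names). */
typedef struct {
    double *u;   /* two rows of n points: row (t % 2) holds time step t */
    size_t n;
    double r;    /* alpha * dt / (dx * dx) */
    int ds;      /* stencil slope (1 for a three-point stencil) */
} stencil_t;

/* Compute the value at time t + 1 in x from the values at time t. */
static void kernel(const stencil_t *s, int t, int x)
{
    const double *cur = s->u + (size_t)(t % 2) * s->n;
    double *next = s->u + (size_t)((t + 1) % 2) * s->n;
    next[x] = cur[x] + s->r * (cur[x + 1] - 2.0 * cur[x] + cur[x - 1]);
}

/* Traverse the trapezoid (t0, t1, x0, dx0, x1, dx1) respecting dependencies. */
void walk1(const stencil_t *s, int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    if (dt == 1) {
        for (int x = x0; x < x1; ++x)
            kernel(s, t0, x);
    } else if (dt > 1) {
        /* Space cut if w >= 4 * dt * ds, written as 2w >= 8 * ds * dt. */
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 8 * s->ds * dt) {
            int xm = (2 * (x0 + x1) + (2 * s->ds + dx0 + dx1) * dt) / 4;
            walk1(s, t0, t1, x0, dx0, xm, -s->ds);   /* left part first */
            walk1(s, t0, t1, xm, -s->ds, x1, dx1);   /* then right part */
        } else {
            int h = dt / 2;                          /* time cut at half height */
            walk1(s, t0, t0 + h, x0, dx0, x1, dx1);
            walk1(s, t0 + h, t1, x0 + dx0 * h, dx0, x1 + dx1 * h, dx1);
        }
    }
}

/* Run `steps` time steps on interior points 1 .. n-2; returns the result row. */
int heat_oblivious(double *u, size_t n, int steps, double r)
{
    assert(n >= 3 && steps >= 0);
    stencil_t s = { u, n, r, 1 };
    u[n] = u[0];                 /* fixed boundaries are present in both rows */
    u[2 * n - 1] = u[n - 1];
    walk1(&s, 0, steps, 1, 0, (int)n - 1, 0);
    return steps % 2;
}
```

The left part of a space cut is traversed before the right part because the cut has slope $-ds$, so points in the right part may depend on points in the left part but not the other way round. The recursion never refers to the cache parameters $Z$ and $B$, which is what makes the traversal cache oblivious.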

Cache Oblivious Algorithm: Multidimensional case

n-dimensional trapezoid $T(t_0, t_1, x_0^{(i)}, x_1^{(i)}, \dot{x}_0^{(i)}, \dot{x}_1^{(i)})$

It is the set of points $(t, x^{(0)}, \ldots, x^{(n-1)})$ such that, for all $i$ with $0 \le i < n$,

$$t_0 \le t < t_1$$
$$x_0^{(i)} + \dot{x}_0^{(i)} (t - t_0) \le x^{(i)} < x_1^{(i)} + \dot{x}_1^{(i)} (t - t_0)$$

The projection onto $(t, x^{(i)})$ looks like a unidimensional trapezoid.

Multidimensional algorithm

1. If possible, cut some space dimension.
2. Otherwise, cut the time dimension.

Cache Oblivious Algorithm: I/O Complexity

Lemma. Let $T$ be a trapezoid and $\mathrm{Vol}(T)$ its ($(n + 1)$-dimensional) volume. Let $m := \min\{\Delta t, w_0, \ldots, w_{n-2}\}/2$. Then the ($n$-dimensional) surface of $T$ has measure $O\left(\frac{\mathrm{Vol}(T)}{m}\right)$.

Theorem. Let $T$ be a trapezoid and assume $\Delta t \in \Omega\left(Z^{1/n}\right)$ and $\forall i \; w_i \in \Omega\left(Z^{1/n}\right)$. Then the number of misses incurred by the cache oblivious algorithm is $O\left(\frac{\mathrm{Vol}(T)}{B Z^{1/n}}\right)$.

Cache Oblivious Algorithm: Simulation

Problem dimensions = 1
Stencil slope = 1
Block size = 8 elements
Vector size = 48 points = 6 blocks
Buffer size = 10 blocks
Miss time = 30 × hit time