Cache Oblivious Stencil Computations
S. Hunold, J. L. Träff, F. Versaci
Lectures on High Performance Computing
13 April 2015
F. Versaci (TU Wien)

References

Matteo Frigo and Volker Strumpen. Cache oblivious stencil computations. In: ICS. 2005, pp. 361-366.
Matteo Frigo and Volker Strumpen. The memory behavior of cache oblivious stencil computations. In: The Journal of Supercomputing 39.2 (2007), pp. 93-112.

Heat Diffusion
Problem definition

Let $\mathbf{x} = (x, y, z)$ be a point in space
Let $t$ be the time
Let $u(\mathbf{x}, t)$ be the temperature
Let $\alpha$ be the thermal diffusivity

Heat equation
\[ \frac{\partial u}{\partial t} = \alpha \nabla^2 u \]
\[ \frac{\partial u}{\partial t} = \alpha \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} \right) \]

Intuition
The Laplacian is a local averaging operator: if the temperature around $\mathbf{x}$ is higher than at $\mathbf{x}$, then $u(\mathbf{x})$ will increase accordingly.

Heat Diffusion
Discretization of the unidimensional case

We now want to discretize the 1D case:
\[ \frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2} \]
Assuming time horizon $\tau$ and length $\lambda$, we define:
\[ \Delta t := \frac{\tau}{T} \qquad \Delta x := \frac{\lambda}{N - 1} \]
We approximate the second derivative with the second-order finite difference, finally obtaining
\[ \frac{u(t + \Delta t, x) - u(t, x)}{\Delta t} = \alpha \, \frac{u(t, x + \Delta x) - 2u(t, x) + u(t, x - \Delta x)}{(\Delta x)^2} \]
We also assume some boundary conditions (e.g., fixed values or cyclic space).

Heat Diffusion
Discretization of the unidimensional case

Naive algorithm
Consider a matrix $V$, s.t. $V(i, j) := u(i \Delta t, j \Delta x)$, then
\[ V(i + 1, j) = V(i, j) + \frac{\alpha \Delta t}{(\Delta x)^2} \bigl( V(i, j + 1) - 2V(i, j) + V(i, j - 1) \bigr) \]
We know the initial temperature: $V(0, j)$
We want to compute the temperature at time $T$: $V(T, j)$
We do not need to keep the whole $V$ in memory (in-place update)
We can just keep the last two rows $t$ and $t + 1$ by accessing $V(i \bmod 2, j)$ instead of $V(i, j)$

    for (int i = 0; i < T; ++i)
        for (int j = 0; j < N; ++j)
            update(V, (i + 1) % 2, j);   /* writes row (i+1) mod 2 from row i mod 2 */

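The two-row scheme can be made concrete with a small C sketch. It is only an illustration under stated assumptions: the names `heat_step` and `heat_naive` are hypothetical, the coefficient `c` stands for $\alpha \Delta t / (\Delta x)^2$, and the two end points are held at fixed values (one of the boundary conditions mentioned earlier).

```c
#include <assert.h>
#include <stddef.h>

/* One explicit time step: reads row `old`, writes row `next`.
   c = alpha * dt / (dx * dx). The two end points are held fixed
   (an assumed fixed-value boundary condition). */
void heat_step(const double *old, double *next, size_t n, double c)
{
    assert(n >= 2);
    next[0] = old[0];
    next[n - 1] = old[n - 1];
    for (size_t j = 1; j + 1 < n; ++j)
        next[j] = old[j] + c * (old[j + 1] - 2.0 * old[j] + old[j - 1]);
}

/* Advance T steps keeping only two rows, V[i % 2] and V[(i + 1) % 2].
   Returns the index (0 or 1) of the row that holds the final state. */
int heat_naive(double *V[2], size_t n, int T, double c)
{
    for (int i = 0; i < T; ++i)
        heat_step(V[i % 2], V[(i + 1) % 2], n, c);
    return T % 2;
}
```

A caller allocates two arrays of length `n`, stores the initial temperature in `V[0]`, and reads the result from `V[heat_naive(V, n, T, c)]`.
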
Iterative Methods for Solving Linear Systems

We want to solve a linear system: $Ax = b$
With $A$ being an $n \times n$ matrix
We may adopt a direct method (e.g., Gaussian elimination) and compute $x = A^{-1} b$
This takes $O(n^3)$ time and $O(n^2)$ space (even if $A$ is sparse)

Iterative splitting methods
We write $A = M + N$, with $M$ invertible
The equation can be rewritten as $x = M^{-1}(b - Nx)$
Consider the related iteration
\[ x^{(t+1)} = M^{-1} \bigl( b - N x^{(t)} \bigr) \]
We are interested in cases in which inverting $M$ is easy (i.e., $O(n^2)$) and/or $M^{-1}$ and $N$ are sparse whenever $A$ is

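Gauss-Seidel Method
Decomposition

We decompose $A = (a_{ij})_{ij}$ as $A = L + U$
\[
L = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\qquad
U = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ 0 & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{pmatrix}
\]
$L$ is lower triangular (diagonal included) and $U$ is upper triangular
The iteration is thus
\[ x^{(t+1)} = L^{-1} \bigl( b - U x^{(t)} \bigr) \]
Note that
\[
U x^{(t)} = \begin{pmatrix} \sum_{j=2}^{n} a_{1j} x_j^{(t)} \\ \sum_{j=3}^{n} a_{2j} x_j^{(t)} \\ \vdots \\ a_{n-1,n} x_n^{(t)} \\ 0 \end{pmatrix}
= \Bigl( \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} \Bigr)_{i \in \{1, \ldots, n\}}
\]
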
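Gauss-Seidel Method
System to be solved

We want to solve (for $x^{(t+1)}$)
\[ L x^{(t+1)} = b - U x^{(t)} \]
\[
\begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\begin{pmatrix} x_1^{(t+1)} \\ x_2^{(t+1)} \\ \vdots \\ x_n^{(t+1)} \end{pmatrix}
=
\begin{pmatrix} q_1^{(t)} \\ q_2^{(t)} \\ \vdots \\ q_n^{(t)} \end{pmatrix}
\]
Where
\[ q_i^{(t)} := b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} \]
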
Gauss-Seidel Method
Update iteration

Since $L$ is triangular we can solve the system without inverting $L$, by means of forward substitution:
\[ \forall i \quad \sum_{j=1}^{i} a_{ij} x_j^{(t+1)} = q_i^{(t)} \quad \Longrightarrow \quad x_i^{(t+1)} = \frac{1}{a_{ii}} \Bigl( q_i^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \Bigr) \]
Finally, the complete update iteration is
\[ x_i^{(t+1)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \Bigr) \]
The method converges if and only if all the eigenvalues of the iteration matrix have absolute value less than 1:
\[ \rho\bigl( -L^{-1} U \bigr) < 1 \]

Gauss-Seidel Method
Band matrices

\[ x_i^{(t+1)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=i+1}^{n} a_{ij} x_j^{(t)} - \sum_{j=1}^{i-1} a_{ij} x_j^{(t+1)} \Bigr) \]
When updating $x_i^{(t+1)}$ we use
\[ \begin{cases} x_j^{(t)} & \text{for } j > i \\ x_j^{(t+1)} & \text{for } j < i \end{cases} \]
This means that we can update the vector $x$ in place
If the matrix is banded (e.g., tridiagonal), then the value of $x_i^{(t+1)}$ depends only on some neighbour values

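The in-place property is easiest to see in code. The following C sketch performs one Gauss-Seidel sweep; it assumes dense row-major storage of $A$, and the function name `gauss_seidel_sweep` is hypothetical. For a banded matrix the inner loop would only need to visit the entries inside the band.

```c
#include <assert.h>
#include <stddef.h>

/* One Gauss-Seidel sweep on a dense n x n system, a stored row-major.
   x is overwritten in place: when row i is processed, x[j] already holds
   the new value for j < i and still holds the old value for j > i. */
void gauss_seidel_sweep(const double *a, const double *b, double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double aii = a[i * n + i];
        assert(aii != 0.0);            /* the diagonal must be non-zero */
        double s = b[i];
        for (size_t j = 0; j < n; ++j)
            if (j != i)
                s -= a[i * n + j] * x[j];
        x[i] = s / aii;
    }
}
```

Calling the function repeatedly on the same `x` yields the sequence $x^{(1)}, x^{(2)}, \ldots$ without any second vector.
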
Stencil Computations
Definition

Given a multidimensional array $u^{(0)}$
We construct a sequence of arrays $u^{(t)}$ by updating
\[ u^{(t+1)}(x) = \mathrm{kernel}\bigl( u^{(t)}(y) : y \in B(x, r) \bigr) \]
I.e., the new value of $u$ at $x$ is a function of the old values in some neighbourhood of $x$ (of radius $r$)
Typically, $u$ is updated in place and $r$ is small (e.g., $r = 1, \ldots, 3$)
Examples: PDEs, Jacobi and Gauss-Seidel methods, cellular automata, image processing

Stencil Computations
Naive vs. cache-aware implementations

    for (t ∈ [0, T[)
        for (x ∈ X)
            update(u, t + dt, x);

Consider a two-level memory with cache size $Z$ and cache line size $B$
Let $p$ be the size of the computed space
For large $p$ the naive algorithm incurs $\Theta\left(\frac{p}{B}\right)$ misses (i.e., a miss every time a block is accessed)
Optimal cache-aware algorithms incur $\Theta\left(\frac{p}{B Z^{1/n}}\right)$ misses, $n$ being the number of space dimensions
This can be done by time skewing, i.e., by cleverly exploring the $(n + 1)$-dimensional space-time

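Dividing the two miss bounds above shows the potential saving of a cache-aware schedule. The numbers below are illustrative only: $Z = 32768$ elements is a hypothetical cache size, and the $\Theta$ notation hides constant factors, so the results indicate orders of magnitude rather than measured speedups.

```latex
\[
\frac{\Theta\!\left(\frac{p}{B}\right)}{\Theta\!\left(\frac{p}{B Z^{1/n}}\right)} = \Theta\!\left(Z^{1/n}\right)
\]
\[
n = 1:\; Z^{1/n} = 32768
\qquad
n = 3:\; Z^{1/n} = 32768^{1/3} = 32
\]
```

The gain therefore shrinks quickly as the number of space dimensions grows.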
Cache Oblivious Algorithm
Unidimensional case

[Figure: trapezoid in the $(x, t)$ plane between times $t_0$ and $t_1$, with base from $x_0$ to $x_1$; the width $w$ and the height $\Delta t$ are marked]

Recursive algorithm to traverse a trapezoid
Parameters: $t_0, t_1, x_0, x_1, \dot{x}_0, \dot{x}_1, ds$, with $\dot{x} = \frac{dx}{dt}$ and $ds$ = stencil slope

Width
Average of the parallel sides:
\[ w = x_1 - x_0 + \frac{\Delta t \, (\dot{x}_1 - \dot{x}_0)}{2} \]

Volume
Number of points in the trapezoid

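Width and volume can be computed directly from the trapezoid parameters. The C sketch below is an illustration with hypothetical names (`Trapezoid`, `twice_width`, `volume`); it assumes integer coordinates and half-open intervals, i.e. row $t$ contains the points $x_0 + \dot{x}_0 (t - t_0) \le x < x_1 + \dot{x}_1 (t - t_0)$ for $t_0 \le t < t_1$.

```c
#include <assert.h>

typedef struct {
    int t0, t1;     /* time interval [t0, t1) */
    int x0, x1;     /* base interval [x0, x1) at time t0 */
    int dx0, dx1;   /* slopes dx/dt of the left and right edges */
} Trapezoid;

/* Twice the width w = x1 - x0 + dt * (dx1 - dx0) / 2, kept doubled so
   that the result is always an integer. */
long twice_width(Trapezoid z)
{
    assert(z.t1 >= z.t0);
    long dt = (long)z.t1 - z.t0;
    return 2L * ((long)z.x1 - z.x0) + dt * ((long)z.dx1 - z.dx0);
}

/* Number of integer points (t, x) inside the trapezoid, counted row by row. */
long volume(Trapezoid z)
{
    long count = 0;
    for (int t = z.t0; t < z.t1; ++t) {
        long lo = z.x0 + (long)z.dx0 * (t - z.t0);
        long hi = z.x1 + (long)z.dx1 * (t - z.t0);
        if (hi > lo)
            count += hi - lo;
    }
    return count;
}
```

Note that the discrete point count generally differs slightly from $w \cdot \Delta t$, because the last row sits at $t_1 - 1$ rather than at $t_1$.
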
Cache Oblivious Algorithm
Unidimensional case

Space cut

[Figure: trapezoid in the $(x, t)$ plane between $t_0$ and $t_1$, split into sub-trapezoids $T_1$ and $T_2$ by a sloped cut through $x_m$]

If wide enough ($w \ge 4 \, \Delta t \, ds$), then cut the space
The cut is through the center (defined as the average of the vertices)
The slope of the cut is $ds$ and the mid point $x_m$ is
\[ x_m := \frac{x_0 + x_1}{2} + \frac{\Delta t \, (2\,ds + \dot{x}_0 + \dot{x}_1)}{4} \]

Cache Oblivious Algorithm
Unidimensional case

Time cut

[Figure: trapezoid in the $(x, t)$ plane between $t_0$ and $t_1$, split horizontally into sub-trapezoids $T_1$ and $T_2$, with $s$ marked on the time axis]

If the trapezoid is not wide enough, then it is cut horizontally

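The space cut and the time cut combine into one recursive traversal. The C sketch below is one possible reading of the procedure, not necessarily the listing used in the lecture. Assumptions: the names `walk1`, `kernel`, `N`, `DS`, `C` and `u` are hypothetical; the kernel is the 1D heat update with fixed end points; the space-cut test is taken as $w \ge 4\,\Delta t\,ds$ and evaluated as $2w \ge 8\,\Delta t\,ds$ to stay in integers; the cut is taken to lean left ($dx/dt = -ds$), so that $x_m$ is its position at time $t_0$; the time cut is placed at half the height.

```c
#include <assert.h>

#define N 64    /* number of grid points (hypothetical size) */
#define DS 1    /* stencil slope: a 3-point stencil reaches one cell per step */

static const double C = 0.25;   /* alpha*dt/dx^2, an arbitrary stable value */
double u[2][N];                 /* two time rows, indexed by t mod 2 */

/* Compute cell x at time t+1 from row t; the two end cells stay fixed. */
void kernel(int t, int x)
{
    const double *old = u[t % 2];
    double *next = u[(t + 1) % 2];
    if (x == 0 || x == N - 1)
        next[x] = old[x];
    else
        next[x] = old[x] + C * (old[x + 1] - 2.0 * old[x] + old[x - 1]);
}

/* Traverse the trapezoid with time range [t0, t1), base [x0, x1) at t0,
   and edge slopes dx0 (left) and dx1 (right). */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1)
{
    int dt = t1 - t0;
    assert(dt >= 0);
    if (dt == 1) {
        /* Base case: a single row. */
        for (int x = x0; x < x1; ++x)
            kernel(t0, x);
    } else if (dt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * dt >= 8 * DS * dt) {
            /* Space cut: 2w >= 8*ds*dt, i.e. w >= 4*dt*ds. */
            int xm = (2 * (x0 + x1) + (2 * DS + dx0 + dx1) * dt) / 4;
            walk1(t0, t1, x0, dx0, xm, -DS);   /* left part first */
            walk1(t0, t1, xm, -DS, x1, dx1);   /* right part depends on it */
        } else {
            /* Time cut: lower half first, then the upper half. */
            int s = dt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}
```

With the initial temperature stored in `u[0]`, the call `walk1(0, T, 0, 0, N, 0)` visits the same points as the naive double loop, only in a different order, and leaves the result in `u[T % 2]`. The left sub-trapezoid is processed first because, with a left-leaning cut, its points never read values from the right one.
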
Cache Oblivious Algorithm
Multidimensional case

$n$-dimensional trapezoid
\[ T\bigl( t_0, t_1, x_0^{(i)}, x_1^{(i)}, \dot{x}_0^{(i)}, \dot{x}_1^{(i)} \bigr) \]
It is the set of points $(t, x^{(0)}, \ldots, x^{(n-1)})$ such that $\forall i$, $0 \le i < n$:
\[ t_0 \le t < t_1 \]
\[ x_0^{(i)} + \dot{x}_0^{(i)} (t - t_0) \le x^{(i)} < x_1^{(i)} + \dot{x}_1^{(i)} (t - t_0) \]
The projection onto $(t, x^{(i)})$ looks like a unidimensional trapezoid

Multidimensional algorithm
1 If possible, cut some space dimension
2 Otherwise cut the time dimension

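The set definition translates directly into a membership test. The C sketch below is only an illustration: the type `NdTrapezoid`, the function `in_trapezoid` and the bound `MAX_DIMS` are hypothetical names, and integer coordinates are assumed.

```c
#include <assert.h>

enum { MAX_DIMS = 4 };   /* hypothetical upper bound on space dimensions */

typedef struct {
    int n;                               /* number of space dimensions */
    int t0, t1;                          /* time interval [t0, t1) */
    int x0[MAX_DIMS], x1[MAX_DIMS];      /* bounds per dimension at time t0 */
    int dx0[MAX_DIMS], dx1[MAX_DIMS];    /* edge slopes per dimension */
} NdTrapezoid;

/* Returns 1 if the point (t, x[0], ..., x[n-1]) lies in the trapezoid. */
int in_trapezoid(const NdTrapezoid *z, int t, const int *x)
{
    assert(z->n >= 1 && z->n <= MAX_DIMS);
    if (t < z->t0 || t >= z->t1)
        return 0;
    for (int i = 0; i < z->n; ++i) {
        int lo = z->x0[i] + z->dx0[i] * (t - z->t0);
        int hi = z->x1[i] + z->dx1[i] * (t - z->t0);
        if (x[i] < lo || x[i] >= hi)
            return 0;
    }
    return 1;
}
```

Each iteration of the loop checks exactly the unidimensional condition for the projection onto $(t, x^{(i)})$.
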
Cache Oblivious Algorithm
I/O Complexity

Lemma
Let $T$ be a trapezoid and $\mathrm{Vol}(T)$ its ($(n + 1)$-dimensional) volume. Let $m := \min\{\Delta t, w_0, \ldots, w_{n-1}\}/2$. Then the ($n$-dimensional) surface of $T$ has measure $O\left(\frac{\mathrm{Vol}(T)}{m}\right)$.

Theorem
Let $T$ be a trapezoid and assume $\Delta t \in \Omega\left(Z^{1/n}\right)$ and $\forall i \;\, w_i \in \Omega\left(Z^{1/n}\right)$. Then the number of misses incurred by the cache oblivious algorithm is $O\left(\frac{\mathrm{Vol}(T)}{B Z^{1/n}}\right)$.

Cache Oblivious Algorithm
Simulation

Problem dimensions = 1
Stencil slope = 1
Block size = 8 elements
Vector size = 48 points = 6 blocks
Buffer size = 10 blocks
Miss time = 30 × hit time