Performance Prediction for Tensor Contractions

Paolo Bientinesi, Edoardo Di Napoli, Diego Fabregat, Elmar Peise
AICES, RWTH Aachen, pauldj@aices.rwth-aachen.de
June 3rd, 2014, PASC Conference 14, Zürich, Switzerland

Slide 2 / 18: Tensors, crash course (1/2)
MATHEMATICS, PHYSICS: multilinear map; multidimensional array + metric
COMPUTER SCIENCE: multidimensional array
t-dimensional tensors (t indices): $S_{\alpha\beta\ldots\gamma\delta}$, $S^{\beta\gamma}_{\alpha\ldots\delta}$, $S^{\ldots\gamma\delta}_{\alpha\beta}$, ...
Operations: low-dimensional approximations; contractions
Examples: $S_{\alpha ij} T_{ij}$, $S_{\alpha i\beta} M_{ik} T_{k\gamma}$, $S_{\alpha ij} M_{ik} T_{kh} M_{hj}$, ...

Slide 3 / 18: Crash course (2/2)
Contraction: $S_{\alpha ij} T_{j\delta i}$
- $\alpha, \delta$: free indices
- $i, j$: contracted indices
$V_{\alpha\delta} = S_{\alpha ij} T_{j\delta i}$, that is, $\forall \alpha\ \forall \delta:\ v_{\alpha\delta} = \sum_i \sum_j s_{\alpha ij}\, t_{j\delta i}$
Storage of $S_{\alpha\beta\gamma\ldots}$: $\alpha$ has stride 1, $\beta$ has stride (size of $\alpha$), $\gamma$ has stride (size of $\alpha$) $\cdot$ (size of $\beta$), and so on.
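The contraction and storage scheme on this slide can be sketched in plain Python. This is a minimal illustration, assuming tensors are stored as flat lists with the first index varying fastest; the helper names `strides` and `contract` are hypothetical and not taken from the talk.

```python
def strides(extents):
    """Strides for first-index-fastest storage: 1, n0, n0*n1, ..."""
    out, step = [], 1
    for n in extents:
        out.append(step)
        step *= n
    return out


def contract(S, T, na, ni, nj, nd):
    """v[a,d] = sum_i sum_j s[a,i,j] * t[j,d,i] on flat lists.

    S is indexed (alpha, i, j), T is indexed (j, delta, i),
    and the result V is indexed (alpha, delta).
    """
    sa, si, sj = strides([na, ni, nj])
    tj, td, ti = strides([nj, nd, ni])
    V = [0.0] * (na * nd)
    for d in range(nd):
        for a in range(na):
            acc = 0.0
            for i in range(ni):          # contracted index i
                for j in range(nj):      # contracted index j
                    acc += S[a * sa + i * si + j * sj] * T[j * tj + d * td + i * ti]
            V[a + d * na] = acc
    return V
```

For example, `strides([2, 3, 4])` gives the stride of each of three indices with extents 2, 3, and 4.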

Slide 4 / 18: A well-known contraction (1/2)
$C_{ij} = A_{ik} B_{kj}$
Direct call: C := GEMM(A, B)
1. A is sliced horizontally:
   $C := AB = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} B = \begin{bmatrix} a_1 B \\ \vdots \\ a_m B \end{bmatrix}$
   for i = 1, ..., m: C_i := GEMV(A_i, B)
2. B is sliced vertically:
   $C := AB = A \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix} = \begin{bmatrix} A b_1 & A b_2 & \cdots & A b_n \end{bmatrix}$
   for i = 1, ..., n: C_i := GEMV(A, B_i)

Slide 5 / 18: A well-known contraction (2/2)
$C_{ij} = A_{ik} B_{kj}$
3. A is sliced vertically and B horizontally:
   $\begin{bmatrix} a_1 & \cdots & a_k \end{bmatrix} \begin{bmatrix} b_1 \\ \vdots \\ b_k \end{bmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k$
   for i = 1, ..., k: C += GER(A_i, B_i)
4. A is sliced horizontally and B vertically:
   $\begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix} = \begin{bmatrix} a_1 b_1 & \cdots & a_1 b_n \\ \vdots & \ddots & \vdots \\ a_m b_1 & \cdots & a_m b_n \end{bmatrix}$
   for i = 1, ..., m: for j = 1, ..., n: C_ij := DOT(A_i, B_j)
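The four slicings of C = AB can be checked against each other with a small pure-Python sketch. The functions `dot` and `gemv` below are simple stand-ins written for this illustration, not BLAS bindings, and the rank-1 update is written as an explicit double loop; all names are assumptions.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gemv(A, x):
    """Matrix-vector product A x, with A given as a list of rows."""
    return [dot(row, x) for row in A]


def column(B, j):
    return [row[j] for row in B]


def transpose(B):
    return [list(c) for c in zip(*B)]


def variant1(A, B):
    # A sliced horizontally: row c_i = a_i B, computed as B^T a_i.
    Bt = transpose(B)
    return [gemv(Bt, a) for a in A]


def variant2(A, B):
    # B sliced vertically: column c_j = A b_j.
    cols = [gemv(A, column(B, j)) for j in range(len(B[0]))]
    return transpose(cols)


def variant3(A, B):
    # A sliced vertically, B horizontally: sum of rank-1 updates.
    m, n = len(A), len(B[0])
    C = [[0] * n for _ in range(m)]
    for p in range(len(B)):
        a, b = column(A, p), B[p]
        for i in range(m):
            for j in range(n):
                C[i][j] += a[i] * b[j]
    return C


def variant4(A, B):
    # A sliced horizontally, B vertically: one dot product per entry.
    return [[dot(a, column(B, j)) for j in range(len(B[0]))] for a in A]
```

All four functions return the same matrix for the same inputs; they differ only in which kernel does the work and how often it is invoked.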

Slide 6 / 18: Mathematically equivalent, but...
All experiments: OpenBLAS 0.2.8, Intel Ivy Bridge-EP E5-2680 v2
[Figure: flops/cycle as a function of k for the GEMM, GER, GEMV (two variants), and DOT (two variants) algorithms.]

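Slide 7 / 18: How to use BLAS for contractions?
$R_{\alpha\beta\gamma} := T_{\alpha\beta\sigma} S_{\sigma\gamma}$
1) Slicing along $\beta$: the slices $R_{\alpha 1\gamma}, R_{\alpha 2\gamma}, R_{\alpha 3\gamma}, \ldots, R_{\alpha n\gamma}$ are computed from $T_{\alpha 1\sigma}, T_{\alpha 2\sigma}, T_{\alpha 3\sigma}, \ldots, T_{\alpha n\sigma}$ and $S_{\sigma\gamma}$. Kernel: GEMM
2) Slicing along $\alpha$: the slices $R_{1\beta\gamma}, R_{2\beta\gamma}, R_{3\beta\gamma}, \ldots, R_{m\beta\gamma}$ are computed from $T_{1\beta\sigma}, T_{2\beta\sigma}, T_{3\beta\sigma}, \ldots, T_{m\beta\sigma}$ and $S_{\sigma\gamma}$. Kernel: GEMM / transposition + GEMM
3) Slicing along $\alpha$ and $\beta$: each slice is multiplied with $S_{\sigma\gamma}$. Kernel: GEMV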
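Slicing along β can be sketched as one matrix product per slice. The example below assumes nested Python lists `T[a][b][s]` and `S[s][g]`, uses a plain `matmul` helper where a real code would call GEMM, and compares against a direct triple sum; all function names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product; stands in for a GEMM call."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def contract_by_beta_slices(T, S):
    """R[a][b][g] = sum_s T[a][b][s] * S[s][g], one matrix product per b."""
    na, nb = len(T), len(T[0])
    R = [[None] * nb for _ in range(na)]
    for b in range(nb):
        T_b = [T[a][b] for a in range(na)]   # slice of T: alpha x sigma matrix
        R_b = matmul(T_b, S)                 # slice of R: alpha x gamma matrix
        for a in range(na):
            R[a][b] = R_b[a]
    return R


def contract_reference(T, S):
    """Direct evaluation of the same contraction, used only for checking."""
    ns, ng = len(S), len(S[0])
    return [[[sum(T[a][b][s] * S[s][g] for s in range(ns)) for g in range(ng)]
             for b in range(len(T[0]))]
            for a in range(len(T))]
```

With the identity matrix as `S`, the result equals `T`, which gives a quick sanity check.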

Slide 8 / 18: Taxonomy
$V_{h_1 h_2 \ldots} := S_{i_1 i_2 \ldots} T_{j_1 j_2 \ldots}$
Definition: for a tensor X, count the # of free indices of X.
Class 1: S and T both have 0 free indices. BLAS3, BLAS2, BLAS1
Class 2: S has at least 1 free index and T has 0, or S has 0 and T has at least 1. BLAS3, BLAS2 (+ transp), BLAS1
Class 3: S and T both have at least 1 free index. BLAS3 (+ transp), BLAS2, BLAS1
Reference: Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions, Applied Mathematics and Computation.
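The classification can be sketched as a short function, assuming the three classes are distinguished by whether S and T each have zero or at least one free index (the comparison symbols are damaged in the slide text, so this reading is an assumption). Index names are single characters, and the function names are hypothetical.

```python
def free_indices(x, other):
    """Indices of x that do not appear in other, i.e. are not contracted."""
    return [i for i in x if i not in other]


def contraction_class(s_idx, t_idx):
    """Return 1, 2, or 3 for the contraction of S[s_idx] with T[t_idx]."""
    fs = len(free_indices(s_idx, t_idx))
    ft = len(free_indices(t_idx, s_idx))
    if fs == 0 and ft == 0:
        return 1          # neither operand has a free index
    if fs == 0 or ft == 0:
        return 2          # exactly one operand has free indices
    return 3              # both operands have free indices
```

For instance, `contraction_class("ijb", "icjd")` classifies the contraction $S_{ijb} T_{icjd}$, where i and j are contracted.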

Slide 9 / 18: Nice and easy... variants
$V_{bcd} := S_{ijb} T_{icjd}$
[Figure: flops/cycle of the GEMM-, GER-, GEMV-, and DOT-based variants; x-axis: b = c = d.]

Slide 10 / 18: Small dimensions...
[Figure, two panels: flops/cycle as a function of m = n = k for GEMM, GER, GEMV (two variants), and DOT (two variants).]

Slide 11 / 18
Goal: automatic selection of the best variants
Idea: performance prediction
Approach:
- Kernels execution
- Algorithms execution
Challenges:
- Fluctuations, uncertainties
- Cache influence
Solution:
- Performance models
- Context-aware timings

Slide 12 / 18: Fluctuations: performance models
[Figure: relative fluctuation $\sigma / \bar{x}$ as a function of m = n = k.]
[Figure: map over m and n, with a color scale in percent.]
Reference: Performance Modeling for Dense Linear Algebra, E. Peise, P.B., PMBS12 (SC12).

Slide 13 / 18: Models... Timings?
Observation:
- typical linear algebra algorithms: shrinking active region
- tensor contractions: identical-size slices
$V_a := S_{aij} T_{ij}$
[Figure, two panels: flops/cycle of the GEMV-based (two) and DOT-based (four) algorithms; x-axis: a = i = j.]
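Because a contraction algorithm invokes the same kernel on identically sized slices, one rough way to predict its cost is to time a single invocation and multiply by the number of invocations. The sketch below assumes exactly that; the function names and the cycle counts in the usage note are made up for illustration, and cache effects are ignored.

```python
def predict_cycles(cycles_per_call, calls):
    """Predicted cost when every kernel call works on a same-size slice."""
    return cycles_per_call * calls


def select_variant(variants):
    """Pick the variant with the lowest predicted cost.

    variants maps a name to (cycles per kernel call, number of calls).
    """
    return min(variants, key=lambda name: predict_cycles(*variants[name]))
```

With hypothetical inputs such as `{"gemv": (1000, 50), "dot": (30, 2500)}`, the GEMV-based variant is predicted at 50,000 cycles and the DOT-based one at 75,000, so `select_variant` returns `"gemv"`.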

Slide 14 / 18: Influence of caching (1/2)
for i = 1, ...: c_i := gemv(a_i, b)
Idea: cache setup
[Figure: #cycles per invocation of GEMV; series: measured, independent timing, cache-aware timing.]

Slide 15 / 18: Influence of caching (2/2)
for i = 1, ...: for j = 1, ...: A_i := GER(a_i, b_j)
Idea: first iteration
[Figure: #cycles per invocation of GER; series: measured, cache-aware timing, loop-aware timing.]
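One possible reading of the "first iteration" idea is that the first pass of the inner loop is timed separately from the later passes, since only the first has to bring the updated operand into cache. The sketch below encodes that assumption as simple arithmetic; the interpretation, the function name, and the numbers in the usage note are all hypothetical and not taken from the talk.

```python
def loop_aware_cycles(first_iter, later_iter, inner_count, outer_count):
    """Predicted cost of a doubly nested loop of kernel calls.

    first_iter:  cycles of the first inner iteration (operand not yet cached)
    later_iter:  cycles of each subsequent inner iteration
    """
    per_outer = first_iter + (inner_count - 1) * later_iter
    return outer_count * per_outer
```

For example, with 900 cycles for the first iteration, 400 for later ones, 10 inner and 5 outer iterations, the estimate is 22,500 cycles, compared with 45,000 if every call were charged the first-iteration cost.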

Slide 16 / 18: (Nice) results
$V_a := S_{aij} T_{ij}$
[Figure, two panels: flops/cycle of the GEMV-based (two) and DOT-based (four) algorithms; x-axis: a = i = j.]

Slide 17 / 18: (So so) results
$V_{bcd} := S_{ijb} T_{icjd}$
[Figure, two panels: flops/cycle; x-axis: b = c = d.]

Slide 18 / 18: (Awesome) results
$V_{bcd} := S_{ijb} T_{icjd}$
[Figure, two panels: flops/cycle; x-axis: b = c = d.]

Slide 18 / 18: Conclusions
Tensor contractions vs. matrix operations:
- Algorithmic space: LARGE!
- Need for BLAS4? Maybe.
- Automation? Yes, please!
[Figure, two panels: flops/cycle; x-axis: b = c = d.]
