GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation


1 GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation

Konrad Trifunovic (2), Albert Cohen (2), David Edelsohn (3), Li Feng (6), Tobias Grosser (5), Harsha Jagasia (1), Razya Ladelsky (4), Sebastian Pop (1), Jan Sjödin (1), Ramakrishna Upadrasta (2)

(1) Open Source Compiler Engineering, AMD, Austin, Texas, USA; (2) INRIA Saclay Île-de-France and LRI, Paris-Sud 11 University, Orsay, France; (3) IBM T. J. Watson Research, Yorktown Heights, USA; (4) IBM Haifa Research, Haifa, Israel; (5) University of Passau, Passau, Germany; (6) Xi'an Jiaotong University, Xi'an, China

January 30, 2010, GROW Workshop, Pisa, Italy

5 1. Motivation: Keeping up a sustained performance increase

Multi-level parallelism:
- Instruction-level parallelism, ILP (instruction scheduling)
- Data-level parallelism (vectorization)
- Thread-level parallelism (automatic parallelization)

Memory hierarchy:
- Caches
- Registers
- Scratchpad memories

Hence the need for complex program (loop) optimizations.

8 2. Why the polyhedral model in GCC?

Source-to-source compilers:
- Syntax based
- Output source code might lose semantic information
- Need for source code normalization

Low-level internal polyhedral representation:
- Semantics based: SSA GIMPLE form, scalar evolution analysis (inductions, reductions)
- Leverages the > 100 optimization passes of GCC
- Tight interaction with the vectorizer, the parallelizer and memory layout optimizations

9 3. Compilation workflow

Front ends (C, C++, F95) produce GENERIC, which is lowered to GIMPLE and then to GIMPLE + CFG + SSA + LOOP form, on which the GRAPHITE pass runs; afterwards the code flows through GIMPLE (SSA, CFG) to RTL and finally to assembly for x86, PPC or SPU.

Inside the GRAPHITE pass (operating on GIMPLE, SSA, CFG): SCoP detection finds SCoPs, GPOLY construction builds the polyhedral representation, a legality check validates transformations, the transformations produce a transformed GPOLY, and GLOOG (CLooG based) regenerates GIMPLE.

11 4. Polyhedral model: Iteration domains (GPOLY)

    for (v = 0; v < N; v++)
      for (h = 0; h < N; h++)
        out[v][h] = 0;

Iteration domain of the statement: D_S = {(v, h) | 0 ≤ v ≤ N−1 ∧ 0 ≤ h ≤ N−1}

(Figure: the square 2D iteration space over axes v and h, bounded by 0 ≤ v < N and 0 ≤ h < N.)

The domain is stored as a system of affine inequalities over the vector (v, h, N, 1), encoding the four constraints v ≥ 0, v ≤ N−1, h ≥ 0, h ≤ N−1.
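As a small illustration (not part of the original slides), the sketch below enumerates the integer points of D_S directly from an inequality system and checks that they match the iterations executed by the loop nest. The row layout of the constraint matrix (coefficients of v, h, N and the constant) is my own assumption; it simply mirrors the four bounds listed above.

    /* Hypothetical sketch: enumerate D_S = {(v,h) | 0 <= v,h <= N-1} from its
       inequality rows and compare against the loop-nest iterations. */
    #include <stdio.h>

    #define N 4
    #define NROWS 4

    /* Each row encodes c_v*v + c_h*h + c_N*N + c_1 >= 0 (assumed layout). */
    static const int A[NROWS][4] = {
        { 1,  0, 0,  0},  /* v >= 0         */
        {-1,  0, 1, -1},  /* N - 1 - v >= 0 */
        { 0,  1, 0,  0},  /* h >= 0         */
        { 0, -1, 1, -1},  /* N - 1 - h >= 0 */
    };

    static int in_domain(int v, int h) {
        for (int r = 0; r < NROWS; r++)
            if (A[r][0]*v + A[r][1]*h + A[r][2]*N + A[r][3] < 0)
                return 0;
        return 1;
    }

    int main(void) {
        int points = 0, iterations = 0;
        /* Count integer points of the polyhedron... */
        for (int v = -1; v <= N; v++)
            for (int h = -1; h <= N; h++)
                points += in_domain(v, h);
        /* ...and the iterations actually executed by the loop nest. */
        for (int v = 0; v < N; v++)
            for (int h = 0; h < N; h++)
                iterations++;
        printf("points = %d, iterations = %d\n", points, iterations);
        return 0;
    }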

12 4. Polyhedral model: Data accesses

Data accesses map iterations to memory: f(i, g) = F (i, g, 1)^T

(Figure: the iteration domain of the statement over axes t1 and t2, next to the linearized memory layout of the array out, with elements out[1][1], out[1][2], ..., out[3][3] laid out row by row; the access function sends each iteration (v, h) to the memory location it touches.)
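To make the idea concrete, here is a small sketch of my own (not from the slides) that maps an iteration (v, h) to the row-major linear offset of out[v][h] and checks it against the address the compiler itself computes, i.e., the linearized memory layout from the figure. In GPOLY the access function records the per-dimension subscripts; the linear offset is what the array layout maps them to.

    /* Hypothetical sketch: row-major linearization of the access out[v][h],
       assuming out is declared as out[N][N]. */
    #include <stdio.h>

    #define N 3

    /* Linear offset of element (v, h) in a row-major N x N array. */
    static int linear_offset(int v, int h) {
        return v * N + h;
    }

    int main(void) {
        int out[N][N];
        int ok = 1;
        for (int v = 0; v < N; v++)
            for (int h = 0; h < N; h++) {
                /* Compare the affine offset with the actual element address. */
                int *lin = &out[0][0] + linear_offset(v, h);
                if (lin != &out[v][h])
                    ok = 0;
            }
        printf("linearization matches layout: %s\n", ok ? "yes" : "no");
        return 0;
    }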


21 4. Polyhedral model: Scheduling

Scheduling defines the execution order: t = θ_S(i) = Θ_S (i, g, 1)^T, where i = (v, h) and g = (N).

Original order, Θ_S = [1 0 0 0; 0 1 0 0], i.e. t = (v, h):

    for (v = 0; v < N; v++)
      for (h = 0; h < N; h++)
        out[v][h] = 0;

Interchanged order, Θ_S = [0 1 0 0; 1 0 0 0], i.e. t = (h, v):

    for (t1 = 0; t1 < N; t1++)
      for (t2 = 0; t2 < N; t2++)
        out[t2][t1] = 0;
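The following standalone sketch is my addition: it applies both Θ_S matrices above to every iteration of the domain and prints the resulting time stamps; the lexicographic order of the time stamps is the execution order, so the second schedule reproduces the interchanged loop nest.

    /* Hypothetical sketch: t = Theta_S (v, h, N, 1)^T for the identity and
       the interchange schedules. */
    #include <stdio.h>

    #define N 3

    static void apply(const int Theta[2][4], int v, int h, int t[2]) {
        const int x[4] = { v, h, N, 1 };
        for (int r = 0; r < 2; r++) {
            t[r] = 0;
            for (int c = 0; c < 4; c++)
                t[r] += Theta[r][c] * x[c];
        }
    }

    int main(void) {
        const int original[2][4]    = { {1, 0, 0, 0}, {0, 1, 0, 0} };  /* t = (v, h) */
        const int interchange[2][4] = { {0, 1, 0, 0}, {1, 0, 0, 0} };  /* t = (h, v) */
        for (int v = 0; v < N; v++)
            for (int h = 0; h < N; h++) {
                int t0[2], t1[2];
                apply(original, v, h, t0);
                apply(interchange, v, h, t1);
                printf("(v=%d,h=%d)  original t=(%d,%d)  interchange t=(%d,%d)\n",
                       v, h, t0[0], t0[1], t1[0], t1[1]);
            }
        return 0;
    }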

22 5. SSA-based polyhedral model

MVT kernel:

    for (i = 0; i < N; i++) {
      b[i] = 0;
      for (j = 0; j < N; j++)
        b[i] += A[i][j] * x[j];
    }

GIMPLE SSA form:

    bb 3:
      i_21 = PHI <i_11(7), 0(2)>
      b[i_21] = 0.0;
      b_i_lsm.5_16 = b[i_21];
    bb 4:
      j_22 = PHI <j_10(5), 0(3)>
      pre.3_28 = PHI <D.3_9(5), 0.0(3)>
      D.0_6 = A[i_21][j_22];
      D.1_7 = x[j_22];
      D.2_8 = D.1_7 * D.0_6;
      D.3_9 = D.2_8 + pre.3_28;
      b_i_lsm.5_5 = D.3_9;
      j_10 = j_22 + 1;
      if (j_10 < N) goto <bb 5>; else goto <bb 6>;
    bb 5:
      goto <bb 4>;
    bb 6:
      b_i_lsm.5_30 = PHI <b_i_lsm.5_5(4)>
      b[i_21] = b_i_lsm.5_30;
      i_11 = i_21 + 1;
      if (i_11 < N) goto <bb 7>; else goto <bb 8>;
    bb 7:
      goto <bb 3>;

Polyhedral representation:

    Domains:   D_bb3 = {(i) | 0 ≤ i ≤ N−1}
               D_bb4 = {(i, j) | 0 ≤ i ≤ N−1 ∧ 0 ≤ j ≤ N−1}
               D_bb6 = {(i) | 0 ≤ i ≤ N−1}
    Accesses:  F_dr1 = {(i, a, s1) | a = 0 ∧ s1 = i ∧ 0 ≤ s1 ≤ N−1}
               F_dr2 = {(i, j, a, s1) | a = 1 ∧ s1 = j ∧ 0 ≤ s1 ≤ N−1}
               F_dr4 = {(i, a, s1) | a = 0 ∧ s1 = i ∧ 0 ≤ s1 ≤ N−1}
    Schedules: θ_bb3 = {(i, t1, t2, t3) | t1 = 0 ∧ t2 = i ∧ t3 = 0}
               θ_bb4 = {(i, j, t1, t2, t3, t4, t5) | t1 = 0 ∧ t2 = i ∧ t3 = 1 ∧ t4 = j ∧ t5 = 0}
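As an added illustration (not on the slide), the scattering functions above already encode that b[i] = 0 runs before every inner-loop iteration with the same i: for a fixed i, θ_bb3 yields (0, i, 0) while θ_bb4 yields (0, i, 1, j, 0), and the third component orders them. A minimal sketch comparing the (zero-padded) time stamps lexicographically:

    /* Hypothetical sketch: compare the time stamps theta_bb3(i) = (0, i, 0)
       and theta_bb4(i, j) = (0, i, 1, j, 0). */
    #include <stdio.h>

    #define DIMS 5

    /* Lexicographic comparison; shorter vectors are padded with zeros. */
    static int lex_before(const int *a, int na, const int *b, int nb) {
        for (int k = 0; k < DIMS; k++) {
            int av = k < na ? a[k] : 0;
            int bv = k < nb ? b[k] : 0;
            if (av != bv)
                return av < bv;
        }
        return 0;
    }

    int main(void) {
        int N = 4, ok = 1;
        for (int i = 0; i < N; i++) {
            int t_bb3[3] = { 0, i, 0 };
            for (int j = 0; j < N; j++) {
                int t_bb4[5] = { 0, i, 1, j, 0 };
                if (!lex_before(t_bb3, 3, t_bb4, 5))
                    ok = 0;
            }
        }
        printf("b[i] = 0 scheduled before all (i, j) iterations: %s\n",
               ok ? "yes" : "no");
        return 0;
    }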

26 6. Research: Cost modelling for vectorization

Scalar convolution kernel:

    for (v = 0; v < N; v++)
      for (h = 0; h < N; h++) {
        s = 0;
        for (i = 0; i < K; i++)
          for (j = 0; j < K; j++)
            s += img[v+i][h+j] * filter[i][j];
        out[v][h] = s;
      }

Vectorized version (array-slice notation):

    for (v = 0; v < N; v++)
      for (h = 0; h < N; h++) {
        s = 0;
        for (i = 0; i < K; i++) {
          vs[0:3] = {0, 0, 0, 0};
          for (j = 0; j < K; j += 4)
            vs[0:3] += img[v+i][h+j : h+j+3] * filter[i][j : j+3];
          s += sum(vs[0:3]);
        }
        out[v][h] = s;
      }

Reduction costs: the sum operation reducing vector vs into scalar s executes N^2 * K times.
Benefits: VF = 4 scalar operations are replaced by 1 vector operation.

[Trifunovic et al. 2009]
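A back-of-the-envelope version of such a cost comparison might look like the sketch below. The cost constants and the exact formula are assumptions of mine; only the structure follows the slide and the loop nests above: N^2 * K^2 scalar multiply-adds versus N^2 * K * ceil(K/VF) vector operations plus N^2 * K reductions.

    /* Hypothetical cost-model sketch: estimated scalar vs. vectorized cost
       of the convolution kernel. Cost constants are made-up assumptions. */
    #include <stdio.h>

    #define VF 4  /* vectorization factor */

    static long scalar_cost(long n, long k, long c_scalar_op) {
        /* N^2 * K^2 scalar multiply-accumulates. */
        return n * n * k * k * c_scalar_op;
    }

    static long vector_cost(long n, long k, long c_vector_op, long c_reduction) {
        long vec_ops    = n * n * k * ((k + VF - 1) / VF);  /* vector multiply-accumulates */
        long reductions = n * n * k;                        /* sum(vs) runs N^2 * K times */
        return vec_ops * c_vector_op + reductions * c_reduction;
    }

    int main(void) {
        long n = 1024, k = 8;
        long sc = scalar_cost(n, k, 1);
        long vc = vector_cost(n, k, 1, 4);  /* assume a reduction costs 4 units */
        printf("scalar = %ld, vector = %ld -> %s\n",
               sc, vc, vc < sc ? "vectorize" : "keep scalar");
        return 0;
    }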

28 6. Research: Cost modelling for vectorization

Where vectorized loop selection happens in the pipeline:
- Front end
- Middle end (GIMPLE SSA):
  - Loop nest optimization: GRAPHITE pass with the loop-nest-level analytical model
  - Vectorization pass: vectorization API with the instruction-level model
- Back end (RTL)

29 6. Research: Automatic parallelization (autopar)

(a) Sequential loop nest; the dependence x[h][v-1] -> x[h][v] is carried by the inner v loop, so the outer h loop is parallel:

    parloop() {
      for (h = 0; h < N; h++)
        for (v = 1; v < N; v++)
          x[h][v] = x[h][v-1] + 1;
    }

(b) Code generated by autopar: the h loop is outlined and distributed over 4 threads through the GOMP runtime:

    parloop() {
      .paral_data.x = &x;
      builtin_gomp_parallel_start(parloop._loopfn, &.paral_data, 4);
      parloop._loopfn(&.paral_data);
      builtin_gomp_parallel_end();
    }

    parloop._loopfn(.paral_data) {
      for (h = start; h < end; h++)
        for (v = 1; v < N; v++)
          (*.paral_data->x)[h][v] = x[h][v-1] + 1;
    }
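For comparison (my addition, not what GRAPHITE emits), the same parallelization can be written by hand with OpenMP; an "omp parallel for" on the h loop is roughly what the outlined function and GOMP calls in (b) correspond to inside GCC.

    /* Hand-written OpenMP equivalent of the loop that autopar parallelizes:
       the h loop carries no dependence, so its iterations run in parallel. */
    #include <stdio.h>
    #include <omp.h>

    #define N 64

    static int x[N][N];

    int main(void) {
        #pragma omp parallel for num_threads(4)
        for (int h = 0; h < N; h++)
            for (int v = 1; v < N; v++)
                x[h][v] = x[h][v-1] + 1;

        printf("x[N-1][N-1] = %d\n", x[N-1][N-1]);  /* expect N-1 */
        return 0;
    }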

30 7. Alias Analysis and GRAPHITE: Encoding aliasing information

Representation of alias sets in GRAPHITE:
- Dependence analysis requires alias information
- Alias sets are encoded as an extra dimension of the access functions

Example with memory objects A_1 and A_2:

    int a[10], b[10];
    void foo(int *p);

Points-to mapping: a -> {A_1}, p -> {A_1, A_2}, b -> {A_2}

Finding a minimal alias-set numbering is equivalent to solving Minimum Edge Clique Cover (ECC), an NP-complete problem.

31 7. Alias Analysis and GRAPHITE: Empirical analysis on alias graphs (4481 graphs)

- Only 11 graphs are interesting, with up to 90 vertices
- In all the others, every connected component is a clique!

(Figure: panels (i) and (ii) of an alias graph from H.264, with vertex groups {1,5}, {2,8}, {3,7}, {4,6} and {9, 10, ..., 16}.)

Future work: a faster algorithm exploiting modular decomposition properties. Currently the fastest is an O(|V| |E|) algorithm ([Gramm et al. 2009], in Haskell, using Patricia trees, which does not seem simple to implement).
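The empirical observation above is cheap to verify: the sketch below (an illustration I wrote, not GRAPHITE code) checks whether every connected component of an undirected alias graph is a clique, which is exactly the case in which the edge clique cover is trivial, one clique per non-trivial component.

    /* Hypothetical sketch: check that every connected component of an
       undirected graph (adjacency matrix) is a clique. */
    #include <stdio.h>
    #include <string.h>

    #define MAXV 16

    static int nv;
    static int adj[MAXV][MAXV];
    static int comp[MAXV];

    static void dfs(int u, int c) {
        comp[u] = c;
        for (int w = 0; w < nv; w++)
            if (adj[u][w] && comp[w] < 0)
                dfs(w, c);
    }

    static int components_are_cliques(void) {
        memset(comp, -1, sizeof comp);
        int c = 0;
        for (int u = 0; u < nv; u++)
            if (comp[u] < 0)
                dfs(u, c++);
        /* Two distinct vertices in the same component must be adjacent. */
        for (int u = 0; u < nv; u++)
            for (int w = u + 1; w < nv; w++)
                if (comp[u] == comp[w] && !adj[u][w])
                    return 0;
        return 1;
    }

    int main(void) {
        nv = 4;  /* toy graph: {0,1,2} form a triangle, vertex 3 is isolated */
        adj[0][1] = adj[1][0] = 1;
        adj[1][2] = adj[2][1] = 1;
        adj[0][2] = adj[2][0] = 1;
        printf("every component is a clique: %s\n",
               components_are_cliques() ? "yes" : "no");
        return 0;
    }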

32 8. Development

Libraries used:
- PPL, the Parma Polyhedra Library
- CLooG, the Chunky Loop Generator

(Figure: number of commits per year.)

Weekly phone calls: every Wednesday, Pisa time, sip: @iptel.org

33 9. Bibliography

[Gramm et al. 2009] J. Gramm, J. Guo, F. Hüffner and R. Niedermeier. Data reduction and exact algorithms for clique cover. J. Exp. Algorithmics, 14, 2009.

[Trifunovic et al. 2009] K. Trifunovic, D. Nuzman, A. Cohen, A. Zaks and I. Rosen. Polyhedral-Model Guided Loop-Nest Auto-Vectorization. In Parallel Architectures and Compilation Techniques (PACT'09), Raleigh, North Carolina, September 2009.

34 10. Questions

Thank you for your attention. Questions?
