ERLANGEN REGIONAL COMPUTING CENTER

Size: px

Start display at page:

Download "ERLANGEN REGIONAL COMPUTING CENTER"

Laurel Garrett
5 years ago
Views:

1 ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

2 Agenda Performance Engineering Analytic performance models A simple example: Sparse matrix-vector multiplication Another example: Sparse matrix-transpose vector multiplication An advanced chip-level model: ECM Yet another example: Composite modeling of a (P)CG solver 2

3 Performance Engineering Performance Engineering (PE): a structured process based on analytic (white-/gray-box) models to optimize/parallelize codes Basic questions addressed by analytic performance models What is the bottleneck? optimization technique What is the next bottleneck? performance potential of the optimization Am I done? What about other hardware? Impact of processor frequency and socket scalability Appropriate execution parameters, energy-optimized operating point Engineering for Performance in High Performance Computing (Bill Gropp; PASC 2015): 3

Motivation for analytic modeling Advantages of analytic models Identification of universality Identification of governing mechanisms Insight via model nature Insight via model

4 Motivation for analytic modeling Advantages of analytic models Identification of universality Identification of governing mechanisms Insight via model nature Insight via model failure Performance models Microarchitecture analysis: Determine bottlenecks and influencing factors Design space exploration: What would happen if resource X were improved? 4

5 NODE LEVEL PERFORMANCE MODEL FOR SPARSE MATRIX (TRANSPOSE) VECTOR MULTIPLY = +

6 SpMV Roofline Model (Comp. Intensity) Sparse MVM in double precision with CRS data storage: (N nzr : avg. non-zeros per row) Double precision computational intensity α quantifies traffic for loading RHS (B) α = 0 RHS is in cache α = 1/N nzr RHS loaded once α = 1 no cache I DP CRS = α > 1 Houston, we have a problem! Expected performance = b S x I CRS (Roofline model) 2 flops α + 20/N nzr byte See W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith, Towards realistic performance bounds for implicit CFD codes, in Proceedingsof Parallel CFD99. Elsevier, 1999, pp G. Schubert et al. Parallel Processing Letters 21(3), (2011). DOI: /S M. Kreutzer et al.: SIAM Journal on Scientific Computing 36(5), C401 C423 (2014). DOI: / , 7

7 SpMV Quantifying RHS impact (a) I DP CRS = 2 flops α + 20/N nzr byte = N nz 2 flops V meas V meas is measured overall memory data traffic (using, e.g., likwid) α = 1 4 V meas N nz 2 bytes 6 10 N nzr Example: kkt_power matrix from the UoF collection Intel Xeon Sandy Bridge (8c; 20 MB L3) N nz = , N nzr = 7.1 V meas 258 MB α = 0.36, αn nzr = 2.5 RHS is loaded 2.5 times from memory but: I DP CRS (1/N nzr ) DP (α) I CRS = % extra traffic optimization potential! 8

SpMV understanding performance Intel Xeon E5-2680 (8c@2.7 GHz; 20 MB L3) Assumption: α = 1 N nzr P = I DP CRS b In-cache Good agreement negligible RHS impact (min.

8 SpMV understanding performance Intel Xeon E GHz; 20 MB L3) Assumption: α = 1 N nzr P = I DP CRS b In-cache Good agreement negligible RHS impact (min. α) READ COPY Skinny matrices with large RHS/LHS volume 4 corner case matrices from UoF collection Williams collection M. Kreutzer et al.: A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units. SIAM Journal on Scientific Computing :5, C401-C423 9

9 SpMV understanding performance Socket scaling more recent architectures PWTK matrix (N nzr = 53 & α = 1 N nzr I DP CRS = 2 F P = I 12.5 B CRS DP b) b= 68 GB/s b= 46 GB/s Model covers saturation regime full chip! 10

10 SpMTransposeV understanding performance do i = 1,N r do j = row_ptr(i), row_ptr(i+1) - 1 C(col_idx(j)) = C(col_idx(j)) + val(j) * B(i) enddo enddo I DP CRS = 2 flops α + 12/N nzr byte Assume α = 1 N nzr Intel MKL b= 68 GB/s b= 68 GB/s Does the model fail? Parallelization?! (write conflicts!) 11

11 SpMTV understanding performance do i = 1,N r do j = row_ptr(i), row_ptr(i+1) - 1 C(col_idx(j)) = C(col_idx(j)) + val(j) * B(i) enddo enddo I DP CRS = 2 flops α + 12/N nzr byte b= 68 GB/s b= 68 GB/s New Parallelization Approach: Recursive Algebaric Coloring Engine (RACE) by C.L. Alappatt (FAU) Publication in preparation 12

12 THE ECM PERFORMANCE MODEL A quick walk-through

13 Cache architecture & capabilities ECM model components: Data transfer times Optimistic transfer times through mem hierarchy T i = V i b i Transfer time notation for some given loop kernel: L1 b L1L2 L2 T L1L2 T L2L3 T L3Mem = Τ cy 8 iter Input: Cache properties Application data transfer prediction b L2L3 L3 b L3Mem Memory b L2Mem 14

14 Core machine model ECM model components: In-core execution cy LD ST ADD MUL... 1 cy LD LD 4 cy cy MUL 5 cy ADD 3 cy ADD 3 cy ST Best case: max throughput Worst case: critical path T min core = max T nol, T OL T max core = T CP T nol interacts with cache hierarchy, T OL does not 15

15 ECM model components: Overlap assumptions Notation for model contributions T OL T nol T L1L2 T L2L3 T L3Mem = Τ cy 8 iter Most pessimistic overlap model: no overlap T Mem ECM = max T OL, T nol + T L1L2 + T L2L3 + T L3Mem for in-mem data t [cy] T nol T L1L2 T L2L3 T L3Mem T OL Appropriate for most Intel Xeon CPUs up to and including Broadwell 16

16 ECM model: Notation for runtime predictions {T ECM L1 T ECM L2 T ECM L3 T ECM Mem } Example: no-overlap model {max(t OL, T nol ) max(t OL, T nol + T L1L2 ) max(t OL, T nol + T L1L2 + T L2L3 ) max(t OL, T nol + T L1L2 + T L2L3 + T L3Mem )} L1 L2 L3 Memory data in 17

17 ECM model: (Naive) saturation assuption Performance is assumed to scale across cores until a shared bandwidth bottleneck is hit T ECM n = max T ECM Mem n, T L3Mem n S = T ECM Roofline bandwidth ceiling Mem T L3Mem This is (sometimes) too optimistic near the saturation point. For improvements see J. Hofmann, G. Hager, and D. Fey: On the accuracy and usefulness of analytic energy models for contemporary multicore processors. Proc. ISC High Performance DOI: / _2 18

18 MODELING A CONJUGATE-GRADIENT SOLVER Building a model from components

19 A matrix-free CG solver 2D 5-pt FD Poisson problem Dirichlet BCs, matrix-free N x x N y = grid CPU: Haswell E5-2695v3 CoD mode 20

20 Machine characteristics 2.3 GHz (fixed core & Uncore) AVX2, 2 x FMA per cycle 2 load & 1 store per cycle (in practice: 1+1) Cache characteristics Inclusive, non-overlapping b L1L2 = 43 Byte/cy (gray) b L2L3 = 32 Byte/cy (white) Memory bandwidth per ccnuma domain (saturated) b read = 32.3 GByte/s = 11.3 Byte/cy b copy = 26.1 GByte/s = 14.0 Byte/cy 21

21 ECM model composition Naive implementation of all kernels (omp parallel for) while(α 0 < tol): T x [cy/8 iter] ECM T Mem [cy/8 iter] n s [cores] Full domain limit [cy/8 iter] Ԧv = A Ԧp { } λ = α 0 / Ԧv, Ԧp { } Ԧx = Ԧx + λ Ԧp { } Ԧr = Ԧr λ Ԧv { } α 1 = Ԧr, Ԧr { } Ԧp = Ԧr + α 1 α 0 Ԧp, α 0 = α 1 { } Sum

22 MLUP/s CG performance 1 core to full socket Multi-loop code well represented Single core performance predicted with 5% error Saturated performance predicted with < 0.5% error Saturation point predicted approximately Can be fixed by improved ECM model # cores 23

23 Overhead [cy] Barrier cost Does the OpenMP barrier impact the performance? At which problem size can this be expected? Overhead (1 NUMA LD): Intel gcc x 2000 cy Time: 10 x 4x10 8 cy No problem for any out of cache data set 24

24 CG with GS preconditioner: Naïve parallelization Pipeline parallel processing: OpenMP barrier after each wavefront update (ugh!) k T0 T1 T2 T3 T4 i 25

25 CG with GS preconditioner: additional kernels Intel IACA T x [cy/8 iter] ECM T Mem [cy/8 iter] n s [cores] Full domain limit [cy/8 iter] Non-PC model Ԧz = P Ԧr (fw) { } Ԧz = P Ԧr (bw) { } α = Ԧr, Ԧz { } Sum Back substitution does not saturate the memory bandwidth! full algorithm does not fully saturate Impact of barrier still negligible overall, but noticeable in the preconditioner 26

26 MLUP/s PCG measurement <2% model error for single threaded and saturated performance Expected large impact of barrier at smaller problem sizes in x direction # cores 27

27 Conclusions Analytic modeling is worth the effort Even if it s inaccurate Even Roofline can yield amazing insights Analytic modeling is not just for simple kernels is composable can cover programming model overhead, too and yes, it is real work. Automatic Roofline/ECM modeling tool No, it does not always work. 28

More Science per Joule: Bottleneck Computing

More Science per Joule: Bottleneck Computing Georg Hager Erlangen Regional Computing Center (RRZE) University of Erlangen-Nuremberg Germany PPAM 2013 September 9, 2013 Warsaw, Poland Motivation (1): Scalability