A simple Concept for the Performance Analysis of Cluster-Computing

1 A simple Concept for the Performance Analysis of Cluster-Computing
  H. Kredel (1), S. Richling (2), J.P. Kruse (3), E. Strohmaier (4), H.G. Kruse (1)
  (1) IT-Center, University of Mannheim, Germany
  (2) IT-Center, University of Heidelberg, Germany
  (3) Institute of Geosciences, Goethe University Frankfurt, Germany
  (4) Future Technology Group, LBNL, Berkeley, USA
  ISC'13, Leipzig, 18 June 2013

2 Outline
  - Introduction
  - Performance Model
  - Applications
    - Scalar-Product of Vectors
    - Matrix Multiplication
    - Linpack
    - TOP500
  - Conclusions

3 Introduction
  Motivation
  - Sophisticated mathematical models for performance analysis cannot keep up with rapid hardware development.
  - There is a lack of reliable rules of thumb for estimating the size and performance of clusters.
  Goals
  - Develop a simple and transparent model.
  - Restrict the model to a few parameters describing hardware and software.
  - Use speed-up as a dimensionless metric.
  - Find the optimal size of a cluster for a given application.
  - Validate the results by modeling standard kernels.

4 Related Work
  - Roofline model for multi-cores (Williams et al. 2009)
  - Performance models by Hockney:
    - Model with few hardware and software parameters, focused on benchmark runtimes and performance (Hockney 1987, Hockney & Jesshope 1988)
    - Model based on similarities to fluid dynamics (Hockney 1995)
  - Performance models by Numrich:
    - Based on Newton's classical mechanics (Numrich 2007)
    - Based on dimensional analysis (Numrich 2008)
    - Based on the Pi theorem (Numrich 2010)
  - Linpack performance model (Luszczek & Dongarra 2011)
  - Performance model based on a stochastic approach (Kruse 2009, Kredel et al. 2010)
  - Performance model for interconnected clusters (Kredel et al. 2012)

5 Model Parameters
  Hardware parameters
  - $p$: number of processing units (PUs)
  - $l_k^{peak}$, $k = 1, \dots, p$: theoretical peak performance of each PU
  - $b_c$: bandwidth of the network
  Software parameters
  - $\#op$: total number of arithmetic operations
  - $\#b$: total number of bytes involved
  - $\#x$: total number of bytes communicated between the PUs

6 Distribution of the work load (#op, #b): homogeneous case
  - Distribution of operations: $\#op$ is split into $o_1, o_2, \dots, o_p$ with $o_k = \#op / p$ (or $\omega_k = 1/p$)
  - Distribution of data: $\#b$ is split into $d_1, d_2, \dots, d_p$ with $d_k = \#b / p$ (or $\delta_k = 1/p$)

7 Distribution of the work load (#op, #b): heterogeneous case
  Additional parameters $(\omega_k, \delta_k)$
  - Distribution of operations: $\#op$ is split into $o_1, o_2, \dots, o_p$ with $o_k = \omega_k \, \#op$ and $\sum_{k=1}^{p} \omega_k = 1$
  - Distribution of data: $\#b$ is split into $d_1, d_2, \dots, d_p$ with $d_k = \delta_k \, \#b$ and $\sum_{k=1}^{p} \delta_k = 1$

8 Performance Indicators
  Primary performance measure
  - $t$: total time to process the work load $(\#op, \#b)$
  Derived performance measures
  - Performance: $l(p) = \#op / t$
  - Speed-up (dimensionless): $S = l(p) / l(1)$
  Goal: speed-up as a function of
  - total work load $(\#op, \#b)$ [Flop, Byte]
  - work distribution $(\omega_k, \delta_k)$
  - communication requirements $\#x$ [Byte]
  - hardware parameters $(p, l_k^{peak}, b_c)$ [-, Flop/s, Byte/s]
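
Because $l = \#op / t$ and the work load is the same for both runs, the speed-up can be obtained directly from run-times as $S = t(1) / t(p)$. The following is a minimal Python sketch of that calculation, assuming that repeated timing runs are averaged before the ratio is taken. The function name and the sample timings are hypothetical and are not measured data.

```python
from statistics import mean

def speedup_from_runtimes(times_single, times_parallel):
    """Speed-up S = l(p)/l(1) = t(1)/t(p) for a fixed work load #op."""
    if not times_single or not times_parallel:
        raise ValueError("need at least one run-time per configuration")
    return mean(times_single) / mean(times_parallel)

# Hypothetical run-times in seconds (illustration only, not measurements).
t_p1 = [8.0, 8.2, 7.8]   # runs on p = 1 PU
t_p4 = [2.4, 2.6, 2.5]   # runs on p = 4 PUs
print(speedup_from_runtimes(t_p1, t_p4))
```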

9 Total execution time
  - Computation time: $t_r = \max\{ t_1(o_1, d_1), \dots, t_p(o_p, d_p) \} \ge o_k / l_k$
  - Communication time: $t_c \approx \#x / b_c$
  - Total execution time: $t \approx t_r + t_c$, so that $t \ge o_k / l_k^{peak} + \#x / b_c \ge o_k / l_k^{peak}$

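10 Total execution time
  $t \ge \frac{\omega_k \, \#op}{l_k^{peak}} + \frac{\#x}{b_c} = \frac{\omega_k \, \#op}{l_k^{peak}} \left( 1 + \frac{l_k^{peak}}{b_c} \cdot \frac{\#b}{\omega_k \, \#op} \cdot \frac{\#x}{\#b} \right)$
  $t \ge \frac{\omega_k \, \#op}{l_k^{peak}} \left( 1 + \frac{1}{x_k} \right)$
  One dimensionless parameter for hardware + software: $x_k = \omega_k \, \frac{a}{a_k} \, r$
  - $a = \#op / \#b$: computational intensity of the software [Flop/Byte]
  - $a_k = l_k^{peak} / b_c$: computational intensity of the hardware [Flop/Byte]
  - $r = \#b / \#x$: inverse communication intensity [-]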
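
The factoring step on this slide can be checked numerically: the two-term estimate $\omega_k \#op / l_k^{peak} + \#x / b_c$ and the factored form $(\omega_k \#op / l_k^{peak})(1 + 1/x_k)$ should agree when $x_k = \omega_k (a / a_k) r$. The Python sketch below treats the bound as an equality; the function names and the work-load numbers are hypothetical.

```python
def intensity_parameter(omega, n_op, n_b, n_x, l_peak, b_c):
    """Dimensionless parameter x_k = omega_k * (a / a_k) * r."""
    a = n_op / n_b        # computational intensity of the software [Flop/Byte]
    a_k = l_peak / b_c    # computational intensity of the hardware [Flop/Byte]
    r = n_b / n_x         # inverse communication intensity [-]
    return omega * a / a_k * r

def time_two_terms(omega, n_op, n_x, l_peak, b_c):
    """Computation time plus communication time: omega*#op/l_peak + #x/b_c."""
    return omega * n_op / l_peak + n_x / b_c

def time_factored(omega, n_op, n_b, n_x, l_peak, b_c):
    """The same estimate written as (omega*#op/l_peak) * (1 + 1/x_k)."""
    x_k = intensity_parameter(omega, n_op, n_b, n_x, l_peak, b_c)
    return omega * n_op / l_peak * (1.0 + 1.0 / x_k)

# Hypothetical work load; hardware values in Flop/s and Byte/s.
omega, n_op, n_b, n_x = 0.25, 2.0e9, 1.6e10, 3.2e7
l_peak, b_c = 8.0e9, 1.5e9
t_a = time_two_terms(omega, n_op, n_x, l_peak, b_c)
t_b = time_factored(omega, n_op, n_b, n_x, l_peak, b_c)
print(t_a, t_b)
```

Note that $\#b$ cancels in the product $a \cdot r$, so $x_k$ depends on the data volume only through $\#x$.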

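11 Performance and Speed-up
  Performance
  $l = \frac{\#op}{t} \le \frac{l_k^{peak}}{\omega_k} \cdot \frac{x_k}{1 + x_k}$
  Speed-up
  $S = \frac{l(p)}{l(1)} = \frac{l_k(\omega_k < 1)}{l_k(\omega_k = 1)} = \frac{1 + x_k(\omega_k = 1)}{1 + \omega_k \, x_k(\omega_k = 1)}$
  with
  $x_k(\omega_k = 1) = \frac{a}{a_k} \, r = a \, \frac{b_c}{l_k^{peak}} \, r = \frac{a \, b_c^0}{l_k^{peak}} \cdot \frac{b_c}{b_c^0} \cdot r = \hat{x}_k \, z \, r$
  where $\hat{x}_k = a \, b_c^0 / l_k^{peak}$ and $z = b_c / b_c^0$.
  - General case with $\omega_k = \omega(k, p) / p$: $S = \frac{1 + \hat{x}_k z r}{1 + \omega(k, p) \, \hat{x}_k z r / p}$
  - Homogeneous case with $\omega(k, p) = 1$: $S = \frac{1 + \hat{x} z r}{1 + \hat{x} z r / p}$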
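
The speed-up formula has three limits that are easy to read off from code: $S = 1$ for $p = 1$, $S \to p$ when $\hat{x} z r$ is large (computation dominates), and $S \to 1 + \hat{x} z r$ when $p$ is large (communication dominates). The Python sketch below assumes that $\hat{x} z r$ is passed in as a single number `x1` that does not depend on $p$; the function and parameter names are hypothetical.

```python
def speedup(x1, p, omega=1.0):
    """S = (1 + x1) / (1 + omega * x1 / p), where x1 = x_hat * z * r
    is the value of x_k at omega_k = 1. omega = 1.0 is the homogeneous case."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return (1.0 + x1) / (1.0 + omega * x1 / p)

# Small x1: speed-up saturates early. Large x1: speed-up stays close to p.
for p in (1, 2, 4, 8, 16):
    print(p, round(speedup(3.0, p), 3), round(speedup(300.0, p), 3))
```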

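12 Application-oriented Analysis
  Application characterized by problem size $n$.
  Software parameters: $\#op \to \#op(n)$, $\#b \to \#b(n)$, $\#x \to \#x(n, p)$
  Analysis of the performance of a homogeneous cluster
  $l \le p \, l^{peak} \frac{x}{x + 1} = l^{peak} \frac{y \, r(n, p)}{1 + y \, r(n, p) / p}$ with $x = \hat{x} z \, r(n, p) / p = y \, r(n, p) / p$
  Writing $r(n, p) = c(n) / d(p)$ and absorbing $c(n)$ into $y$, so that $y = \hat{x} z \, c(n)$, gives $l \le l^{peak} \frac{y / d(p)}{1 + y / (d(p) \, p)}$.
  - Number of PUs $p_{1/2}$ necessary to reach half of the maximum performance of all $p$ PUs: $l(p_{1/2}) = \frac{1}{2} \, p \, l^{peak}$, which gives $y \, r(n, p_{1/2}) = p_{1/2}$
  - Number of PUs $p_{\max}$ to obtain the maximum of the performance: $\frac{dl}{dp} = 0$, which gives $p_{\max}^2 \, d'(p_{\max}) = y = \hat{x} z \, c(n)$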
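
Both conditions can be checked numerically for a concrete $d(p)$. The Python sketch below assumes the form $l(p) = l^{peak} \, (y / d(p)) / (1 + y / (d(p) \, p))$, treats the bound as an equality, and uses $d(p) = \sqrt{p}$ with $y = 200$ purely as an illustrative choice that is not taken from the slides. For this choice the conditions give $p_{1/2} = y^{2/3}$ and $p_{\max} = (2y)^{2/3}$; all function names are hypothetical.

```python
import math

def performance(p, y, d, l_peak=1.0):
    """l(p) = l_peak * (y / d(p)) / (1 + y / (d(p) * p))."""
    return l_peak * (y / d(p)) / (1.0 + y / (d(p) * p))

def efficiency(p, y, d):
    """Fraction of the aggregate peak p * l_peak that is reached, x / (1 + x)."""
    return performance(p, y, d) / p

def argmax_pus(y, d, p_limit):
    """Integer p in [1, p_limit] with the highest modelled performance (brute force)."""
    return max(range(1, p_limit + 1), key=lambda p: performance(p, y, d))

# Illustrative assumption: d(p) = sqrt(p), y = 200.
y = 200.0
d = math.sqrt
p_half = y ** (2.0 / 3.0)          # solves y / d(p) = p
p_max = (2.0 * y) ** (2.0 / 3.0)   # solves p^2 * d'(p) = y, d'(p) = 1 / (2 sqrt(p))
print(p_half, efficiency(p_half, y, d))
print(p_max, argmax_pus(y, d, 1000))
```

The brute-force maximum over integer $p$ should land next to the analytic $p_{\max}$, and the efficiency at $p_{1/2}$ should be one half.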

13 Compute resources for the simulations: bwGRID cluster
  Site            Nodes
  Mannheim          140
  Heidelberg        140
  Karlsruhe         140
  Stuttgart         420
  Tübingen          140
  Ulm/Konstanz      280
  Freiburg          140
  Esslingen         180
  Total            1580
  [Map of the sites: Mannheim, Heidelberg, Karlsruhe, Stuttgart, Esslingen, Freiburg, Tübingen, Ulm; annotations "interconnected to a single cluster" and, for Ulm, "joint cluster with Konstanz"; the labels Frankfurt and München also appear]

14 bwGRID Hardware
  Node configuration
  - 2 Intel Xeon CPUs, 2.8 GHz (each CPU with 4 cores)
  - 16 GB memory
  - 140 GB hard drive (since January 2009)
  - InfiniBand network (20 Gbit/sec)
  Hardware parameters for our model
  - $l^{peak} = 8$ GFlop/sec (for one core)
  - $b_c = 1.5$ GByte/sec (node-to-node)
  - $b_c^0 = 1.0$ GByte/sec (reference bandwidth)

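15 Scalar-Product of two Vectors
  $(u, v) = \sum_k u_k v_k$
  Software parameters (word size $w = 8$ bytes)
  - $\#op = 2n - 1 \approx 2n$ if $n \gg 1$
  - $\#b = 2 n w$
  - $\#x = p \, w = 8p$
  Speed-up
  $S = \frac{1 + x}{1 + x / p}$ with $x = \frac{3}{64} \cdot \frac{n}{p}$
  Simulations
  - Vector sizes up to $n =$ [value missing]
  - [number missing] runs for each configuration $(p, n)$
  - Speed-up calculated from mean run-times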
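
The constant $3/64$ follows from the model parameters: $a = \#op / \#b = 1/w$, $\hat{x} = a \, b_c^0 / l^{peak} = 1/64$, $z = b_c / b_c^0 = 1.5$ and $r = \#b / \#x = 2n / p$. The Python sketch below recomputes $x = \hat{x} z r$ from these ingredients, using the bwGRID hardware values and the approximation $\#op \approx 2n$, and prints it next to $3n / (64p)$. The function names are hypothetical.

```python
W = 8            # bytes per word, as implied by #x = p*w = 8p
L_PEAK = 8.0e9   # Flop/s for one core (bwGRID)
B_C = 1.5e9      # Byte/s, node-to-node bandwidth (bwGRID)
B_C0 = 1.0e9     # Byte/s, reference bandwidth

def x_scalar_product(n, p):
    """x = x_hat * z * r for the scalar product, using #op ~ 2n."""
    n_op = 2.0 * n
    n_b = 2.0 * n * W
    n_x = p * W
    x_hat = (n_op / n_b) * B_C0 / L_PEAK   # a * b_c^0 / l_peak
    z = B_C / B_C0
    r = n_b / n_x
    return x_hat * z * r

def speedup_scalar_product(n, p):
    """S = (1 + x) / (1 + x / p) with x depending on n and p."""
    x = x_scalar_product(n, p)
    return (1.0 + x) / (1.0 + x / p)

n = 10**6
for p in (1, 8, 64, 512):
    print(p, x_scalar_product(n, p), 3.0 * n / (64.0 * p), speedup_scalar_product(n, p))
```

Because $x$ itself falls off as $1/p$, the modelled speed-up for a fixed $n$ does not grow without bound as more PUs are added.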

16 Speed-up for Scalar Product
  [Figure: speed-up $S(p)$ versus number of PUs $p$ for the scalar product with vector size $n$; experimental and theoretical curves for four sizes, of which $n = 10^5$, $10^6$ and $10^7$ are legible]

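17 Matrix Multiplication
  $A_{n \times n} \cdot B_{n \times n} = C_{n \times n}$ on a $\sqrt{p} \times \sqrt{p}$ processor grid
  Software parameters
  - $\#op = 2n^3 - n^2 \approx 2n^3$
  - $\#b = 2n^2 w$
  - $\#x = 2n^2 \sqrt{p} \, (1 - 1/\sqrt{p}) \, w \approx 2n^2 w \sqrt{p}$
  Speed-up
  $S = \frac{1 + x}{1 + x / p}$ with $x = \hat{x} z r = \frac{3}{128} \cdot \frac{n}{\sqrt{p}}$
  Simulations
  - Matrix sizes up to $n =$ [value missing]
  - Cannon's algorithm
  - Runs with 8 and 4 cores per node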
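
Cannon's algorithm arranges the PUs as a $\sqrt{p} \times \sqrt{p}$ grid, gives each PU one block of $A$ and one block of $B$, and alternates local block products with cyclic shifts of the blocks. The Python sketch below is a serial emulation of that data movement on nested lists. It is meant only to illustrate the scheme and is not the authors' parallel implementation; all names are hypothetical. Each of the $\sqrt{p} - 1$ shift steps moves one block of $A$ and one block of $B$ per PU, that is $2n^2 w$ bytes in total, which is consistent with $\#x = 2n^2 \sqrt{p} \, (1 - 1/\sqrt{p}) \, w$ if the initial alignment is not counted.

```python
def matmul(a, b):
    """Plain triple-loop product of two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_into(c, d):
    """Accumulate block d into block c in place."""
    for row_c, row_d in zip(c, d):
        for j, value in enumerate(row_d):
            row_c[j] += value

def cannon(a, b, q):
    """Multiply n x n matrices on an emulated q x q grid (q = sqrt(p))."""
    n = len(a)
    if n % q != 0:
        raise ValueError("q must divide n")
    s = n // q  # block edge length

    def block(m, i, j):
        return [row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]

    # Initial alignment: PU (i, j) holds A[i][(i+j) % q] and B[(i+j) % q][j].
    blk_a = [[block(a, i, (i + j) % q) for j in range(q)] for i in range(q)]
    blk_b = [[block(b, (i + j) % q, j) for j in range(q)] for i in range(q)]
    blk_c = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]

    for step in range(q):
        for i in range(q):
            for j in range(q):
                add_into(blk_c[i][j], matmul(blk_a[i][j], blk_b[i][j]))
        if step < q - 1:
            # Shift A blocks one position left and B blocks one position up.
            blk_a = [[blk_a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            blk_b = [[blk_b[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble the full result from the per-PU blocks.
    return [[blk_c[i // s][j // s][i % s][j % s] for j in range(n)]
            for i in range(n)]

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[2, 0, 1, 3], [1, 1, 0, 2], [4, 5, 6, 7], [0, 3, 2, 1]]
print(cannon(a, b, 2) == matmul(a, b))
```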

18 Speed-up for Matrix Multiplication

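19 Linpack
  Solution of $Ax = b$
  Software parameters
  - $\#op = \frac{2}{3} n^3$
  - $\#b = 2n^2 w$
  - $\#x = 3\alpha \left( 1 + \frac{\log_2 p}{12} \right) n^2 w$
  Speed-up
  $S \approx \frac{1 + x}{1 + x / p}$ with $x = \hat{x} z r = \frac{n}{128} \cdot \frac{2}{3\alpha \, (1 + \log_2 p / 12)}$
  Simulations
  - Matrix sizes up to $n =$ [value missing], and $\alpha = 1/3$
  - Smaller $\alpha$ would lead to better fits for small $p$.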

20 Speed-up for Linpack

21 Linpack on bwGRID
  - Half of peak performance at: $p_{1/2} = \frac{y}{3\alpha} = \frac{n}{128}$
  - Maximum performance at: $p_{\max} = \frac{24 \ln 2}{128} \, n = 24 \ln(2) \, p_{1/2}$
  - Region with good performance for $n = 10000$: $p \in [p_{1/2}, p_{\max}] = [80, 1300]$
  - Maximum performance: $l_{\max} = l^{peak} \, \frac{y}{3\alpha} \cdot \frac{9}{10}$, that is $l_{\max} \approx 560$ GFlop/sec for $n = 10000$
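
The numbers on this slide can be reproduced by evaluating the closed-form expressions directly. The Python sketch below assumes $y = n / 128$ (which is what $y / (3\alpha) = n / 128$ implies for $\alpha = 1/3$) and the bwGRID per-core peak of 8 GFlop/sec; the function name is hypothetical.

```python
import math

L_PEAK = 8.0        # GFlop/s for one core (bwGRID)
ALPHA = 1.0 / 3.0   # value used for the simulations

def linpack_estimates(n, l_peak=L_PEAK, alpha=ALPHA):
    """Evaluate the closed-form expressions quoted above, with y = n / 128."""
    y = n / 128.0
    p_half = y / (3.0 * alpha)                        # half of peak performance
    p_max = 24.0 * math.log(2.0) * p_half             # maximum performance
    l_max = l_peak * y / (3.0 * alpha) * 9.0 / 10.0   # GFlop/s
    return p_half, p_max, l_max

p_half, p_max, l_max = linpack_estimates(10_000)
print(p_half, p_max, l_max)
```

For $n = 10000$ this gives values that round to the quoted interval $[80, 1300]$ and to roughly 560 GFlop/sec.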

22 TOP500
  Maximum performance: $l_{\max} = \frac{n \, b_c}{3w} \cdot \frac{9}{10}$
  - In the TOP500 list, $l_{\max}$ corresponds to $R_{\max}$ and $n$ to $N_{\max}$.
  - The bandwidth $b_c$ is not in the list.
  Derive the effective bandwidth: $b_c^{eff} = R_{\max} \cdot \frac{3w}{N_{\max}} \cdot \frac{10}{9}$
  Analyze which parameter predicts the ranking best
  - first 100 systems, excluding systems with accelerators and systems with missing $N_{\max}$
  - comparison with single-core performance $l^{peak} = R_{\max} / p_{\max}$
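
The effective bandwidth is simply the maximum-performance formula solved for $b_c$. The Python sketch below implements both directions and checks them with a round trip using the bwGRID values ($n = 10000$, $b_c = 1.5$ GByte/sec) rather than any TOP500 entry. It assumes a word size of $w = 8$ bytes, and the function names are hypothetical.

```python
W = 8  # bytes per word (assumed double precision)

def l_max(n, b_c, w=W):
    """l_max = n * b_c / (3 w) * 9/10; GFlop/s if b_c is given in GByte/s."""
    return n * b_c / (3.0 * w) * 9.0 / 10.0

def effective_bandwidth(r_max, n_max, w=W):
    """b_c^eff = R_max * 3 w / N_max * 10/9; GByte/s if R_max is in GFlop/s."""
    return r_max * 3.0 * w / n_max * 10.0 / 9.0

# Round trip: model performance for a known bandwidth, then recover the bandwidth.
r = l_max(10_000, 1.5)
print(r, effective_bandwidth(r, 10_000))
```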

23 TOP500 November 2011
  [Figure: Linpack performance per core (blue) and derived effective bandwidth (red) versus rank in the TOP500 list (Nov. 2011); axis labels $b_c^{eff}$ [GByte/sec] and $l^{th}$ [GFlop/sec]]

24 TOP500 November 2012
  [Figure: Linpack performance per core (blue) and derived effective bandwidth (red) versus rank in the TOP500 list (November 2012); axis labels $b_c^{eff}$ [GByte/sec] and $l^{th}$ [GFlop/sec]]

25 Conclusions
  - Developed a performance model which integrates the characteristics of hardware and software with a few parameters.
  - The model provides simple formulae for performance and speed-up.
  - Results compare reasonably well with simulations of standard applications.
  - The model allows estimation of the optimal size of a cluster for a given class of applications.
  - The model allows estimation of the maximum performance for a given class of applications.
  - Identified effective bandwidth as a key performance indicator for Linpack (TOP500) on compute clusters.
  Future work
  - Analysis of inhomogeneous clusters with asymmetric load distribution
  - Further applications: sparse matrix-vector operations and FFT
