2.5D algorithms for distributed-memory computing

1 2.5D algorithms for distributed-memory computing. UC Berkeley, July.

2 Outline: Introduction; Strong scaling; 2.5D LU factorization.

3 Strong scaling: solving science problems faster
- Parallel computers can solve bigger problems (weak scaling).
- Parallel computers can also solve a fixed problem faster (strong scaling).
- Obstacles to strong scaling: it may increase the relative cost of communication, and it may hurt load balance.
- How to reduce communication and maintain load balance? Reduce (minimize) communication along the critical path, and exploit the network topology.

4 Topology-aware multicasts (BG/P vs Cray)
[Figure: bandwidth (MB/sec) vs #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6.]

5 2D rectangular multicast trees
[Figure: four edge-disjoint spanning trees on a 4x4 2D torus, shown separately and all combined around the root.]

6 Blocking matrix multiplication
[Figure: blocked decomposition of the matrices A and B for multiplication.]

7 2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97]
[Figure: 16 processors in a 4x4 grid, each owning blocks of A and B.]
- O(n^2/p) bytes of memory
- O(n^3/p) flops
- O(n^2/√p) words moved
- O(√p) messages
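The shift-and-multiply structure of this algorithm is easy to see in a serial simulation; a minimal sketch, assuming a q x q grid with q dividing n evenly (the function cannon_mm is illustrative, not the benchmarked implementation):

```python
import numpy as np

def cannon_mm(A, B, q):
    # Simulate Cannon's algorithm on a q x q processor grid: skew A and B,
    # then alternate local block multiplies with single-step circular
    # shifts (A left, B up) for q rounds.
    n = A.shape[0]
    b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    C = np.zeros((n, n))
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]             # skew row i of A left by i
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]  # skew column j of B up by j
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                C[i*b:(i+1)*b, j*b:(j+1)*b] += Ab[i][j] @ Bb[i][j]   # local block product
        Ab = [row[1:] + row[:1] for row in Ab]                       # shift A left by one
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B up by one
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(cannon_mm(A, B, 4), A @ B)
```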

8 3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89], [McColl and Tiskin 99]
[Figure: 64 processors in a 4x4x4 grid.]
- O(n^3/p) flops
- O(n^2/p^(2/3)) words moved
- O(1) messages
- O(n^2/p^(2/3)) bytes of memory
- 4 copies of the matrices

9 2.5D matrix multiplication [McColl and Tiskin 99]
[Figure: 32 processors in a 4x4x2 grid.]
- O(n^3/p) flops
- O(n^2/√(cp)) words moved
- O(√(p/c^3)) messages
- O(c·n^2/p) bytes of memory
- 2 copies of the matrices

10 2.5D strong scaling
n = dimension, p = #processors, c = #copies of data, where c must satisfy 1 ≤ c ≤ p^(1/3).
Special cases: c = 1 yields the 2D algorithm; c = p^(1/3) yields the 3D algorithm.
cost(2.5D MM(p, c)) = O(n^3/p) flops + O(n^2/√(cp)) words moved + O(√(p/c^3)) messages
(ignoring log(p) factors)

11 2.5D strong scaling
With the same setup (1 ≤ c ≤ p^(1/3)):
cost(2D MM(p)) = O(n^3/p) flops + O(n^2/√p) words moved + O(√p) messages = cost(2.5D MM(p, 1))
(ignoring log(p) factors)

12 2.5D strong scaling
cost(2.5D MM(cp, c)) = O(n^3/(cp)) flops + O(n^2/(c√p)) words moved + O(√p/c) messages = cost(2D MM(p))/c
This is perfect strong scaling.
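The perfect-strong-scaling claim can be checked directly from the cost model; a minimal sketch with hypothetical names, ignoring constants and log(p) factors as the slides do:

```python
def cost_25d_mm(n, p, c):
    # asymptotic 2.5D MM costs from the slides (leading terms only)
    assert 1 <= c <= round(p ** (1.0 / 3.0))
    flops = n**3 / p
    words = n**2 / (c * p) ** 0.5
    msgs = (p / c**3) ** 0.5
    return flops, words, msgs

# c = 1 recovers the 2D costs; using c-fold more processors and c copies
# of the data divides every 2D cost term by c: perfect strong scaling.
n, p, c = 65536, 1024, 4
two_d = cost_25d_mm(n, p, 1)
two_half_d = cost_25d_mm(n, c * p, c)
assert all(abs(x - y / c) < 1e-9 * y for x, y in zip(two_half_d, two_d))
```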

13 Strong scaling matrix multiplication
[Figure: percentage of machine peak vs p for 2.5D MM, 2D MM, and ScaLAPACK PDGEMM on BG/P, n=65,536.]

14 2.5D MM on 65,536 cores
[Figure: percentage of flops peak vs n for 2.5D MM and 2D MM on 16,384 nodes of BG/P.]

15 Cost breakdown of MM on 65,536 cores
[Figure: execution time normalized by 2D, split into communication, idle, and computation, for 2D and 2.5D MM on 16,384 nodes of BG/P; 2.5D achieves a 95% reduction in communication.]

16 2.5D recursive LU
A = LU, where L is lower-triangular and U is upper-triangular.
A 2.5D recursive LU algorithm with no pivoting is given by [A. Tiskin 2002]. Tiskin gives the algorithm under the BSP (Bulk Synchronous Parallel) model, which considers communication and synchronization. We give an alternative distributed-memory adaptation and implementation. We also lower-bound the latency cost.
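For reference, a serial sketch of the recursive LU skeleton that the 2.5D algorithm distributes; no pivoting, so the example matrix is made diagonally dominant (this is only the mathematical recursion, not Tiskin's BSP algorithm or our distributed implementation):

```python
import numpy as np

def recursive_lu(A):
    # Recursive LU without pivoting: factor the leading block, solve for
    # the panels, update the Schur complement, recurse on it.
    n = A.shape[0]
    if n == 1:
        return np.array([[1.0]]), A.copy()
    k = n // 2
    L11, U11 = recursive_lu(A[:k, :k])
    U12 = np.linalg.solve(L11, A[:k, k:])        # L11 U12 = A12
    L21 = np.linalg.solve(U11.T, A[k:, :k].T).T  # L21 U11 = A21
    S = A[k:, k:] - L21 @ U12                    # Schur complement update
    L22, U22 = recursive_lu(S)
    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - k, k)), U22]])
    return L, U

A = np.random.rand(8, 8) + 8 * np.eye(8)  # diagonally dominant: safe without pivoting
L, U = recursive_lu(A)
assert np.allclose(L @ U, A)
```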

17 2D blocked LU factorization
[Figure: the input matrix A.]

18 2D blocked LU factorization
[Figure: the top-left block is factored into L₀₀ and U₀₀.]

19 2D blocked LU factorization
[Figure: the next step of the blocked factorization.]

20 2D blocked LU factorization
[Figure: Schur complement update S = A - L·U on the trailing matrix.]

21 2D block-cyclic decomposition
[Figure: block-cyclic assignment of matrix blocks to the processor grid.]

22 2D block-cyclic LU factorization
[Figure: factorization of the leading block on the block-cyclic layout.]

23 2D block-cyclic LU factorization
[Figure: the next step of the block-cyclic factorization.]

24 2D block-cyclic LU factorization
[Figure: Schur complement update S = A - L·U on the block-cyclic layout.]

25 A new latency lower bound for LU
Relate volume to surface area to diameter. For block size n/d, LU must perform, along the critical path through the diagonal blocks A₀₀, A₁₁, ..., A_{d-1,d-1}:
- Ω(n^3/d^2) flops
- Ω(n^2/d) words
- Ω(d) messages
Now pick d (= latency cost): d = Ω(√p) to minimize flops, d = Ω(√(cp)) to minimize words.
More generally, latency × bandwidth = Ω(n^2).
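Written out, the counting argument behind these bounds is short; a sketch in the slide's notation:

```latex
% d diagonal blocks of size n/d lie on the critical path; each must be
% factored (about (n/d)^3 flops), communicated (about (n/d)^2 words),
% and synchronized (at least one message) before the next can start:
F = \Omega\bigl(d \, (n/d)^3\bigr) = \Omega(n^3/d^2), \qquad
W = \Omega\bigl(d \, (n/d)^2\bigr) = \Omega(n^2/d), \qquad
S = \Omega(d).
% Eliminating d yields the latency-bandwidth tradeoff
S \cdot W = \Omega(n^2).
% Matching the flop bound F = O(n^3/p) forces d = \Omega(\sqrt{p});
% matching the 2.5D bandwidth bound W = O(n^2/\sqrt{cp}) forces d = \Omega(\sqrt{cp}).
```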

26 2.5D LU factorization
[Figure, step (A): the top-left block is factored into L₀₀ and U₀₀, which are used to compute the corresponding L and U panel blocks.]

27 2.5D LU factorization
[Figure: steps (A) and (B) of the 2.5D LU algorithm.]

28 2.5D LU factorization
[Figure: steps (A) through (D) of the 2.5D LU algorithm.]

29 2.5D LU factorization
Look at how this update is distributed. What does it remind you of?

30 2.5D LU factorization
Look at how this update is distributed: it is the same 3D update as in matrix multiplication. [Figure.]

31 2.5D LU strong scaling (without pivoting)
[Figure: percentage of machine peak vs p for 2.5D LU (no pivoting) and 2D LU (no pivoting) on BG/P, n=65,536.]

32 2.5D LU with pivoting
A = P·L·U, where P is a permutation matrix.
- 2.5D generic pairwise elimination (neighbor/pairwise pivoting, or Givens rotations for QR) is given by [A. Tiskin 2007].
- Pairwise pivoting does not produce an explicit L.
- Pairwise pivoting may have stability issues for large matrices.
- Our approach uses tournament pivoting, which is more stable than pairwise pivoting and gives L explicitly.
- We pass up rows of A instead of rows of U to avoid error accumulation.

33 Tournament pivoting (CA-pivoting)
{P, L, U} = CA-pivot(A, n)
if n ≤ b then
    (base case) {P, L, U} = partial-pivot(A)
else
    (recursive case)
    [A₁ᵀ, A₂ᵀ]ᵀ = A
    {P₁, L₁, U₁} = CA-pivot(A₁);  [R₁ᵀ, R₂ᵀ]ᵀ = P₁ᵀ A₁
    {P₂, L₂, U₂} = CA-pivot(A₂);  [S₁ᵀ, S₂ᵀ]ᵀ = P₂ᵀ A₂
    {Pᵣ, L, U} = partial-pivot([R₁ᵀ, S₁ᵀ]ᵀ)
    form P from Pᵣ, P₁, and P₂
end if

34 Tournament pivoting
Partial pivoting is not communication-optimal on a blocked matrix: it requires a message/synchronization for each column, i.e. O(n) messages. Tournament pivoting is communication-optimal: it performs a tournament to determine the best pivot-row candidates, and passes up the best rows of A.
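A serial sketch of the tournament on a tall panel, following the CA-pivot pseudocode above (partial_pivot_perm and tournament_pivot_rows are hypothetical helper names, and only the pivot-row selection is shown, not the distributed reduction tree):

```python
import numpy as np

def partial_pivot_perm(A):
    # Row permutation chosen by Gaussian elimination with partial pivoting.
    A = A.astype(float).copy()
    m, n = A.shape
    perm = np.arange(m)
    for j in range(min(m, n)):
        p = j + np.argmax(np.abs(A[j:, j]))
        A[[j, p]], perm[[j, p]] = A[[p, j]], perm[[p, j]]  # swap pivot row up
        if A[j, j] != 0.0:
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return perm

def tournament_pivot_rows(A, b):
    # Recursively reduce candidate pivot rows: each half nominates its
    # best b rows, and a play-off partial pivoting picks the winners.
    if A.shape[0] <= 2 * b:
        return A[partial_pivot_perm(A)[:b]]
    half = A.shape[0] // 2
    R1 = tournament_pivot_rows(A[:half], b)
    R2 = tournament_pivot_rows(A[half:], b)
    stacked = np.vstack([R1, R2])
    return stacked[partial_pivot_perm(stacked)[:b]]

panel = np.random.rand(64, 4)
print(tournament_pivot_rows(panel, 4))  # the 4 selected pivot rows
```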

35 2.5D LU factorization with tournament pivoting
[Figure: tournament pivoting performed over the replicated panels PA₀, PA₁, PA₂, PA₃.]

36 2.5D LU factorization with tournament pivoting
[Figure: the selected pivot rows are applied, and the panel blocks of L and U are computed.]

37 2.5D LU factorization with tournament pivoting
[Figure: the trailing matrix is updated on each copy of the grid.]

38 2.5D LU factorization with tournament pivoting
[Figure: the update completes and the factorization proceeds on the next panel.]

39 Strong scaling of 2.5D LU with tournament pivoting
[Figure: percentage of machine peak vs p for 2.5D LU (CA-pivoting), 2D LU (CA-pivoting), and ScaLAPACK PDGETRF on BG/P, n=65,536.]

40 2.5D LU on 65,536 cores
[Figure: time (sec) on 16,384 nodes of BG/P (n=131,072), split into communication, idle, and compute, for 2D and 2.5D LU with no pivoting and with CA-pivoting; 2.5D LU is 2x faster in both cases.]

41 2.5D QR factorization
A = QR, where Q is orthogonal and R is upper-triangular.
2.5D QR using Givens rotations (generic pairwise elimination) is given by [A. Tiskin 2007]. Tiskin minimizes latency and bandwidth by working on slanted panels. 2.5D QR cannot be done with right-looking updates as in 2.5D LU, due to the non-commutativity of orthogonalization updates.

42 2.5D QR using the YT representation
The YT representation of Householder QR factorization is more work-efficient when computing only R. We give an algorithm that performs 2.5D QR using the YT representation. The algorithm performs left-looking updates on Y. Householder with YT requires less computation (roughly 2x) than Givens rotations. Our approach achieves optimal bandwidth cost, but has O(n) latency.
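To make the YT representation concrete, here is a dense Householder QR that accumulates Y and T so that Q = I - Y·T·Yᵀ; a minimal serial sketch using the standard compact-WY recurrence (not the 2.5D algorithm itself):

```python
import numpy as np

def householder_yt(A):
    # Householder QR in YT form: A = (I - Y T Y^T) R, with T built
    # column by column via the compact-WY update.
    m, n = A.shape
    R = A.astype(float).copy()
    Y = np.zeros((m, n))
    T = np.zeros((n, n))
    for j in range(n):
        v = np.zeros(m)
        v[j:] = R[j:, j]
        v[j] += np.copysign(np.linalg.norm(R[j:, j]), R[j, j])  # avoid cancellation
        v /= np.linalg.norm(v)
        tau = 2.0                      # v is unit-norm, so H = I - 2 v v^T
        R -= tau * np.outer(v, v @ R)  # apply the reflector to R
        T[:j, j] = -tau * T[:j, :j] @ (Y[:, :j].T @ v)  # WY recurrence
        T[j, j] = tau
        Y[:, j] = v
    return Y, T, R

A = np.random.rand(6, 4)
Y, T, R = householder_yt(A)
Q = np.eye(6) - Y @ T @ Y.T
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(6))
```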

43 2.5D QR using the YT representation
[Figure: the 2.5D QR algorithm with the YT representation.]

44 Latency-optimal 2.5D QR
To reduce latency, we can employ the TSQR algorithm:
1. Given an n-by-b panel, partition it into 2b-by-b blocks.
2. Perform QR on each 2b-by-b block.
3. Stack the computed R's into an n/2-by-b panel and recurse.
4. Q is given in a hierarchical representation.
We need the YT representation from the hierarchical Q: take Y to be the first b columns of Q minus the identity, and define T = (YᵀY)⁻¹. This sacrifices the triangular structure of T. Conjecture: this is stable if the diagonal elements of Q are selected to be negative.
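A serial sketch of the R-factor reduction in TSQR, assuming local QRs via np.linalg.qr and omitting the hierarchical Q that the slide then converts to YT form:

```python
import numpy as np

def tsqr_r(A, block_rows):
    # TSQR reduction for the R factor of a tall-skinny matrix:
    # QR-factor each block of rows, stack the small R factors, recurse.
    m, n = A.shape
    if m <= block_rows:
        return np.linalg.qr(A)[1]
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]
    stacked = np.vstack([np.linalg.qr(blk)[1] for blk in blocks])
    return tsqr_r(stacked, block_rows)

A = np.random.rand(64, 4)
R = tsqr_r(A, 8)  # 2b-by-b blocks with b = 4
# R agrees with a direct QR up to the signs of its rows
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1]))
```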

45 All-pairs shortest-paths (APSP)
Given an input graph G = (V, E), find the shortest paths between each pair of nodes vᵢ, vⱼ. The problem reduces to semiring matrix multiplication with a dependency along k, and its computational structure is similar to LU factorization.
Semiring matrix multiplication (SMM): replace scalar multiply with scalar add, and replace scalar add with scalar min. Depending on the processor, this can require more instructions.

46 A recursive algorithm for all-pairs shortest-paths
Known algorithm for recursively computing APSP (each SMM accumulates into its output block with ⊕ = min; see the sketch below):
1. Given the adjacency matrix A of graph G
2. Recurse on block A₁₁
3. Compute SMM: A₁₂ ← A₁₁ · A₁₂
4. Compute SMM: A₂₁ ← A₂₁ · A₁₁
5. Compute SMM: A₂₂ ← A₂₁ · A₁₂
6. Recurse on block A₂₂
7. Compute SMM: A₂₁ ← A₂₂ · A₂₁
8. Compute SMM: A₁₂ ← A₁₂ · A₂₂
9. Compute SMM: A₁₁ ← A₁₂ · A₂₁
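A serial sketch of SMM and of the nine steps above over the (min, +) semiring; it assumes a zero-diagonal adjacency matrix with np.inf marking absent edges, and folds the min-accumulation into each update:

```python
import numpy as np

def smm(A, B):
    # semiring matrix multiplication: + becomes min, * becomes +
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(A):
    # divide-and-conquer APSP following steps 1-9 on this slide
    n = A.shape[0]
    if n == 1:
        return A
    k = n // 2
    A = A.copy()
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    A11[:] = dc_apsp(A11)
    A12[:] = smm(A11, A12)                  # zero diagonal of A11 keeps old A12
    A21[:] = smm(A21, A11)
    A22[:] = np.minimum(A22, smm(A21, A12))
    A22[:] = dc_apsp(A22)
    A21[:] = smm(A22, A21)
    A12[:] = smm(A12, A22)
    A11[:] = np.minimum(A11, smm(A12, A21))
    return A

INF = np.inf
G = np.array([[0., 3., INF, 7.],
              [8., 0., 2., INF],
              [5., INF, 0., 1.],
              [2., INF, INF, 0.]])
print(dc_apsp(G))  # matrix of shortest-path distances
```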

47 Block-cyclic recursive parallelization
[Figure: cyclic and blocked recursive steps of the decomposition over a 4x4 processor grid (P11 ... P44), on submatrices of size n/2 and n/4.]

48 2.5D APSP
The 2.5D recursive parallelization is straightforward:
- Perform cyclic steps using a 2.5D process grid.
- Decompose blocked steps using an octant of the grid.
- Switch to the 2D algorithm when the grid is 2D.
This minimizes latency and bandwidth!

49 2.5D APSP strong scaling performance
[Figure: strong scaling of DC-APSP on Hopper; GFlops vs number of compute nodes (p), for 2.5D and 2D with n=32,768 and n=8,192.]

50 Block size gives a latency-bandwidth tradeoff
[Figure: performance (GFlops) of 2D DC-APSP on 256 nodes of Hopper (n=8,192) as a function of block size b.]

51 Blocked vs block-cyclic vs cyclic decompositions
[Figure: blocked, block-cyclic, and cyclic layouts of a symmetric matrix; green denotes fill (unique values), red denotes padding / load imbalance.]

52 Cyclops Tensor Framework (CTF)
Big idea: decompose tensors cyclically among processors, and pick the cyclic phase to preserve partial or full symmetric structure.
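Why the cyclic phase preserves symmetry is visible in a toy example; a minimal sketch for a symmetric matrix with the same (assumed) phase along both dimensions:

```python
import numpy as np

n, phase = 8, 2                 # assumed matrix size and cyclic phase
A = np.random.rand(n, n)
A = np.minimum(A, A.T)          # make A symmetric
for pi in range(phase):
    for pj in range(phase):
        local = A[pi::phase, pj::phase]   # block owned by processor (pi, pj)
        mirror = A[pj::phase, pi::phase]  # block owned by processor (pj, pi)
        # With equal phases, the two blocks are transposes of each other,
        # so only the pi <= pj half of the blocks needs to be stored.
        assert np.array_equal(local, mirror.T)
```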

53 3D tensor contraction
[Figure: a 3D tensor contraction.]

54 3D tensor cyclic decomposition
[Figure: cyclic decomposition of a 3D tensor.]

55 3D tensor mapping
[Figure: mapping of a 3D tensor onto a processor grid.]

56 A cyclic layout is still challenging
In order to retain partial symmetry, all symmetric dimensions of a tensor must be mapped with the same cyclic phase. The contracted dimensions of A and B must also be mapped with the same phase. And yet the virtual mapping needs to be mapped onto a physical topology, which can be any shape.

57 Virtual processor grid dimensions
Our virtual cyclic topology is somewhat restrictive, and the physical topology is very restricted. Virtual processor grid dimensions serve as a new level of indirection: if a tensor dimension must have a certain cyclic phase, we adjust the physical mapping by creating a virtual processor dimension. This allows the physical processor grid to be stretched.
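A toy illustration of this indirection: when two dimensions must share a cyclic phase but the physical grid is rectangular, a stretched virtual grid restores equal phases; a sketch assuming the 2x3 grid of the next slide (requires Python 3.9+ for math.lcm):

```python
from math import lcm

phys = (2, 3)              # physical processor grid (assumed)
phase = lcm(*phys)         # common cyclic phase for both dimensions
virt = (phase, phase)      # square virtual grid with equal phases
vblocks = (virt[0] // phys[0]) * (virt[1] // phys[1])
print(virt, vblocks)       # (6, 6): each processor owns 6 virtual blocks
```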

58 Virtual processor grid construction
[Figure: matrix multiply on a 2x3 processor grid; red lines represent the virtualized part of the processor grid, and elements are assigned to blocks by cyclic phase.]

59 2.5D algorithms for tensors
We incorporate data replication for communication minimization into CTF:
- Replicate only one tensor/matrix (minimizes bandwidth, but not latency).
- Autotune over mappings to all possible physical topologies.
- Select the mapping with the least amount of communication.
- Achieve minimal communication for tensors of widely different sizes.

60 Preliminary contraction results on Blue Gene/P
[Figure: Cyclops TF strong scaling; percentage of theoretical symmetric peak vs #nodes, for 4D symmetric (CCSD), 6D symmetric (CCSDT), and 8D symmetric (CCSDTQ) contractions.]

61 Preliminary Coupled Cluster results on Blue Gene/Q
A Coupled Cluster with Double excitations (CCD) implementation is up and running. It has already been scaled on up to 1024 nodes of BG/Q, with up to 400 virtual orbitals. Preliminary results already indicate performance matching NWChem. Several major optimizations are still in progress; we expect significantly better scalability than any existing software.

62 Our contributions:
- 2.5D mapping of matrix multiplication
- A new latency lower bound for LU
- Communication-optimal 2.5D LU, QR, and APSP
  - optimal according to the lower bounds of [Irony, Tiskin, Toledo 04] and [Aggarwal, Chandra, and Snir 90]
  - both are bandwidth-optimal according to the general lower bound of [Ballard, Demmel, Holtz, Schwartz 10]
  - LU is latency-optimal according to the new lower bound
- Cyclops Tensor Framework
  - runtime autotuning to minimize communication
  - topology-aware mapping in any dimension, with symmetry considerations

63 Backup slides: rectangular collectives

64 Performance of multicast (BG/P vs Cray)
[Figure: bandwidth (MB/sec) vs #nodes for a 1 MB multicast on BG/P, Cray XT5, and Cray XE6, as on slide 4.]

65 Why the performance discrepancy in multicasts?
- Cray machines use binomial multicasts: form a spanning tree from a list of nodes and route copies of the message down each branch; network contention degrades utilization on a 3D torus.
- BG/P uses rectangular multicasts: these require the network topology to be a k-ary n-cube, form 2n edge-disjoint spanning trees, route each in a different dimensional order, and use both directions of the bidirectional network.
