Algorithms as Multilinear Tensor Equations

1 Algorithms as Multilinear Tensor Equations

Edgar Solomonik, Department of Computer Science, ETH Zurich
Georgia Institute of Technology, Atlanta GA, USA
February 18, 2016

2 Pervasive paradigms in scientific computing

What commonalities exist among resource-intensive computations in simulation and data analysis?
- multidimensional datasets (observations, discretizations)
- higher-order relations between datasets, i.e. equations, maps, graphs, hypergraphs
- sparsity and symmetry in the structure of relations
- relations lead to a solution directly or by acting as an evolutionary (iterative) criterion
- algebraic descriptions of datasets and relations

3 Pervasive paradigms in scientific computing

What type of abstractions are desirable in high performance computing?
- data abstractions should reflect native dimensionality and structure
- global functional abstractions should efficiently orchestrate communication and synchronization
- abstractions should enable development of provably efficient algorithms

4 Outline
- Introduction to tensor computations
- Symmetry-preserving tensor algorithms
- Communication-avoiding parallel algorithms
- A massively-parallel tensor framework
- Applications to electronic structure calculations
- Conclusion

5 Basic data (vector) parallelism

6 Basic data (vector) parallelism

$c_i = f(a_i)$


8 Dense matrix operators

$c = Ab$

9 Sparse matrix operators

$c = Ab$

10 Sparse matrix operators

$c_i = \sum_j A_{ij} b_j$
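
A minimal sketch of this sparse operator in compressed sparse row (CSR) form, performing one multiplication per stored nonzero (the storage arrays rowptr/colidx/val are illustrative assumptions, not part of the slides):

// c = A*b for a sparse n-by-n matrix A in CSR format
void spmv(int n, const int *rowptr, const int *colidx,
          const double *val, const double *b, double *c) {
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int s = rowptr[i]; s < rowptr[i+1]; ++s)
      sum += val[s] * b[colidx[s]];  // accumulate A_ij * b_j over stored nonzeros
    c[i] = sum;
  }
}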

11 Low-rank matrix factorizations

$A = UV$, $c = UVb$

12 Low-rank matrix factorizations

$A_{ij} = \sum_k U_{ik} V_{kj}$, $c_i = \sum_{j,k} U_{ik} V_{kj} b_j$

13 Low-rank matrix factorizations

$c = U\Sigma Vb$
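
The factored form makes the operator cheap to apply: evaluating $c = U(Vb)$ with rank-k factors costs O(nk) rather than the O(n^2) of applying A explicitly. A minimal sketch, assuming dense row-major factors (not from the slides):

#include <vector>

// c = U * (V * b), with U of size n-by-k and V of size k-by-n
void lowrank_apply(int n, int k, const double *U, const double *V,
                   const double *b, double *c) {
  std::vector<double> t(k, 0.0);
  for (int l = 0; l < k; ++l)        // t = V*b : O(kn) work
    for (int j = 0; j < n; ++j)
      t[l] += V[l*n + j] * b[j];
  for (int i = 0; i < n; ++i) {      // c = U*t : O(nk) work
    c[i] = 0.0;
    for (int l = 0; l < k; ++l)
      c[i] += U[i*k + l] * t[l];
  }
}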

14 Tensor operator

$d_i = \sum_{j,k} A_{ijk} b_j c_k$

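Direct evaluation of this order-3 operator on dense data costs O(n^3) operations; a minimal sketch (row-major storage assumed, not from the slides):

// d_i = sum_{j,k} A_{ijk} b_j c_k, with A stored densely as n*n*n
void tensor_op(int n, const double *A, const double *b,
               const double *c, double *d) {
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k)
        sum += A[(i*n + j)*n + k] * b[j] * c[k];
    d[i] = sum;
  }
}

The factorized forms on the following slides reduce this cost by applying the factors one at a time.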

16 Low-rank tensor factorizations (CP)

$A_{ijk} = \sum_l U_{il} V_{jl} W_{kl}$, $d_i = \sum_{j,k,l} U_{il} V_{jl} W_{kl} b_j c_k$

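With rank-r CP factors, the operator can be applied in O(nr) time by contracting b with V and c with W before touching U. A minimal sketch (dense row-major factors are an illustrative assumption):

// d_i = sum_l U_{il} * (sum_j V_{jl} b_j) * (sum_k W_{kl} c_k)
void cp_apply(int n, int r, const double *U, const double *V, const double *W,
              const double *b, const double *c, double *d) {
  for (int i = 0; i < n; ++i) d[i] = 0.0;
  for (int l = 0; l < r; ++l) {
    double vb = 0.0, wc = 0.0;
    for (int j = 0; j < n; ++j) vb += V[j*r + l] * b[j];  // (V^T b)_l
    for (int k = 0; k < n; ++k) wc += W[k*r + l] * c[k];  // (W^T c)_l
    for (int i = 0; i < n; ++i) d[i] += U[i*r + l] * vb * wc;
  }
}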

18 Low-rank tensor factorizations (Tucker)

$A_{ijk} = \sum_{l,m,n} T_{lmn} U_{il} V_{jm} W_{kn}$, $d_i = \sum_{j,k,l,m,n} T_{lmn} U_{il} V_{jm} W_{kn} b_j c_k$

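The Tucker form can likewise be applied without materializing A, by contracting b and c into the core first. A minimal sketch, assuming all Tucker ranks equal r and dense row-major storage (not from the slides):

#include <vector>

// d = U * t, where t_l = sum_{m,q} T_{lmq} (V^T b)_m (W^T c)_q
void tucker_apply(int n, int r, const double *T, const double *U,
                  const double *V, const double *W,
                  const double *b, const double *c, double *d) {
  std::vector<double> vb(r, 0.0), wc(r, 0.0), t(r, 0.0);
  for (int m = 0; m < r; ++m)
    for (int j = 0; j < n; ++j) vb[m] += V[j*r + m] * b[j];
  for (int q = 0; q < r; ++q)
    for (int k = 0; k < n; ++k) wc[q] += W[k*r + q] * c[k];
  for (int l = 0; l < r; ++l)             // contract core with vb and wc
    for (int m = 0; m < r; ++m)
      for (int q = 0; q < r; ++q)
        t[l] += T[(l*r + m)*r + q] * vb[m] * wc[q];
  for (int i = 0; i < n; ++i) {           // d = U*t
    d[i] = 0.0;
    for (int l = 0; l < r; ++l) d[i] += U[i*r + l] * t[l];
  }
}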

20 Low-rank tensor factorizations (TT)

$A_{ijk} = \sum_{l,m} U_{il} V_{ljm} W_{mk}$, $d_i = \sum_{j,k,l,m} U_{il} V_{ljm} W_{mk} b_j c_k$


22 Tensor contractions in electronic structure methods

Amplitude equation snippet from the coupled cluster doubles model:
$0 = V^{abij} + \sum_k T^{abik} F^{kj} + \sum_{k,l,m,n} T^{abkl} V^{klmn} T^{mnij} + \dots$

Tensor hypercontraction representationᵃ:
$T^{abij} = \sum_{l,m} \psi_{al}\,\varphi_{il}\,Z_{lm}\,\varphi_{jm}\,\psi_{bm}$

ᵃ Hohenstein, Parrish, Sherrill, Martinez, JCP, 2012.

23 Structured tensor computations

Challenges for matrix/tensor algebraic abstractions:
- data and relation sparsity → tensor sparsity
- low-order representations of data → tensor decompositions
- implicitly defined relations → implicit tensor representations
- data and relation equivalences → tensor symmetries

24 Exploiting symmetry in tensors

Tensor symmetry (e.g. $A_{ij} = A_{ji}$) reduces memory and cost:
- for an order d tensor, d! less memory
- dot product: $\sum_{i,j} A_{ij} b_{ij} = 2\sum_{i<j} A_{ij} b_{ij} + \sum_i A_{ii} b_{ii}$
- matrix-vector multiplication:¹ $c_i = \sum_j A_{ij} b_j = \sum_j A_{ij}(b_i + b_j) - \bigl(\sum_j A_{ij}\bigr) b_i$
- rank-2 vector outer product:¹ $C_{ij} = a_i b_j + a_j b_i = (a_i + a_j)(b_i + b_j) - a_i b_i - a_j b_j$
- squaring a symmetric matrix (or AB + BA):¹ $C_{ij} = \sum_k A_{ik} A_{kj} = \sum_k (A_{ik} + A_{kj} + A_{ij})^2 - \dots$
- for an order ω contraction, ω! fewer multiplications¹

¹ S., Demmel; Technical Report, ETH Zurich.
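
A minimal sketch of the symmetry-exploiting matrix-vector product above: each of the n(n+1)/2 stored entries $A_{ij}$, $i \le j$, is multiplied only once, by $(b_i + b_j)$, and the correction term $(\sum_j A_{ij}) b_i$ costs additions plus n extra multiplications (packed upper-triangular storage is an illustrative assumption):

#include <vector>

// c = A*b for symmetric A in packed upper-triangular storage (i <= j)
void sym_matvec(int n, const double *Ap, const double *b, double *c) {
  std::vector<double> rowsum(n, 0.0);
  for (int i = 0; i < n; ++i) c[i] = 0.0;
  int s = 0;
  for (int i = 0; i < n; ++i)
    for (int j = i; j < n; ++j, ++s) {
      double t = Ap[s] * (b[i] + b[j]);  // one multiplication per stored entry
      c[i] += t;
      rowsum[i] += Ap[s];
      if (j != i) { c[j] += t; rowsum[j] += Ap[s]; }  // symmetric contribution
    }
  for (int i = 0; i < n; ++i)
    c[i] -= rowsum[i] * b[i];  // subtract (sum_j A_ij) * b_i
}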

25 Symmetry preserving algorithms

By exploiting symmetry, we can reduce the number of multiplications at the cost of more additions²:
- partially symmetric contractions: the symmetry preserving algorithm can be nested over each index group; a reduction in multiplications implies a reduction in nested calls; cost reductions: 1.3x for CCSD, 2.1x for CCSDT
- the algorithms generalize to most antisymmetric tensor contractions
- for Hermitian tensors, multiplication costs 3x more than addition; BLAS routines hemm and her2k, as well as LAPACK routines like hetrd (tridiagonal reduction), may be done with 25% fewer operations
- achieves $(2/3)n^3$ bilinear rank for squaring a nonsymmetric matrix, assuming elementwise commutativity
- allows blocking of symmetric contractions into smaller (anti)symmetric contractions

² S., Demmel; Technical Report, ETH Zurich.

26 Beyond computation cost

Algorithms should be not only work-efficient but communication-efficient:
- data movement and synchronization cost more energy than flops
- two types of data movement: vertical (intranode memory-cache) and horizontal (internode network transfers)
- parallel algorithm design involves tradeoffs between computation, communication, and synchronization
- lower bounds and parameterized algorithms provide optimal solutions within a well-defined tuning space

27 Cost model for parallel algorithms

Given a schedule that specifies all work and communication tasks on p processors, we consider the following costs, accumulated along chains of tasks (as in the α-β, BSP, and LogGP models):
- F: computation cost
- Q: vertical communication cost
- W: horizontal communication cost
- S: synchronization cost

28 Communication lower bounds: previous work

Multiplication of n-by-n matrices:
- horizontal communication lower bound³: $W_{MM} = \Omega(n^2 / p^{2/3})$
- memory-dependent horizontal communication lower bound⁴: $W_{MM} = \Omega(n^3 / (p\sqrt{M}))$
- with $M = cn^2/p$ memory, can hope to obtain $W = O(n^2/\sqrt{cp})$ communication
- standard parallel libraries (ScaLAPACK, Elemental) are optimal only for c = 1

³ Aggarwal, Chandra, Snir, TCS, 1990. ⁴ Irony, Toledo, Tiskin, JPDC, 2004.

29 Communication-efficient matrix multiplication

Communication-optimal algorithms for matrix multiplication have been studied extensively⁵ and continue to be attractive on modern architectures⁶.

[Plots: 2.5D SUMMA vs 2D SUMMA vs ScaLAPACK PDGEMM on BG/P (n = 65,536), percentage of machine peak; matrix multiplication strong scaling on Mira (BG/Q), Gigaflop/s/node vs #nodes, for 2.5D and 2D MM at n = 65,536 and n = 16,384.]

⁵ Berntsen, Par. Comp., 1989; Agarwal, Chandra, Snir, TCS, 1990; Agarwal, Balle, Gustavson, Joshi, Palkar, IBM, 1995; McColl, Tiskin, Algorithmica, 1999; ...
⁶ S., Bhatele, Demmel, SC, 2011.

30 Synchronization cost lower bounds

Unlike matrix multiplication, many algorithms in numerical linear algebra have polynomial-depth synchronization cost bounds⁷:
$S_{MM} = \Theta(n^3 / (pM^{3/2}))$
- algorithms for Cholesky, LU, QR, SVD have additional dependencies
- lowering computation and communication costs requires additional synchronization

⁷ Ballard, Demmel, Holtz, Schwartz, SIAM JMAA, 2011.

31 Tradeoffs in the diamond DAG

For the n-by-n diamond DAG, there is a tradeoff between computation and synchronization costs⁸:
$F \cdot S = \Omega(n^2)$

We generalize such tradeoffs to consider horizontal communication and arbitrary (polynomial or exponential) interval expansion⁹.

⁸ Papadimitriou, Ullman, SIAM JC, 1987. ⁹ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

32 Tradeoffs involving synchronization

We apply tradeoff lower bounds to dense linear algebra algorithms, represented via dependency hypergraphs.ᵃ

For triangular solve with an n-by-n matrix:
$F_{TRSV} \cdot S_{TRSV} = \Omega(n^2)$

For Cholesky of an n-by-n matrix:
$F_{CHOL} \cdot S^2_{CHOL} = \Omega(n^3)$, $W_{CHOL} \cdot S_{CHOL} = \Omega(n^2)$

The proof employs the classical Loomis-Whitney inequality: for any $R \subseteq N \times N \times N$, the three projections of R onto $N \times N$ have total size at least $|R|^{2/3}$.

ᵃ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

33 Communication-efficient LU factorization

For any $c \in [1, p^{1/3}]$, use $cn^2/p$ memory per processor and obtain
$W_{LU} = O(n^2/\sqrt{cp})$, $S_{LU} = O(\sqrt{cp})$
- LU with pairwise pivoting¹⁰ extended to tournament pivoting¹¹
- first implementation of a communication-optimal LU algorithm¹¹

[Plot: LU with tournament pivoting on BG/P (n = 65,536), percentage of machine peak vs #nodes, for 2.5D LU, 2D LU, and ScaLAPACK PDGETRF, with ideal scaling shown.]

¹⁰ Tiskin, FGCS, 2007. ¹¹ S., Demmel, Euro-Par, 2011.

34 Communication-efficient QR factorization

- $W_{QR} = O(n^2/\sqrt{cp})$, $S_{QR} = O(\sqrt{cp})$ using Givens rotations¹²
- Householder form can be reconstructed quickly from TSQR¹³
- optimal QR communication and synchronization costs (modulo log factors) can be obtained with the Householder representation¹⁴
- Householder aggregation yields performance improvements

[Plot: QR weak scaling on Hopper (15K-by-15K to 131K-by-131K), Teraflops vs #cores, for Two-Level CAQR-HR, Two-Level Householder, CAQR-HR, Elemental QR, ScaLAPACK QR, Scatter-Apply CAQR, and Binary-Tree CAQR.]

¹² Tiskin, FGCS, 2007. ¹³ Ballard, Demmel, Grigori, Jacquelin, Nguyen, Diep, S., IPDPS, 2014. ¹⁴ S., UCB, 2014.

35 Communication-efficient eigenvalue computation

For the dense symmetric matrix eigenvalue problem¹⁵:
$W_{SE} = O(n^2/\sqrt{cp})$, $S_{SE} = O(\sqrt{cp}\,\log^2 p)$
- optimal horizontal communication can be obtained with a left-looking algorithm and aggregation, but this requires more vertical communication
- successive band reduction can be used to minimize both communication costs¹⁶

¹⁵ S., UCB, 2014. ¹⁶ S., Hoefler, Demmel, in preparation.

36 Synchronization tradeoffs in stencils

Our lower bound analysis extends also to sparse iterative methods.¹⁶ For computing s applications of a $(2m+1)^d$-point stencil:
$F_{St} \cdot S_{St}^d = \Omega(m^{2d} s^{d+1})$, $W_{St} \cdot S_{St}^{d-1} = \Omega(m^d s^d)$
- the proof requires a generalization of the Loomis-Whitney inequality to an order-d set and order-(d-1) projections
- time-blocking lowers synchronization and vertical communication costs, but raises horizontal communication
- we suggest an alternative approach that minimizes vertical and horizontal communication, but not synchronization

¹⁶ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

37 Beyond the Loomis-Whitney inequalities

Loomis-Whitney inequalities are not sufficient for all computations:
- symmetry preserving tensor contraction algorithms have arbitrary-order projections from an order-d set
- bilinear algorithms¹⁷ provide a more general framework: a bilinear algorithm is defined by matrices $F^{(A)}, F^{(B)}, F^{(C)}$, with
$c = F^{(C)} [(F^{(A)T} a) \odot (F^{(B)T} b)]$
where $\odot$ is the Hadamard (pointwise) product
- communication lower bounds can be formulated in terms of rank¹⁸

¹⁷ Pan, Springer, 1984. ¹⁸ S., Hoefler, Demmel, in preparation.
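
A minimal sketch of evaluating a bilinear algorithm from its defining matrices (dense row-major storage and the function name bilinear are illustrative assumptions):

#include <vector>

// c = F_C * ((F_A^T a) .* (F_B^T b)), where F_A is na-by-R, F_B is nb-by-R,
// F_C is nc-by-R, and R is the bilinear rank of the algorithm
std::vector<double> bilinear(int na, int nb, int nc, int R,
                             const std::vector<double>& FA,
                             const std::vector<double>& FB,
                             const std::vector<double>& FC,
                             const std::vector<double>& a,
                             const std::vector<double>& b) {
  std::vector<double> u(R, 0.0), v(R, 0.0), c(nc, 0.0);
  for (int r = 0; r < R; ++r) {
    for (int i = 0; i < na; ++i) u[r] += FA[i*R + r] * a[i];  // u = F_A^T a
    for (int i = 0; i < nb; ++i) v[r] += FB[i*R + r] * b[i];  // v = F_B^T b
  }
  for (int i = 0; i < nc; ++i)
    for (int r = 0; r < R; ++r)
      c[i] += FC[i*R + r] * (u[r] * v[r]);  // R elementwise products, then F_C
  return c;
}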

38 Communication cost of symmetry preserving algorithms

For contraction of an order s+v tensor with an order v+t tensor:
- Υ is the nonsymmetric contraction algorithm
- Ψ is the best previously known algorithm
- Φ is the symmetry preserving algorithm

[Table of asymptotic communication lower bounds based on bilinear expansionᵃ (H cache size, p #processors, n dimension): for each case (s, t, v), the computation costs $F_\Upsilon$, $F_\Psi$, $F_\Phi$, vertical communication costs $Q_{\Upsilon,\Psi}$, $Q_\Phi$, and horizontal communication costs $W_\Upsilon$, $W_\Psi$, $W_\Phi$.]

ᵃ S., Hoefler, Demmel, ETHZ, 2014.

39 Open theoretical problems
- lower bounds for multiplication of a sparse and a dense matrix
- lower bounds for nested bilinear algorithms
- broader parameterizations of algorithmic representations, needed for QR and SVD lower bounds

40 Tensor algebra as a programming abstraction

Cyclops Tensor Framework¹⁹
- contraction/summation/functions of tensors
- distributed symmetric-packed/sparse storage via cyclic layout
- parallelization via MPI+OpenMP(+CUDA)

¹⁹ S., Hammond, Demmel, UCB; S., Matthews, Hammond, Demmel, IPDPS, 2013.

41 Tensor algebra as a programming abstraction

Jacobi iteration example code snippet:

void Jacobi(Matrix<> A, Vector<> b, int n){
  Matrix<> R(A);
  R["ii"] = 0.0;
  Vector<> x(n), d(n), r(n);
  Function<> inv([](double & d){ return 1./d; });
  d["i"] = inv(A["ii"]); // set d to inverse of diagonal of A
  do {
    x["i"] = d["i"]*(b["i"]-R["ij"]*x["j"]); // Jacobi update with off-diagonal part R
    r["i"] = b["i"]-A["ij"]*x["j"]; // compute residual
  } while (r.norm2() > 1.E-6); // check for convergence
}

42 Tensor algebra as a programming abstraction

Møller-Plesset perturbation theory (MP3) code snippet:

Z["abij"] += Fab["af"]*T["fbij"];
Z["abij"] -= Fij["ni"]*T["abnj"];
Z["abij"] += 0.5*Vabcd["abef"]*T["efij"];
Z["abij"] += 0.5*Vijkl["mnij"]*T["abmn"];
Z["abij"] -= Vaibj["amei"]*T["ebmj"];

43 Betweenness centrality

Betweenness centrality code snippet, for k of n nodes:

void btwn_central(Matrix<int> A, Matrix<path> P, int n, int k){
  Monoid<path> mon(...,
    [](path a, path b){
      if (a.w < b.w) return a;
      else if (b.w < a.w) return b;
      else return path(a.w, a.m+b.m);
    }, ...);
  Matrix<path> Q(n, k, mon); // shortest path matrix
  Q["ij"] = P["ij"];
  Function<int,path> append([](int w, path p){
    return path(w+p.w, p.m);
  });
  for (int i = 0; i < n; i++)
    Q["ij"] = append(A["ik"], Q["kj"]);
  ...
}

44 Performance of CTF for dense computations

CTF is highly tuned for massively-parallel machines:
- virtualized multidimensional processor grids
- topology-aware mapping and collective communication
- performance-model-driven decomposition done at runtime
- optimized redistribution kernels for tensor transposition

[Plot: BG/Q matrix multiplication, Teraflop/s vs #cores, CTF vs ScaLAPACK.]

45 Performance of CTF for sparse computations

MP3 leveraging sparse-dense tensor contractions:
[Plots: weak and strong scaling of MP3 (m=40, n=160 on 24 cores), seconds/iteration vs #cores, for dense and 16%/8%/4%/2%/1% sparse tensors.]

All-pairs shortest-paths based on path doubling with sparsification:
[Plots: weak and strong scaling of APSP (n=2K on 24 cores), seconds vs #cores, regular vs sparse path doubling.]

46 Post-Hartree-Fock (HF) methods

Accurate models of electronic correlation require approximation of the contributions of excited-state transitions:
- Møller-Plesset methods provide perturbative corrections
- coupled-cluster methods (CCSD, CCSDT, CCSDTQ) iteratively solve (2nd, 3rd, 4th) order equations in the state space
- tensor expressions naturally express high-order transitions
- tensor structure admits symmetries and sparsity: permutational index antisymmetry due to the antisymmetry of the wavefunction; sparsity due to the strength of interactions diminishing with growing distance in the molecular orbital basis

47 CCSD using CTF

Extracted from Aquarius (Devin Matthews' code):

FMI["mi"] += 0.5*WMNEF["mnef"]*T(2)["efin"];
WMNIJ["mnij"] += 0.5*WMNEF["mnef"]*T(2)["efij"];
FAE["ae"] -= 0.5*WMNEF["mnef"]*T(2)["afmn"];
WAMEI["amei"] -= 0.5*WMNEF["mnef"]*T(2)["afin"];
Z(2)["abij"] = WMNEF["ijab"];
Z(2)["abij"] += FAE["af"]*T(2)["fbij"];
Z(2)["abij"] -= FMI["ni"]*T(2)["abnj"];
Z(2)["abij"] += 0.5*WABEF["abef"]*T(2)["efij"];
Z(2)["abij"] += 0.5*WMNIJ["mnij"]*T(2)["abmn"];
Z(2)["abij"] -= WAMEI["amei"]*T(2)["ebmj"];

Other electronic structure codes using CTF include QChem (via Libtensor) and VASP.

48 Comparison with NWChem

NWChem is the most commonly-used distributed-memory quantum chemistry method suite:
- provides CCSD and CCSDT
- uses Global Arrays, a Partitioned Global Address Space (PGAS) backend, for tensor contractions
- derives equations via the Tensor Contraction Engine (TCE)

[Plots: strong scaling of CCSD and CCSDT on Edison, seconds vs #nodes, NWChem vs Aquarius-CTF for systems of 2 to 20 water molecules.]

49 Coupled cluster on IBM BlueGene/Q and Cray XC30

- CCSD up to 55 (50) water molecules with cc-pVDZ
- CCSDT up to 10 water molecules with cc-pVDZ

[Plots: weak scaling of Aquarius-CTF CCSD and CCSDT on BlueGene/Q and Edison, Teraflops and Gigaflops/node vs #nodes.]

50 Future work
- symmetry preserving algorithms: high-performance implementations
- bilinear algorithm complexity: fast matrix multiplication
- sparsity in tensor computations: handling multiple sparse operands and sparse output; worst-case lower bounds and efficient algorithms
- tensor algorithms: most algorithms correspond to multiple dependent tensor operations; scheduling, blocking, and decomposition of multiple tensor operations; programming abstractions for tensor factorizations
- application-driven development: tensor decompositions, sparsity, and symmetry are all motivated by electronic structure applications; many further applications in tensor networks (DMRG), machine learning, etc.

51 Backup slides

52 Symmetry preserving algorithm vs Strassen's algorithm

[Plot: speed-up over the classical direct evaluation algorithm, $[n^\omega/(s!t!v!)]$ / #multiplications, vs $n^s/s!$ (matrix dimension), for Strassen's algorithm and the symmetry preserving algorithm with ω=6 and ω=3 (s=t=v=ω/3).]

53 Nesting of bilinear algorithms

Given two bilinear algorithms
$\Lambda_1 = (F_1^{(A)}, F_1^{(B)}, F_1^{(C)})$ and $\Lambda_2 = (F_2^{(A)}, F_2^{(B)}, F_2^{(C)})$,
we can nest them by computing their tensor product:
$\Lambda_1 \otimes \Lambda_2 := (F_1^{(A)} \otimes F_2^{(A)},\; F_1^{(B)} \otimes F_2^{(B)},\; F_1^{(C)} \otimes F_2^{(C)})$
$\mathrm{rank}(\Lambda_1 \otimes \Lambda_2) = \mathrm{rank}(\Lambda_1) \cdot \mathrm{rank}(\Lambda_2)$
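
A minimal sketch of the Kronecker (matrix tensor) product used to form each defining matrix of the nested algorithm (dense row-major storage is an illustrative assumption):

#include <vector>

// C = A (x) B, where A is m-by-n and B is p-by-q, giving an (m*p)-by-(n*q)
// matrix; nesting applies this to each pair of defining matrices of the
// two bilinear algorithms
std::vector<double> kron(int m, int n, const std::vector<double>& A,
                         int p, int q, const std::vector<double>& B) {
  std::vector<double> C(m * p * n * q);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < p; ++k)
        for (int l = 0; l < q; ++l)
          C[(i*p + k)*(n*q) + (j*q + l)] = A[i*n + j] * B[k*q + l];
  return C;
}

The rank identity then follows directly: if the defining matrices of the two algorithms have R1 and R2 columns, the nested defining matrices have R1*R2 columns.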

54 Block-cyclic algorithm for s-step methods

For s steps of a $(2m+1)^d$-point stencil with block-size $H^{1/d}/m$:
$W_{Kr} = O\!\left(\frac{msn^d}{H^{1/d}p}\right)$, $S_{Kr} = O(sn^d/(pH))$, $Q_{Kr} = O\!\left(\frac{msn^d}{H^{1/d}p}\right)$

These are good when $H = \Theta(n^d/p)$, so the algorithm is useful when the cache size is a bit smaller than $n^d/p$.

55 2.5D LU on MIC

[Plot: LU factorization strong scaling on Stampede (MIC + Sandy Bridge), Gigaflop/s/node vs #nodes, for 2.5D hybrid, 2D hybrid, and 2D pure-CPU LU at n = 131,072 and n = 65,536.]

56 Topology-aware mapping on BG/Q

[Plot: LU factorization strong scaling on Mira (BG/Q), n = 65,536, Gigaflop/s/node vs #nodes, 2D LU with custom mapping vs default mapping.]

57 Symmetric matrix representation

58 Blocked distributions of a symmetric matrix

59 Cyclic distribution of a symmetric matrix

60 Our CCSD factorization

Credit to John F. Stanton and Jurgen Gauss:

$\tau^{ab}_{ij} = t^{ab}_{ij} + \frac{1}{2} P^a_b P^i_j\, t^a_i t^b_j$

$F^m_e = f^m_e + \sum_{fn} v^{mn}_{ef} t^f_n$

$F^a_e = (1-\delta_{ae}) f^a_e - \sum_m F^m_e t^a_m - \frac{1}{2}\sum_{mnf} v^{mn}_{ef} t^{af}_{mn} + \sum_{fn} v^{an}_{ef} t^f_n$

$F^m_i = (1-\delta_{mi}) f^m_i + \sum_e F^m_e t^e_i + \frac{1}{2}\sum_{nef} v^{mn}_{ef} t^{ef}_{in} + \sum_{fn} v^{mn}_{if} t^f_n$

61 Our CCSD factorization

$W^{mn}_{ei} = v^{mn}_{ei} + \sum_f v^{mn}_{ef} t^f_i$

$W^{mn}_{ij} = v^{mn}_{ij} + P^i_j \sum_e v^{mn}_{ie} t^e_j + \frac{1}{2}\sum_{ef} v^{mn}_{ef} \tau^{ef}_{ij}$

$W^{am}_{ie} = v^{am}_{ie} - \sum_n W^{mn}_{ei} t^a_n + \sum_f v^{ma}_{ef} t^f_i + \frac{1}{2}\sum_{nf} v^{mn}_{ef} t^{af}_{in}$

$W^{am}_{ij} = v^{am}_{ij} + P^i_j \sum_e v^{am}_{ie} t^e_j + \frac{1}{2}\sum_{ef} v^{am}_{ef} \tau^{ef}_{ij}$

$z^a_i = f^a_i - \sum_m F^m_i t^a_m + \sum_e f^a_e t^e_i + \sum_{em} v^{ma}_{ei} t^e_m + \sum_{em} F^m_e t^{ae}_{im} - \frac{1}{2}\sum_{emn} W^{mn}_{ei} t^{ea}_{mn} + \frac{1}{2}\sum_{efm} v^{am}_{ef} \tau^{ef}_{im}$

$z^{ab}_{ij} = v^{ab}_{ij} + P^i_j \sum_e v^{ab}_{ie} t^e_j + P^a_b \sum_e F^a_e t^{eb}_{ij} - P^i_j \sum_m F^m_i t^{ab}_{mj} + P^a_b P^i_j \sum_{em} W^{am}_{ie} t^{eb}_{mj} - P^a_b \sum_m W^{am}_{ij} t^b_m + \frac{1}{2}\sum_{ef} v^{ab}_{ef} \tau^{ef}_{ij} + \frac{1}{2}\sum_{mn} W^{mn}_{ij} \tau^{ab}_{mn}$

62 Stability of symmetry preserving algorithms

63 Performance breakdown on BG/Q

Performance data for a CCSD iteration with 200 electrons and 1000 orbitals on 4096 nodes of Mira (4 processes per node, 16 threads per process; total time: 18 mins; v: orbitals, o: electrons):

kernel         | % of time | complexity               | architectural bounds
DGEMM          | 45%       | O(v^4 o^2 / p)           | flops/mem bandwidth
broadcasts     | 20%       | O(v^4 o^2 / (p sqrt(M))) | multicast bandwidth
prefix sum     | 10%       | O(p)                     | allreduce bandwidth
data packing   | 7%        | O(v^2 o^2 / p)           | integer ops
all-to-all-v   | 7%        | O(v^2 o^2 / p)           | bisection bandwidth
tensor folding | 4%        | O(v^2 o^2 / p)           | memory bandwidth

64 Tiskin's path doubling algorithm

Tiskin gives a way to do path doubling in $F = O(n^3/p)$ operations. We can partition each $A^k$ by path size (number of edges):
$A^k = I \oplus A^k(1) \oplus A^k(2) \oplus \dots \oplus A^k(k)$
where each $A^k(l)$ contains the shortest paths of up to $k$ edges which have exactly $l$ edges.

We can see that $A^l(l) \supseteq A^{l+1}(l) \supseteq \dots \supseteq A^n(l) = A^\star(l)$; in particular, $A^\star(l)$ corresponds to a sparse subset of $A^l(l)$.

The algorithm works by picking $l \in [k/2, k]$ and computing $(I \oplus A)^{3k/2} = (I \oplus A^k(l)) \odot A^k$, which finds all paths of size up to $3k/2$ by taking all paths of size exactly $l \geq k/2$ followed by all paths of size up to $k$.
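
A minimal sketch of the tropical (min-plus) semiring product underlying each path-doubling step, on dense adjacency matrices (illustrative; INF marks absent edges):

#include <vector>
#include <algorithm>
#include <limits>

const double INF = std::numeric_limits<double>::infinity();

// Tropical semiring "matrix multiply": C_ij = min_k (A_ik + B_kj).
// Squaring the adjacency matrix this way doubles the number of edges
// covered by the shortest paths found so far.
void minplus_multiply(int n, const std::vector<double>& A,
                      const std::vector<double>& B, std::vector<double>& C) {
  C.assign(n * n, INF);
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) {
      if (A[i*n + k] == INF) continue;   // skip absent intermediate edges
      for (int j = 0; j < n; ++j)
        C[i*n + j] = std::min(C[i*n + j], A[i*n + k] + B[k*n + j]);
    }
}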
