Algorithms as Multilinear Tensor Equations

1 Algorithms as Multilinear Tensor Equations

Edgar Solomonik, Department of Computer Science, ETH Zurich
Georgia Institute of Technology, Atlanta GA, USA
February 18, 2016

2 Pervasive paradigms in scientific computing

What commonalities exist among resource-intensive computations in simulation and data analysis?
- multidimensional datasets (observations, discretizations)
- higher-order relations between datasets, i.e. equations, maps, graphs, hypergraphs
- sparsity and symmetry in the structure of relations
- relations lead to a solution directly or by acting as an evolutionary (iterative) criterion
- algebraic descriptions of datasets and relations

3 Pervasive paradigms in scientific computing

What type of abstractions are desirable in high performance computing?
- data abstractions should reflect native dimensionality and structure
- global functional abstractions should efficiently orchestrate communication and synchronization
- abstractions should enable development of provably efficient algorithms

4 Outline
- Introduction to tensor computations
- Symmetry-preserving tensor algorithms
- Communication-avoiding parallel algorithms
- A massively-parallel tensor framework
- Applications to electronic structure calculations
- Conclusion

5 Basic data (vector) parallelism

6 Basic data (vector) parallelism

$c_i = f(a_i)$


8 Dense matrix operators

$c = Ab$

9 Sparse matrix operators

$c = Ab$

10 Sparse matrix operators

$c_i = \sum_j A_{ij} b_j$
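
A minimal sketch of this sparse operator in compressed sparse row (CSR) form, performing one multiplication per stored nonzero (the storage arrays rowptr/colidx/val are illustrative assumptions, not part of the slides):

// c = A*b for a sparse n-by-n matrix A in CSR format
void spmv(int n, const int *rowptr, const int *colidx,
          const double *val, const double *b, double *c) {
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int s = rowptr[i]; s < rowptr[i+1]; ++s)
      sum += val[s] * b[colidx[s]];  // accumulate A_ij * b_j over stored nonzeros
    c[i] = sum;
  }
}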

11 Low-rank matrix factorizations

$A = UV$, $c = UVb$

12 Low-rank matrix factorizations

$A_{ij} = \sum_k U_{ik} V_{kj}$, $c_i = \sum_{j,k} U_{ik} V_{kj} b_j$

13 Low-rank matrix factorizations

$c = U\Sigma Vb$
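
The factored form makes the operator cheap to apply: evaluating $c = U(Vb)$ with rank-k factors costs O(nk) rather than the O(n^2) of applying A explicitly. A minimal sketch, assuming dense row-major factors (not from the slides):

#include <vector>

// c = U * (V * b), with U of size n-by-k and V of size k-by-n
void lowrank_apply(int n, int k, const double *U, const double *V,
                   const double *b, double *c) {
  std::vector<double> t(k, 0.0);
  for (int l = 0; l < k; ++l)        // t = V*b : O(kn) work
    for (int j = 0; j < n; ++j)
      t[l] += V[l*n + j] * b[j];
  for (int i = 0; i < n; ++i) {      // c = U*t : O(nk) work
    c[i] = 0.0;
    for (int l = 0; l < k; ++l)
      c[i] += U[i*k + l] * t[l];
  }
}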

14 Tensor operator

$d_i = \sum_{j,k} A_{ijk} b_j c_k$

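Direct evaluation of this order-3 operator on dense data costs O(n^3) operations; a minimal sketch (row-major storage assumed, not from the slides):

// d_i = sum_{j,k} A_{ijk} b_j c_k, with A stored densely as n*n*n
void tensor_op(int n, const double *A, const double *b,
               const double *c, double *d) {
  for (int i = 0; i < n; ++i) {
    double sum = 0.0;
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k)
        sum += A[(i*n + j)*n + k] * b[j] * c[k];
    d[i] = sum;
  }
}

The factorized forms on the following slides reduce this cost by applying the factors one at a time.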

16 Low-rank tensor factorizations (CP)

$A_{ijk} = \sum_l U_{il} V_{jl} W_{kl}$, $d_i = \sum_{j,k,l} U_{il} V_{jl} W_{kl} b_j c_k$

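With rank-r CP factors, the operator can be applied in O(nr) time by contracting b with V and c with W before touching U. A minimal sketch (dense row-major factors are an illustrative assumption):

// d_i = sum_l U_{il} * (sum_j V_{jl} b_j) * (sum_k W_{kl} c_k)
void cp_apply(int n, int r, const double *U, const double *V, const double *W,
              const double *b, const double *c, double *d) {
  for (int i = 0; i < n; ++i) d[i] = 0.0;
  for (int l = 0; l < r; ++l) {
    double vb = 0.0, wc = 0.0;
    for (int j = 0; j < n; ++j) vb += V[j*r + l] * b[j];  // (V^T b)_l
    for (int k = 0; k < n; ++k) wc += W[k*r + l] * c[k];  // (W^T c)_l
    for (int i = 0; i < n; ++i) d[i] += U[i*r + l] * vb * wc;
  }
}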

18 Low-rank tensor factorizations (Tucker)

$A_{ijk} = \sum_{l,m,n} T_{lmn} U_{il} V_{jm} W_{kn}$, $d_i = \sum_{j,k,l,m,n} T_{lmn} U_{il} V_{jm} W_{kn} b_j c_k$

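The Tucker form can likewise be applied without materializing A, by contracting b and c into the core first. A minimal sketch, assuming all Tucker ranks equal r and dense row-major storage (not from the slides):

#include <vector>

// d = U * t, where t_l = sum_{m,q} T_{lmq} (V^T b)_m (W^T c)_q
void tucker_apply(int n, int r, const double *T, const double *U,
                  const double *V, const double *W,
                  const double *b, const double *c, double *d) {
  std::vector<double> vb(r, 0.0), wc(r, 0.0), t(r, 0.0);
  for (int m = 0; m < r; ++m)
    for (int j = 0; j < n; ++j) vb[m] += V[j*r + m] * b[j];
  for (int q = 0; q < r; ++q)
    for (int k = 0; k < n; ++k) wc[q] += W[k*r + q] * c[k];
  for (int l = 0; l < r; ++l)             // contract core with vb and wc
    for (int m = 0; m < r; ++m)
      for (int q = 0; q < r; ++q)
        t[l] += T[(l*r + m)*r + q] * vb[m] * wc[q];
  for (int i = 0; i < n; ++i) {           // d = U*t
    d[i] = 0.0;
    for (int l = 0; l < r; ++l) d[i] += U[i*r + l] * t[l];
  }
}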

20 Low-rank tensor factorizations (TT)

$A_{ijk} = \sum_{l,m} U_{il} V_{ljm} W_{mk}$, $d_i = \sum_{j,k,l,m} U_{il} V_{ljm} W_{mk} b_j c_k$


22 Tensor contractions in electronic structure methods

Amplitude equation snippet from the coupled cluster doubles model:
$0 = V^{abij} + \sum_k T^{abik} F^{kj} + \sum_{k,l,m,n} T^{abkl} V^{klmn} T^{mnij} + \dots$

Tensor hypercontraction representationᵃ:
$T^{abij} = \sum_{l,m} \psi_{al}\,\varphi_{il}\,Z_{lm}\,\varphi_{jm}\,\psi_{bm}$

ᵃ Hohenstein, Parrish, Sherrill, Martinez, JCP, 2012.

23 Structured tensor computations

Challenges for matrix/tensor algebraic abstractions:
- data and relation sparsity → tensor sparsity
- low-order representations of data → tensor decompositions
- implicitly defined relations → implicit tensor representations
- data and relation equivalences → tensor symmetries

24 Exploiting symmetry in tensors

Tensor symmetry (e.g. $A_{ij} = A_{ji}$) reduces memory and cost:
- for an order d tensor, d! less memory
- dot product: $\sum_{i,j} A_{ij} b_{ij} = 2\sum_{i<j} A_{ij} b_{ij} + \sum_i A_{ii} b_{ii}$
- matrix-vector multiplication:¹ $c_i = \sum_j A_{ij} b_j = \sum_j A_{ij}(b_i + b_j) - \bigl(\sum_j A_{ij}\bigr) b_i$
- rank-2 vector outer product:¹ $C_{ij} = a_i b_j + a_j b_i = (a_i + a_j)(b_i + b_j) - a_i b_i - a_j b_j$
- squaring a symmetric matrix (or AB + BA):¹ $C_{ij} = \sum_k A_{ik} A_{kj} = \sum_k (A_{ik} + A_{kj} + A_{ij})^2 - \dots$
- for an order ω contraction, ω! fewer multiplications¹

¹ S., Demmel; Technical Report, ETH Zurich.
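
A minimal sketch of the symmetry-exploiting matrix-vector product above: each of the n(n+1)/2 stored entries $A_{ij}$, $i \le j$, is multiplied only once, by $(b_i + b_j)$, and the correction term $(\sum_j A_{ij}) b_i$ costs additions plus n extra multiplications (packed upper-triangular storage is an illustrative assumption):

#include <vector>

// c = A*b for symmetric A in packed upper-triangular storage (i <= j)
void sym_matvec(int n, const double *Ap, const double *b, double *c) {
  std::vector<double> rowsum(n, 0.0);
  for (int i = 0; i < n; ++i) c[i] = 0.0;
  int s = 0;
  for (int i = 0; i < n; ++i)
    for (int j = i; j < n; ++j, ++s) {
      double t = Ap[s] * (b[i] + b[j]);  // one multiplication per stored entry
      c[i] += t;
      rowsum[i] += Ap[s];
      if (j != i) { c[j] += t; rowsum[j] += Ap[s]; }  // symmetric contribution
    }
  for (int i = 0; i < n; ++i)
    c[i] -= rowsum[i] * b[i];  // subtract (sum_j A_ij) * b_i
}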

25 Symmetry preserving algorithms

By exploiting symmetry, we can reduce the number of multiplications at the cost of more additions²:
- partially symmetric contractions: the symmetry preserving algorithm can be nested over each index group; a reduction in multiplications implies a reduction in nested calls; cost reductions: 1.3x for CCSD, 2.1x for CCSDT
- the algorithms generalize to most antisymmetric tensor contractions
- for Hermitian tensors, multiplication costs 3x more than addition; BLAS routines hemm and her2k, as well as LAPACK routines like hetrd (tridiagonal reduction), may be done with 25% fewer operations
- achieves $(2/3)n^3$ bilinear rank for squaring a nonsymmetric matrix, assuming elementwise commutativity
- allows blocking of symmetric contractions into smaller (anti)symmetric contractions

² S., Demmel; Technical Report, ETH Zurich.

26 Beyond computation cost

Algorithms should be not only work-efficient but communication-efficient:
- data movement and synchronization cost more energy than flops
- two types of data movement: vertical (intranode memory-cache) and horizontal (internode network transfers)
- parallel algorithm design involves tradeoffs between computation, communication, and synchronization
- lower bounds and parameterized algorithms provide optimal solutions within a well-defined tuning space

27 Cost model for parallel algorithms

Given a schedule that specifies all work and communication tasks on p processors, we consider the following costs, accumulated along chains of tasks (as in the α-β, BSP, and LogGP models):
- F: computation cost
- Q: vertical communication cost
- W: horizontal communication cost
- S: synchronization cost

28 Communication lower bounds: previous work

Multiplication of n-by-n matrices:
- horizontal communication lower bound³: $W_{MM} = \Omega(n^2 / p^{2/3})$
- memory-dependent horizontal communication lower bound⁴: $W_{MM} = \Omega(n^3 / (p\sqrt{M}))$
- with $M = cn^2/p$ memory, can hope to obtain $W = O(n^2/\sqrt{cp})$ communication
- standard parallel libraries (ScaLAPACK, Elemental) are optimal only for c = 1

³ Aggarwal, Chandra, Snir, TCS, 1990. ⁴ Irony, Toledo, Tiskin, JPDC, 2004.

29 Communication-efficient matrix multiplication

Communication-optimal algorithms for matrix multiplication have been studied extensively⁵ and continue to be attractive on modern architectures⁶.

[Plots: 2.5D SUMMA vs 2D SUMMA vs ScaLAPACK PDGEMM on BG/P (n = 65,536), percentage of machine peak; matrix multiplication strong scaling on Mira (BG/Q), Gigaflop/s/node vs #nodes, for 2.5D and 2D MM at n = 65,536 and n = 16,384.]

⁵ Berntsen, Par. Comp., 1989; Agarwal, Chandra, Snir, TCS, 1990; Agarwal, Balle, Gustavson, Joshi, Palkar, IBM, 1995; McColl, Tiskin, Algorithmica, 1999; ...
⁶ S., Bhatele, Demmel, SC, 2011.

30 Synchronization cost lower bounds

Unlike matrix multiplication, many algorithms in numerical linear algebra have polynomial-depth synchronization cost bounds⁷:
$S_{MM} = \Theta(n^3 / (pM^{3/2}))$
- algorithms for Cholesky, LU, QR, SVD have additional dependencies
- lowering computation and communication costs requires additional synchronization

⁷ Ballard, Demmel, Holtz, Schwartz, SIAM JMAA, 2011.

31 Tradeoffs in the diamond DAG

For the n-by-n diamond DAG, there is a tradeoff between computation and synchronization costs⁸:
$F \cdot S = \Omega(n^2)$

We generalize such tradeoffs to consider horizontal communication and arbitrary (polynomial or exponential) interval expansion⁹.

⁸ Papadimitriou, Ullman, SIAM JC, 1987. ⁹ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

32 Tradeoffs involving synchronization

We apply tradeoff lower bounds to dense linear algebra algorithms, represented via dependency hypergraphs.ᵃ

For triangular solve with an n-by-n matrix:
$F_{TRSV} \cdot S_{TRSV} = \Omega(n^2)$

For Cholesky of an n-by-n matrix:
$F_{CHOL} \cdot S^2_{CHOL} = \Omega(n^3)$, $W_{CHOL} \cdot S_{CHOL} = \Omega(n^2)$

The proof employs the classical Loomis-Whitney inequality: for any $R \subseteq N \times N \times N$, the three projections of R onto $N \times N$ have total size at least $|R|^{2/3}$.

ᵃ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

33 Communication-efficient LU factorization

For any $c \in [1, p^{1/3}]$, use $cn^2/p$ memory per processor and obtain
$W_{LU} = O(n^2/\sqrt{cp})$, $S_{LU} = O(\sqrt{cp})$
- LU with pairwise pivoting¹⁰ extended to tournament pivoting¹¹
- first implementation of a communication-optimal LU algorithm¹¹

[Plot: LU with tournament pivoting on BG/P (n = 65,536), percentage of machine peak vs #nodes, for 2.5D LU, 2D LU, and ScaLAPACK PDGETRF, with ideal scaling shown.]

¹⁰ Tiskin, FGCS, 2007. ¹¹ S., Demmel, Euro-Par, 2011.

34 Communication-efficient QR factorization

- $W_{QR} = O(n^2/\sqrt{cp})$, $S_{QR} = O(\sqrt{cp})$ using Givens rotations¹²
- Householder form can be reconstructed quickly from TSQR¹³
- optimal QR communication and synchronization costs (modulo log factors) can be obtained with the Householder representation¹⁴
- Householder aggregation yields performance improvements

[Plot: QR weak scaling on Hopper (15K-by-15K to 131K-by-131K), Teraflops vs #cores, for Two-Level CAQR-HR, Two-Level Householder, CAQR-HR, Elemental QR, ScaLAPACK QR, Scatter-Apply CAQR, and Binary-Tree CAQR.]

¹² Tiskin, FGCS, 2007. ¹³ Ballard, Demmel, Grigori, Jacquelin, Nguyen, Diep, S., IPDPS, 2014. ¹⁴ S., UCB, 2014.

35 Communication-efficient eigenvalue computation

For the dense symmetric matrix eigenvalue problem¹⁵:
$W_{SE} = O(n^2/\sqrt{cp})$, $S_{SE} = O(\sqrt{cp}\,\log^2 p)$
- optimal horizontal communication can be obtained with a left-looking algorithm and aggregation, but this requires more vertical communication
- successive band reduction can be used to minimize both communication costs¹⁶

¹⁵ S., UCB, 2014. ¹⁶ S., Hoefler, Demmel, in preparation.

36 Synchronization tradeoffs in stencils

Our lower bound analysis extends also to sparse iterative methods.¹⁶ For computing s applications of a $(2m+1)^d$-point stencil:
$F_{St} \cdot S_{St}^d = \Omega(m^{2d} s^{d+1})$, $W_{St} \cdot S_{St}^{d-1} = \Omega(m^d s^d)$
- the proof requires a generalization of the Loomis-Whitney inequality to an order-d set and order-(d-1) projections
- time-blocking lowers synchronization and vertical communication costs, but raises horizontal communication
- we suggest an alternative approach that minimizes vertical and horizontal communication, but not synchronization

¹⁶ S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016).

37 Beyond the Loomis-Whitney inequalities

Loomis-Whitney inequalities are not sufficient for all computations:
- symmetry preserving tensor contraction algorithms have arbitrary-order projections from an order-d set
- bilinear algorithms¹⁷ provide a more general framework: a bilinear algorithm is defined by matrices $F^{(A)}, F^{(B)}, F^{(C)}$, with
$c = F^{(C)} [(F^{(A)T} a) \odot (F^{(B)T} b)]$
where $\odot$ is the Hadamard (pointwise) product
- communication lower bounds can be formulated in terms of rank¹⁸

¹⁷ Pan, Springer, 1984. ¹⁸ S., Hoefler, Demmel, in preparation.
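
A minimal sketch of evaluating a bilinear algorithm from its defining matrices (dense row-major storage and the function name bilinear are illustrative assumptions):

#include <vector>

// c = F_C * ((F_A^T a) .* (F_B^T b)), where F_A is na-by-R, F_B is nb-by-R,
// F_C is nc-by-R, and R is the bilinear rank of the algorithm
std::vector<double> bilinear(int na, int nb, int nc, int R,
                             const std::vector<double>& FA,
                             const std::vector<double>& FB,
                             const std::vector<double>& FC,
                             const std::vector<double>& a,
                             const std::vector<double>& b) {
  std::vector<double> u(R, 0.0), v(R, 0.0), c(nc, 0.0);
  for (int r = 0; r < R; ++r) {
    for (int i = 0; i < na; ++i) u[r] += FA[i*R + r] * a[i];  // u = F_A^T a
    for (int i = 0; i < nb; ++i) v[r] += FB[i*R + r] * b[i];  // v = F_B^T b
  }
  for (int i = 0; i < nc; ++i)
    for (int r = 0; r < R; ++r)
      c[i] += FC[i*R + r] * (u[r] * v[r]);  // R elementwise products, then F_C
  return c;
}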

38 Communication cost of symmetry preserving algorithms

For contraction of an order s+v tensor with an order v+t tensor:
- Υ is the nonsymmetric contraction algorithm
- Ψ is the best previously known algorithm
- Φ is the symmetry preserving algorithm

[Table of asymptotic communication lower bounds based on bilinear expansionᵃ (H cache size, p #processors, n dimension): for each case (s, t, v), the computation costs $F_\Upsilon$, $F_\Psi$, $F_\Phi$, vertical communication costs $Q_{\Upsilon,\Psi}$, $Q_\Phi$, and horizontal communication costs $W_\Upsilon$, $W_\Psi$, $W_\Phi$.]

ᵃ S., Hoefler, Demmel, ETHZ, 2014.

39 Open theoretical problems
- lower bounds for multiplication of a sparse and a dense matrix
- lower bounds for nested bilinear algorithms
- broader parameterizations of algorithmic representations, needed for QR and SVD lower bounds

40 Tensor algebra as a programming abstraction

Cyclops Tensor Framework¹⁹
- contraction/summation/functions of tensors
- distributed symmetric-packed/sparse storage via cyclic layout
- parallelization via MPI+OpenMP(+CUDA)

¹⁹ S., Hammond, Demmel, UCB; S., Matthews, Hammond, Demmel, IPDPS, 2013.

41 Tensor algebra as a programming abstraction

Jacobi iteration example code snippet:

void Jacobi(Matrix<> A, Vector<> b, int n){
  Matrix<> R(A);
  R["ii"] = 0.0;
  Vector<> x(n), d(n), r(n);
  Function<> inv([](double & d){ return 1./d; });
  d["i"] = inv(A["ii"]); // set d to inverse of diagonal of A
  do {
    x["i"] = d["i"]*(b["i"]-R["ij"]*x["j"]); // Jacobi update with off-diagonal part R
    r["i"] = b["i"]-A["ij"]*x["j"]; // compute residual
  } while (r.norm2() > 1.E-6); // check for convergence
}

42 Tensor algebra as a programming abstraction

Møller-Plesset perturbation theory (MP3) code snippet:

Z["abij"] += Fab["af"]*T["fbij"];
Z["abij"] -= Fij["ni"]*T["abnj"];
Z["abij"] += 0.5*Vabcd["abef"]*T["efij"];
Z["abij"] += 0.5*Vijkl["mnij"]*T["abmn"];
Z["abij"] -= Vaibj["amei"]*T["ebmj"];

43 Betweenness centrality

Betweenness centrality code snippet, for k of n nodes:

void btwn_central(Matrix<int> A, Matrix<path> P, int n, int k){
  Monoid<path> mon(...,
    [](path a, path b){
      if (a.w < b.w) return a;
      else if (b.w < a.w) return b;
      else return path(a.w, a.m+b.m);
    }, ...);
  Matrix<path> Q(n, k, mon); // shortest path matrix
  Q["ij"] = P["ij"];
  Function<int,path> append([](int w, path p){
    return path(w+p.w, p.m);
  });
  for (int i = 0; i < n; i++)
    Q["ij"] = append(A["ik"], Q["kj"]);
  ...
}

44 Performance of CTF for dense computations

CTF is highly tuned for massively-parallel machines:
- virtualized multidimensional processor grids
- topology-aware mapping and collective communication
- performance-model-driven decomposition done at runtime
- optimized redistribution kernels for tensor transposition

[Plot: BG/Q matrix multiplication, Teraflop/s vs #cores, CTF vs ScaLAPACK.]

45 Performance of CTF for sparse computations

MP3 leveraging sparse-dense tensor contractions:
[Plots: weak and strong scaling of MP3 (m=40, n=160 on 24 cores), seconds/iteration vs #cores, for dense and 16%/8%/4%/2%/1% sparse tensors.]

All-pairs shortest-paths based on path doubling with sparsification:
[Plots: weak and strong scaling of APSP (n=2K on 24 cores), seconds vs #cores, regular vs sparse path doubling.]

46 Post-Hartree-Fock (HF) methods

Accurate models of electronic correlation require approximation of the contributions of excited-state transitions:
- Møller-Plesset methods provide perturbative corrections
- coupled-cluster methods (CCSD, CCSDT, CCSDTQ) iteratively solve (2nd, 3rd, 4th) order equations in the state space
- tensor expressions naturally express high-order transitions
- tensor structure admits symmetries and sparsity: permutational index antisymmetry due to the antisymmetry of the wavefunction; sparsity due to the strength of interactions diminishing with growing distance in the molecular orbital basis

47 CCSD using CTF

Extracted from Aquarius (Devin Matthews' code):

FMI["mi"] += 0.5*WMNEF["mnef"]*T(2)["efin"];
WMNIJ["mnij"] += 0.5*WMNEF["mnef"]*T(2)["efij"];
FAE["ae"] -= 0.5*WMNEF["mnef"]*T(2)["afmn"];
WAMEI["amei"] -= 0.5*WMNEF["mnef"]*T(2)["afin"];
Z(2)["abij"] = WMNEF["ijab"];
Z(2)["abij"] += FAE["af"]*T(2)["fbij"];
Z(2)["abij"] -= FMI["ni"]*T(2)["abnj"];
Z(2)["abij"] += 0.5*WABEF["abef"]*T(2)["efij"];
Z(2)["abij"] += 0.5*WMNIJ["mnij"]*T(2)["abmn"];
Z(2)["abij"] -= WAMEI["amei"]*T(2)["ebmj"];

Other electronic structure codes using CTF include QChem (via Libtensor) and VASP.

48 Comparison with NWChem

NWChem is the most commonly-used distributed-memory quantum chemistry method suite:
- provides CCSD and CCSDT
- uses Global Arrays, a Partitioned Global Address Space (PGAS) backend, for tensor contractions
- derives equations via the Tensor Contraction Engine (TCE)

[Plots: strong scaling of CCSD and CCSDT on Edison, seconds vs #nodes, NWChem vs Aquarius-CTF for systems of 2 to 20 water molecules.]

49 Coupled cluster on IBM BlueGene/Q and Cray XC30

- CCSD up to 55 (50) water molecules with cc-pVDZ
- CCSDT up to 10 water molecules with cc-pVDZ

[Plots: weak scaling of Aquarius-CTF CCSD and CCSDT on BlueGene/Q and Edison, Teraflops and Gigaflops/node vs #nodes.]

50 Future work
- symmetry preserving algorithms: high-performance implementations
- bilinear algorithm complexity: fast matrix multiplication
- sparsity in tensor computations: handling multiple sparse operands and sparse output; worst-case lower bounds and efficient algorithms
- tensor algorithms: most algorithms correspond to multiple dependent tensor operations; scheduling, blocking, and decomposition of multiple tensor operations; programming abstractions for tensor factorizations
- application-driven development: tensor decompositions, sparsity, and symmetry are all motivated by electronic structure applications; many further applications in tensor networks (DMRG), machine learning, etc.

51 Backup slides

52 Symmetry preserving algorithm vs Strassen's algorithm

[Plot: speed-up over the classical direct evaluation algorithm, $[n^\omega/(s!t!v!)]$ / #multiplications, vs $n^s/s!$ (matrix dimension), for Strassen's algorithm and the symmetry preserving algorithm with ω=6 and ω=3 (s=t=v=ω/3).]

53 Nesting of bilinear algorithms

Given two bilinear algorithms
$\Lambda_1 = (F_1^{(A)}, F_1^{(B)}, F_1^{(C)})$ and $\Lambda_2 = (F_2^{(A)}, F_2^{(B)}, F_2^{(C)})$,
we can nest them by computing their tensor product:
$\Lambda_1 \otimes \Lambda_2 := (F_1^{(A)} \otimes F_2^{(A)},\; F_1^{(B)} \otimes F_2^{(B)},\; F_1^{(C)} \otimes F_2^{(C)})$
$\mathrm{rank}(\Lambda_1 \otimes \Lambda_2) = \mathrm{rank}(\Lambda_1) \cdot \mathrm{rank}(\Lambda_2)$
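
A minimal sketch of the Kronecker (matrix tensor) product used to form each defining matrix of the nested algorithm (dense row-major storage is an illustrative assumption):

#include <vector>

// C = A (x) B, where A is m-by-n and B is p-by-q, giving an (m*p)-by-(n*q)
// matrix; nesting applies this to each pair of defining matrices of the
// two bilinear algorithms
std::vector<double> kron(int m, int n, const std::vector<double>& A,
                         int p, int q, const std::vector<double>& B) {
  std::vector<double> C(m * p * n * q);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < p; ++k)
        for (int l = 0; l < q; ++l)
          C[(i*p + k)*(n*q) + (j*q + l)] = A[i*n + j] * B[k*q + l];
  return C;
}

The rank identity then follows directly: if the defining matrices of the two algorithms have R1 and R2 columns, the nested defining matrices have R1*R2 columns.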

54 Block-cyclic algorithm for s-step methods

For s steps of a $(2m+1)^d$-point stencil with block-size $H^{1/d}/m$:
$W_{Kr} = O\!\left(\frac{msn^d}{H^{1/d}p}\right)$, $S_{Kr} = O(sn^d/(pH))$, $Q_{Kr} = O\!\left(\frac{msn^d}{H^{1/d}p}\right)$

These are good when $H = \Theta(n^d/p)$, so the algorithm is useful when the cache size is a bit smaller than $n^d/p$.

55 2.5D LU on MIC

[Plot: LU factorization strong scaling on Stampede (MIC + Sandy Bridge), Gigaflop/s/node vs #nodes, for 2.5D hybrid, 2D hybrid, and 2D pure-CPU LU at n = 131,072 and n = 65,536.]

56 Topology-aware mapping on BG/Q

[Plot: LU factorization strong scaling on Mira (BG/Q), n = 65,536, Gigaflop/s/node vs #nodes, 2D LU with custom mapping vs default mapping.]

57 Symmetric matrix representation

58 Blocked distributions of a symmetric matrix

59 Cyclic distribution of a symmetric matrix

60 Our CCSD factorization

Credit to John F. Stanton and Jurgen Gauss:

$\tau^{ab}_{ij} = t^{ab}_{ij} + \frac{1}{2} P^a_b P^i_j\, t^a_i t^b_j$

$F^m_e = f^m_e + \sum_{fn} v^{mn}_{ef} t^f_n$

$F^a_e = (1-\delta_{ae}) f^a_e - \sum_m F^m_e t^a_m - \frac{1}{2}\sum_{mnf} v^{mn}_{ef} t^{af}_{mn} + \sum_{fn} v^{an}_{ef} t^f_n$

$F^m_i = (1-\delta_{mi}) f^m_i + \sum_e F^m_e t^e_i + \frac{1}{2}\sum_{nef} v^{mn}_{ef} t^{ef}_{in} + \sum_{fn} v^{mn}_{if} t^f_n$

61 Our CCSD factorization

$W^{mn}_{ei} = v^{mn}_{ei} + \sum_f v^{mn}_{ef} t^f_i$

$W^{mn}_{ij} = v^{mn}_{ij} + P^i_j \sum_e v^{mn}_{ie} t^e_j + \frac{1}{2}\sum_{ef} v^{mn}_{ef} \tau^{ef}_{ij}$

$W^{am}_{ie} = v^{am}_{ie} - \sum_n W^{mn}_{ei} t^a_n + \sum_f v^{ma}_{ef} t^f_i + \frac{1}{2}\sum_{nf} v^{mn}_{ef} t^{af}_{in}$

$W^{am}_{ij} = v^{am}_{ij} + P^i_j \sum_e v^{am}_{ie} t^e_j + \frac{1}{2}\sum_{ef} v^{am}_{ef} \tau^{ef}_{ij}$

$z^a_i = f^a_i - \sum_m F^m_i t^a_m + \sum_e f^a_e t^e_i + \sum_{em} v^{ma}_{ei} t^e_m + \sum_{em} F^m_e t^{ae}_{im} - \frac{1}{2}\sum_{emn} W^{mn}_{ei} t^{ea}_{mn} + \frac{1}{2}\sum_{efm} v^{am}_{ef} \tau^{ef}_{im}$

$z^{ab}_{ij} = v^{ab}_{ij} + P^i_j \sum_e v^{ab}_{ie} t^e_j + P^a_b \sum_e F^a_e t^{eb}_{ij} - P^i_j \sum_m F^m_i t^{ab}_{mj} + P^a_b P^i_j \sum_{em} W^{am}_{ie} t^{eb}_{mj} - P^a_b \sum_m W^{am}_{ij} t^b_m + \frac{1}{2}\sum_{ef} v^{ab}_{ef} \tau^{ef}_{ij} + \frac{1}{2}\sum_{mn} W^{mn}_{ij} \tau^{ab}_{mn}$

62 Stability of symmetry preserving algorithms

63 Performance breakdown on BG/Q

Performance data for a CCSD iteration with 200 electrons and 1000 orbitals on 4096 nodes of Mira (4 processes per node, 16 threads per process; total time: 18 mins; v: orbitals, o: electrons):

kernel         | % of time | complexity               | architectural bounds
DGEMM          | 45%       | O(v^4 o^2 / p)           | flops/mem bandwidth
broadcasts     | 20%       | O(v^4 o^2 / (p sqrt(M))) | multicast bandwidth
prefix sum     | 10%       | O(p)                     | allreduce bandwidth
data packing   | 7%        | O(v^2 o^2 / p)           | integer ops
all-to-all-v   | 7%        | O(v^2 o^2 / p)           | bisection bandwidth
tensor folding | 4%        | O(v^2 o^2 / p)           | memory bandwidth

64 Tiskin's path doubling algorithm

Tiskin gives a way to do path doubling in $F = O(n^3/p)$ operations. We can partition each $A^k$ by path size (number of edges):
$A^k = I \oplus A^k(1) \oplus A^k(2) \oplus \dots \oplus A^k(k)$
where each $A^k(l)$ contains the shortest paths of up to $k$ edges which have exactly $l$ edges.

We can see that $A^l(l) \supseteq A^{l+1}(l) \supseteq \dots \supseteq A^n(l) = A^\star(l)$; in particular, $A^\star(l)$ corresponds to a sparse subset of $A^l(l)$.

The algorithm works by picking $l \in [k/2, k]$ and computing $(I \oplus A)^{3k/2} = (I \oplus A^k(l)) \odot A^k$, which finds all paths of size up to $3k/2$ by taking all paths of size exactly $l \geq k/2$ followed by all paths of size up to $k$.
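
A minimal sketch of the tropical (min-plus) semiring product underlying each path-doubling step, on dense adjacency matrices (illustrative; INF marks absent edges):

#include <vector>
#include <algorithm>
#include <limits>

const double INF = std::numeric_limits<double>::infinity();

// Tropical semiring "matrix multiply": C_ij = min_k (A_ik + B_kj).
// Squaring the adjacency matrix this way doubles the number of edges
// covered by the shortest paths found so far.
void minplus_multiply(int n, const std::vector<double>& A,
                      const std::vector<double>& B, std::vector<double>& C) {
  C.assign(n * n, INF);
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k) {
      if (A[i*n + k] == INF) continue;   // skip absent intermediate edges
      for (int j = 0; j < n; ++j)
        C[i*n + j] = std::min(C[i*n + j], A[i*n + k] + B[k*n + j]);
    }
}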
