Algorithms as Multilinear Tensor Equations
|
|
- Baldwin Black
- 5 years ago
- Views:
Transcription
1 Algorithms as Multilinear Tensor Equations Edgar Solomonik Department of Computer Science ETH Zurich Georgia Institute of Technology, Atlanta GA, USA February 18, 2016 Edgar Solomonik Algorithms as Multilinear Tensor Equations 1/40
2 Pervasive paradigms in scientific computing What commonalities exist among resource-intensive computations in simulation and data analysis? multidimensional datasets (observations, discretizations) higher-order relations between datasets, i.e. equations, maps, graphs, hypergraphs sparsity and symmetry in structure of relations relations lead to solution directly or by acting as an evolutionary (iterative) criterion algebraic descriptions of datasets and relations Edgar Solomonik Algorithms as Multilinear Tensor Equations 2/40
3 Pervasive paradigms in scientific computing What type of abstractions are desirable in high performance computing? data abstractions should reflect native dimensionality and structure global functional abstractions should efficiently orchestrate communication and synchronization abstractions should enable development of provably efficient algorithms Edgar Solomonik Algorithms as Multilinear Tensor Equations 3/40
4 Outline Introduction to tensor computations Symmetry-preserving tensor algorithms Communication-avoiding parallel algorithms A massively-parallel tensor framework Applications to electronic structure calculations Conclusion Edgar Solomonik Algorithms as Multilinear Tensor Equations 4/40
5 Basic data (vector) parallelism Edgar Solomonik Algorithms as Multilinear Tensor Equations 5/40
6 Basic data (vector) parallelism c i = f (a i ) Edgar Solomonik Algorithms as Multilinear Tensor Equations 6/40
7 Basic data (vector) parallelism c i = f (a i ) Edgar Solomonik Algorithms as Multilinear Tensor Equations 6/40
8 Dense matrix operators c = Ab Edgar Solomonik Algorithms as Multilinear Tensor Equations 7/40
9 Sparse matrix operators c = Ab Edgar Solomonik Algorithms as Multilinear Tensor Equations 8/40
10 Sparse matrix operators c i = j A ij b j Edgar Solomonik Algorithms as Multilinear Tensor Equations 8/40
11 Low-rank matrix factorizations A = UV c = UVb Edgar Solomonik Algorithms as Multilinear Tensor Equations 9/40
12 Low-rank matrix factorizations A ij = k U ik V kj c i = j,k U ik V kj b j Edgar Solomonik Algorithms as Multilinear Tensor Equations 9/40
13 Low-rank matrix factorizations c = UΣVb Edgar Solomonik Algorithms as Multilinear Tensor Equations 9/40
14 Tensor operator d i = j,k A ijk b j c k Edgar Solomonik Algorithms as Multilinear Tensor Equations 9/40
15 Tensor operator d i = j,k A ijk b j c k Edgar Solomonik Algorithms as Multilinear Tensor Equations 10/40
16 Low-rank tensor factorizations (CP) A ijk = l U il V jl W kl d i = j,k,l U il V jl W kl b j c k Edgar Solomonik Algorithms as Multilinear Tensor Equations 11/40
17 Low-rank tensor factorizations (CP) A ijk = l U il V jl W kl d i = j,k,l U il V jl W kl b j c k Edgar Solomonik Algorithms as Multilinear Tensor Equations 11/40
18 Low-rank tensor factorizations (Tucker) A ijk = T lmn U il V jm W kn d i = T lmn U il V jm W jn b j c k l,m,n j,k,l,m,n Edgar Solomonik Algorithms as Multilinear Tensor Equations 12/40
19 Low-rank tensor factorizations (Tucker) A jk i = T lmn U il V jm W kn d i = T lmn U il V jm W kn b j c k l,m,n j,k,l,m,n Edgar Solomonik Algorithms as Multilinear Tensor Equations 12/40
20 Low-rank tensor factorizations (TT) A ijk = l,m U il V ljm W mk U il V ljm W mk b j c k j,k,l,m Edgar Solomonik Algorithms as Multilinear Tensor Equations 13/40
21 Low-rank tensor factorizations (TT) A ijk = l,m U il V ljm W mk U il V ljm W mk b j c k j,k,l,m Edgar Solomonik Algorithms as Multilinear Tensor Equations 13/40
22 Tensor contractions in electronic structure methods Amplitude equation snippet from coupled cluster doubles model 0 = V abij + T abik F kj + T abkl V klmn T mnij +... k k,l,m,n Tensor hypercontraction representation a T abij = l,m ψ al φ jl Z lm φ jm ψ bm a Hohenstein, Parrish, Sherrill, Martinez, JCP, 2012 Edgar Solomonik Algorithms as Multilinear Tensor Equations 14/40
23 Structured tensor computations Challenges for matrix/tensor algebraic abstractions data and relation sparsity tensor sparsity low-order representations of data tensor decompositions implicitly defined relations implicit tensor representations data and relation equivalences tensor symmetries Edgar Solomonik Algorithms as Multilinear Tensor Equations 15/40
24 Exploiting symmetry in tensors Tensor symmetry (e.g. A ij = A ji ) reduces memory and cost for order d tensor, d! less memory dot product i,j A ijb ij = 2 i<j A ijb ij + i A iib ii matrix-vector multiplication 1 c i = A ij b j = ( A ij (b i + b j ) A ij )b i j j j rank-2 vector outer product 1 C ij = a i b j + a j b i = (a i + a j )(b i + b j ) a i b i a j b j squaring a symmetric matrix (or AB + BA) 1 C ij = k A ik A kj = k (A ik + A kj + A ij ) 2... for order ω contraction, ω! fewer multiplications 1 1 S., Demmel; Technical Report, ETH Zurich, Edgar Solomonik Algorithms as Multilinear Tensor Equations 16/40
25 Symmetry preserving algorithms By exploiting symmetry, we can reduce the number of multiplications at the cost of more additions 2 partially symmetric contractions symmetry preserving algorithm can be nested over each index group reduction in multiplications implies reduction in nested calls cost reductions: 1.3 for CCSD, 2.1 for CCSDT algorithms generalize to most antisymmetric tensor contractions for Hermitian tensors, multiplication cost 3X more than addition BLAS routines: hemm and her2k as well as LAPACK routines like hetrd (tridiagonal reduction) may be done with 25% fewer operations achieves (2/3)n 3 bilinear rank for squaring a nonsymmetric matrix, assuming elementwise commutativity allows blocking of symmetric contractions into smaller (anti)symmetric contractions 2 S., Demmel; Technical Report, ETH Zurich, Edgar Solomonik Algorithms as Multilinear Tensor Equations 17/40
26 Beyond computation cost Algorithms should be not only work-efficient but communication-efficient data movement and synchronization cost more energy than flops two types of data movement: vertical (intranode memory cache) and horizontal (internode network transfers) parallel algorithm design involves tradeoffs between computation, communication, and synchronization lower bounds and parameterized algorithms provide optimal solutions within a well-defined tuning space Edgar Solomonik Algorithms as Multilinear Tensor Equations 18/40
27 Cost model for parallel algorithms Given a schedule that specifies all work and communication tasks on p processors, we consider the following costs, accumulated along chains of tasks (as in α β, BSP, and LogGP models), F computation cost Q vertical communication cost W horizontal communication cost S synchronization cost Edgar Solomonik Algorithms as Multilinear Tensor Equations 19/40
28 Communication lower bounds: previous work Multiplication of n n matrices horizontal communication lower bound 3 W MM = Ω ( n 2 p 2/3 ) memory-dependent ( ) horizontal communication lower bound 4 W MM = Ω n 3 p M with M = cn 2 /p memory, can hope to obtain W = O(n 2 / cp) communication standard parallel libraries (ScaLAPACK, Elemental) optimal only for c = 1 3 Aggarwal, Chandra, Snir, TCS, Irony, Toledo, Tiskin, JPDC, 2004 Edgar Solomonik Algorithms as Multilinear Tensor Equations 20/40
29 Communication-efficient matrix multiplication Communication-optimal algorithms for matrix multiplication have been studied extensively 5 They continue to be attractive on modern architectures 6 Percentage of machine peak D MM on BG/P (n=65,536) 2.5D SUMMA 2D SUMMA ScaLAPACK PDGEMM Gigaflop/s/node Matrix multiplication strong scaling on Mira (BG/Q) 2.5D MM n=65,536 2D MM n=65, D MM n=16,384 2D MM n=16, #nodes #nodes 5 Berntsen, Par. Comp., 1989; Agarwal, Chandra, Snir, TCS, 1990; Agarwal, Balle, Gustavson, Joshi, Palkar, IBM, 1995; McColl, Tiskin, Algorithmica, 1999;... 6 S., Bhatele, Demmel, SC, 2011 Edgar Solomonik Algorithms as Multilinear Tensor Equations 21/40
30 Synchronization cost lower bounds Unlike matrix multiplication, many algorithms in numerical linear algebra have polynomial depth synchronization cost bounds 7 S MM = Θ ( n 3 pm 3/2 ) algorithms for Cholesky, LU, QR, SVD have additional dependencies lowering computation and communication costs, requires additional synchronization 7 Ballard, Demmel, Holtz, Schwartz, SIAM JMAA, 2011 Edgar Solomonik Algorithms as Multilinear Tensor Equations 22/40
31 Tradeoffs in the diamond DAG For the n n diamond DAG, there is a tradeoff between computation and synchronization costs 8 F S = Ω(n 2 ) We generalize such tradeoffs to consider horizontal communication and arbitrary (polynomial or exponential) interval expansion 9 8 Papadimitriou, Ullman, SIAM JC, S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016) Edgar Solomonik Algorithms as Multilinear Tensor Equations 23/40
32 Tradeoffs involving synchronization We apply tradeoff lower bounds to dense linear algebra algorithms, represented via dependency hypergraphs. a For triangular solve with an n n matrix For Cholesky of an n n matrix F TRSV S TRSV = Ω ( n 2) F CHOL S 2 CHOL = Ω ( n 3) W CHOL S CHOL = Ω ( n 2) Proof employs classical Loomis-Whitney inequality. For any R N N N, three projections of R onto N N have total size at least R 2/3 a S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016) Edgar Solomonik Algorithms as Multilinear Tensor Equations 24/40
33 Communication-efficient LU factorization For any c [1, p 1/3 ], use cn 2 /p memory per processor and obtain W LU = O(n 2 / cp), S LU = O( cp) Percentage of machine peak LU with tournament pivoting on BG/P (n=65,536) ideal scaling 2.5D LU 2D LU ScaLAPACK PDGETRF #nodes LU with pairwise pivoting 10 extended to tournament pivoting 11 first implementation of a communication-optimal LU algorithm 10 Tiskin, FGCS, S., Demmel, Euro-Par, 2011 Edgar Solomonik Algorithms as Multilinear Tensor Equations 25/40
34 Communication-efficient QR factorization W QR = O(n 2 / cp), S QR = O( cp) using Givens rotations 12 Householder form can be reconstructed quickly from TSQR 13 optimal QR communication and synchronization (modulo log factors) costs can be obtained with Householder representation 14 Householder aggregation yields performance improvements QR weak scaling on Hopper (15K-by-15K to 131K-by-131K) Teraflops Two-Level CAQR-HR Two-Level Householder CAQR-HR Elemental QR ScaLAPACK QR Scatter-Apply CAQR Binary-Tree CAQR #cores 12 Tiskin, FGCS, Ballard, Demmel, Grigori, Jacquelin, Nguyen, Diep, S., IPDPS, S., UCB, 2014 Edgar Solomonik Algorithms as Multilinear Tensor Equations 26/40
35 Communication-efficient eigenvalue computation For the dense symmetric matrix eigenvalue problem W SE = O(n 2 / cp), S QR = O( cp log 2 p) 15 optimal horizontal communication can be obtained with left-looking algorithm and aggregation, however, requires more vertical communication successive band reduction can be used to minimize both communication costs 15 S., UCB, S., Hoefler, Demmel, in preparation Edgar Solomonik Algorithms as Multilinear Tensor Equations 27/40
36 Synchronization tradeoffs in stencils Our lower bound analysis extends also to sparse iterative methods. 16 For computing s applications of a (2m + 1) d -point stencil F St S d St = Ω ( m 2d s d+1) W St S d 1 St = Ω (m d s d) proof requires generalization of Loomis-Whitney inequality to order d set and order d 1 projections time-blocking lowers synchronization and vertical communication costs, but raises horizontal communication we suggest alternative approach that minimizes vertical and horizontal communication, but not synchronization 16 S., Carson, Knight, Demmel, SPAA 2014 (extended version, JPDC 2016) Edgar Solomonik Algorithms as Multilinear Tensor Equations 28/40
37 Beyond the Loomis-Whitney inequalities Loomis-Whitney inequalities are not sufficient for all computations symmetry preserving tensor contraction algorithms have arbitrary order projections from order d set bilinear algorithms 17 provide a more general framework a bilinear algorithm is defined by matrices F (A), F (B), F (C), c = F (C) [(F (A)T a) (F (B)T b)] where is the Hadamard (pointwise) product communication lower bounds can be formulated in terms of rank Pan, Springer, S., Hoefler, Demmel, in preparation Edgar Solomonik Algorithms as Multilinear Tensor Equations 29/40
38 Communication cost of symmetry preserving algorithms For contraction of order s + v tensor with order v + t tensor Υ is the nonsymmetric contraction algorithm Ψ is the best previously known algorithm Φ is the symmetry preserving algorithm Asymptotic communication lower bounds based on bilinear expansion a (H cache size, p #processors, n dimension): s t v F Υ F Ψ F Φ Q Υ,Ψ Q Φ W Υ W Ψ W Φ n 2 n 2 n2 2 n 2 n 2 n n 3 n3 n n 4 n n 3 n 3 n n 4 n n 6 n6 8 p 1/2 n p 1/2 n p 1/2 6 n 3 n 3 n n 2 n 2 p 2/3 p 2/3 n 4 24 n 4 n 4 n2 n 2 n 2 p 1/2 p 1/2 p 1/2 6 n 4 24 n n 3 H 1/2 n 3 H 1/2 n 2 p 2/3 n 2 p 2/3 n 2 p 2/3 n 4 n 4 n 2 n 2 n3 H 1/2 H 1/3 p 3/4 n 6 n 6 n 4 n 4 n 4 H 1/2 H 1/2 p 2/3 p 2/3 p 2/3 a S., Hoefler, Demmel, ETHZ, 2014 Edgar Solomonik Algorithms as Multilinear Tensor Equations 30/40
39 Open theoretical problems lower bounds for multiplication of a sparse and a dense matrix lower bounds for nested bilinear algorithms broader parameterizations of algorithmic representations needed for QR and SVD lower bounds Edgar Solomonik Algorithms as Multilinear Tensor Equations 31/40
40 Tensor algebra as a programming abstraction Cyclops Tensor Framework 19 contraction/summation/functions of tensors distributed symmetric-packed/sparse storage via cyclic layout parallelization via MPI+OpenMP(+CUDA) 19 S., Hammond, Demmel, UCB, S., Matthews, Hammond, Demmel, IPDPS, 2013 Edgar Solomonik Algorithms as Multilinear Tensor Equations 32/40
41 Tensor algebra as a programming abstraction Cyclops Tensor Framework contraction/summation/functions of tensors distributed symmetric-packed/sparse storage via cyclic layout parallelization via MPI+OpenMP(+CUDA) Jacobi iteration example code snippet void Jacobi ( Matrix <> A, Vector <> b, int n){ Matrix <> R(A); R["ii"] = 0.0; Vector <> x(n), d(n), r(n); Function <> inv ([]( double & d){ return 1./d; }); d["i"] = inv (A["ii"]); // set d to inverse of diagonal of A do { x["i"] = d["i"]*(b["i"]-r["ij"]*x["j" ]); r["i"] = b["i"]-a["ij"]*x["j"]; // compute residual } while ( r. norm2 () > 1. E -6); // check for convergence } Edgar Solomonik Algorithms as Multilinear Tensor Equations 32/40
42 Tensor algebra as a programming abstraction Cyclops Tensor Framework contraction/summation/functions of tensors distributed symmetric-packed/sparse storage via cyclic layout parallelization via MPI+OpenMP(+CUDA) Møller-Plesset perturbation theory (MP3) code snippet Z[" abij "] += Fab ["af"]*t[" fbij "]; Z[" abij "] -= Fij ["ni"]*t[" abnj "]; Z[" abij "] += 0.5 * Vabcd [" abef "]*T[" efij "]; Z[" abij "] += 0.5 * Vijkl [" mnij "]*T[" abmn "]; Z[" abij "] -= Vaibj [" amei "]*T[" ebmj "]; Edgar Solomonik Algorithms as Multilinear Tensor Equations 32/40
43 Betweenness centrality Betweenness centrality code snippet, for k of n nodes void btwn_central ( Matrix <int > A, Matrix <path > P, int n, int k){ Monoid <path > mon (..., []( path a, path b){ if ( a.w <b. w) return a; else if ( b.w <a. w) return b; else return path (a.w, a.m+b.m); },... ); Matrix < path > Q(n,k, mon ); // shortest path matrix Q["ij"] = P["ij"]; Function <int,path > append ([]( int w, path p){ return path (w+p.w, p.m); }; ); } for ( int i =0; i<n; i ++) Q["ij"] = append (A["ik"],Q["kj"]);... Edgar Solomonik Algorithms as Multilinear Tensor Equations 33/40
44 Performance of CTF for dense computations CTF is highly tuned for massively-parallel machines virtualized multidimensional processor grids topology-aware mapping and collective communication performance-model-driven decomposition done at runtime optimized redistribution kernels for tensor transposition Teraflop/s CTF Scalapack BG/Q matrix multiplication #cores Edgar Solomonik Algorithms as Multilinear Tensor Equations 34/40
45 Performance of CTF for sparse computations seconds/iteration MP3 leveraging sparse-dense tensor contractions a Weak scaling of MP3 (m=40, n=160 on 24 cores) dense 16% sparse 8% sparse 4% sparse 2% sparse 28 1% sparse #cores seconds/iteration Strong scaling of MP3 with m=40, n=160 dense 16% sparse 8% sparse 4% sparse 2% sparse 1% sparse #cores All-pairs shortest-paths based on path doubling with sparsification seconds Weak scaling of APSP (n=2k on 24 cores) regular path doubling sparse path doubling seconds Strong scaling of APSP with n=2k regular path doubling sparse path doubling #cores #cores Edgar Solomonik Algorithms as Multilinear Tensor Equations 35/40
46 Post-Hartree-Fock (HF) methods Accurate models of electronic correlation require approximation of contributions of excited-state transitions. Møller-Plesset methods provide perturbative corrections Coupled-cluster methods (CCSD, CCSDT, CCSDTQ) iteratively solve (2nd, 3rd, 4th) order equations in the state space tensor expressions naturally express high-order transitions tensor structure admits symmetries and sparsity permutational index antisymmetry due to antisymmetry of wavefunction sparsity due to strength of interactions diminishing with growing distance in the molecular orbital basis Edgar Solomonik Algorithms as Multilinear Tensor Equations 36/40
47 CCSD using CTF Extracted from Aquarius (Devin Matthews code, FMI ["mi"] += 0.5 * WMNEF [" mnef "]*T (2)[ " efin "]; WMNIJ [" mnij "] += 0.5 * WMNEF [" mnef "]*T (2)[ " efij "]; FAE ["ae"] -= 0.5 * WMNEF [" mnef "]*T (2)[ " afmn "]; WAMEI [" amei "] -= 0.5 * WMNEF [" mnef "]*T (2)[ " afin "]; Z (2)[ " abij "] = WMNEF [" ijab "]; Z (2)[ " abij "] += FAE ["af"]*t (2)[ " fbij "]; Z (2)[ " abij "] -= FMI ["ni"]*t (2)[ " abnj "]; Z (2)[ " abij "] += 0.5 * WABEF [" abef "]*T (2)[ " efij "]; Z (2)[ " abij "] += 0.5 * WMNIJ [" mnij "]*T (2)[ " abmn "]; Z (2)[ " abij "] -= WAMEI [" amei "]*T (2)[ " ebmj "]; Other electronic structure codes using CTF include QChem (via Libtensor) and VASP Edgar Solomonik Algorithms as Multilinear Tensor Equations 37/40
48 Comparison with NWChem NWChem is the most commonly-used distributed-memory quantum chemistry method suite provides CCSD and CCSDT uses Global Arrays a Partitioned Global Address Space (PGAS) backend for tensor contractions derives equations via Tensor Contraction Engine (TCE) Strong scaling CCSD on Edison Strong scaling CCSDT on Edison seconds NWChem w20 w15 w10 w8 Aquarius-CTF w20 w15 w10 w8 seconds NWChem w3 w2 Aquarius-CTF w4 w3 w #nodes #nodes Edgar Solomonik Algorithms as Multilinear Tensor Equations 38/40
49 Coupled cluster on IBM BlueGene/Q and Cray XC30 CCSD up to 55 (50) water molecules with cc-pvdz CCSDT up to 10 water molecules with cc-pvdz a Teraflops Weak scaling on BlueGene/Q Aquarius-CTF CCSD Aquarius-CTF CCSDT #nodes Gigaflops/node Weak scaling on BlueGene/Q Aquarius-CTF CCSD Aquarius-CTF CCSDT #nodes Teraflops Weak scaling on Edison Aquarius-CTF CCSD Aquarius-CTF CCSDT Gigaflops/node #nodes #nodes Edgar Solomonik Algorithms as Multilinear Tensor Equations 39/ Weak scaling on Edison Aquarius-CTF CCSD Aquarius-CTF CCSDT
50 Future work symmetry preserving algorithms high-performance implementations bilinear algorithm complexity fast matrix multiplication sparsity in tensor computations handling multiple sparse operands and sparse output worst-case lower bounds and efficient algorithms tensor algorithms most algorithms correspond to multiple dependent tensors operations scheduling, blocking, and decomposition of multiple tensor operations programming abstractions for tensor factorizations application-driven development tensor decompositions, sparsity, symmetry all motivated by electronic structure applications many further applications in tensor networks (DMRG), machine learning, etc. Edgar Solomonik Algorithms as Multilinear Tensor Equations 40/40
51 Backup slides Edgar Solomonik Algorithms as Multilinear Tensor Equations 41/40
52 Symmetry preserving algorithm vs Strassen s algorithm [n ω /(s!t!v!)] / #multiplications (speed-up over classical direct evaluation alg.) Symmetry preserving alg. vs Strassen s alg. (s=t=v=ω/3) Strassen s algorithm Sym. preserving ω=6 Sym. preserving ω= e+06 n s /s! (matrix dimension) Edgar Solomonik Algorithms as Multilinear Tensor Equations 42/40
53 Nesting of bilinear algorithms Given two bilinear algorithms: Λ 1 =(F (A) 1, F(B) 1, F(C) 1 ) Λ 2 =(F (A) 2, F(B) 2, F(C) 2 ) We can nest them by computing their tensor product Λ 1 Λ 2 :=(F (A) 1 F (A) 2, F(B) 1 F (B) 2, F(C) 1 F (C) 2 ) rank(λ 1 Λ 2 ) = rank(λ 1 ) rank(λ 2 ) Edgar Solomonik Algorithms as Multilinear Tensor Equations 43/40
54 Block-cyclic algorithm for s-step methods For s-steps of a (2m + 1) d -point stencil with block-size of H 1/d /m, ( ) msn d W Kr = O H 1/d p S Kr = O(sn d /(ph)) ( ) msn d Q Kr = O H 1/d p which are good when H = Θ(n d /p), so the algorithm is useful when the cache size is a bit smaller than n d /p Edgar Solomonik Algorithms as Multilinear Tensor Equations 44/40
55 2.5D LU on MIC Gigaflop/s/node LU factorization strong scaling on Stampede (MIC + Sandy Bridge) 2.5D hybrid LU n=131,072 2D hybrid LU n=131, D pure-cpu LU n=131, D hybrid LU n=65,536 2D hybrid LU n=65, D pure-cpu LU n=65, #nodes Edgar Solomonik Algorithms as Multilinear Tensor Equations 45/40
56 Topology-aware mapping on BG/Q LU factorization strong scaling on Mira (BG/Q), n=65,536 2D LU, custom mapping 2D LU, default mapping Gigaflop/s/node #nodes Edgar Solomonik Algorithms as Multilinear Tensor Equations 46/40
57 Symmetric matrix representation Edgar Solomonik Algorithms as Multilinear Tensor Equations 47/40
58 Blocked distributions of a symmetric matrix Edgar Solomonik Algorithms as Multilinear Tensor Equations 48/40
59 Cyclic distribution of a symmetric matrix Edgar Solomonik Algorithms as Multilinear Tensor Equations 49/40
60 Our CCSD factorization Credit to John F. Stanton and Jurgen Gauss τ ab ij = t ab ij Pa b Pi j t a i t b j, F m e = f m e + fn v mn ef t f n, F a e = (1 δ ae )f a e m F i m = (1 δ mi )fi m + e F m e t a m 1 2 F m e t e i mnf nef vef mn tmn af + fn vef mn tin ef + fn v an ef tf n, v mn if t f n, Edgar Solomonik Algorithms as Multilinear Tensor Equations 50/40
61 Our CCSD factorization W ei mn = vei mn + f v mn ef t f i, W ij mn = vij mn + Pj i W ie am = vie am n W ij am = vij am + Pj i zi a = fi a m 1 2 emn zij ab = vij ab + Pj i + P a b e e e v mn ie t e j W ei mn tn a + f v am ie t e j F i m tm a + e W ei mn tmn, ea e F a e t eb ij v ab ie t e j P i j f a e t e i + P a bp i j m ef vef mn τij ef, v ma ef t f i ef vef am τij ef, + em me F i m tmj ab nf vef mn tin af, vei ma tm e + v ae F im e m em W ie am tmj eb Pb a m ef v ab ef τ ef ij W ij am tm b mn efm W ij mn τmn, ab vef am τim ef Edgar Solomonik Algorithms as Multilinear Tensor Equations 51/40
62 Stability of symmetry preserving algorithms Edgar Solomonik Algorithms as Multilinear Tensor Equations 52/40
63 Performance breakdown on BG/Q Performance data for a CCSD iteration with 200 electrons and 1000 orbitals on 4096 nodes of Mira 4 processes per node, 16 threads per process Total time: 18 mins v-orbitals, o-electrons kernel % of time complexity architectural bounds DGEMM 45% O(v 4 o 2 /p) flops/mem bandwidth broadcasts 20% O(v 4 o 2 /p M) multicast bandwidth prefix sum 10% O(p) allreduce bandwidth data packing 7% O(v 2 o 2 /p) integer ops all-to-all-v 7% O(v 2 o 2 /p) bisection bandwidth tensor folding 4% O(v 2 o 2 /p) memory bandwidth Edgar Solomonik Algorithms as Multilinear Tensor Equations 53/40
64 Tiskin s path doubling algorithm Tiskin gives a way to do path-doubling in F = O(n 3 /p) operations. We can partition each A k by path size (number of edges) A k = I A k (1) A k (2)... A k (k) where each A k (l) contains the shortest paths of up to k l edges, which have exactly l edges. We can see that A l (l) A l+1 (l)... A n (l) = A (l), in particular A (l) corresponds to a sparse subset of A l (l). The algorithm works by picking l [k/2, k] and computing (I A) 3k/2 (I A k (l)) A k, which finds all paths of size up to 3k/2 by taking all paths of size exactly l k/2 followed by all paths of size up to k. Edgar Solomonik Algorithms as Multilinear Tensor Equations 54/40
Algorithms as multilinear tensor equations
Algorithms as multilinear tensor equations Edgar Solomonik Department of Computer Science ETH Zurich Technische Universität München 18.1.2016 Edgar Solomonik Algorithms as multilinear tensor equations
More informationAlgorithms as Multilinear Tensor Equations
Algorithms as Multilinear Tensor Equations Edgar Solomonik Department of Computer Science ETH Zurich University of Colorado, Boulder February 23, 2016 Edgar Solomonik Algorithms as Multilinear Tensor Equations
More informationAlgorithms as Multilinear Tensor Equations
Algorithms as Multilinear Tensor Equations Edgar Solomonik Department of Computer Science ETH Zurich University of Illinois March 9, 2016 Algorithms as Multilinear Tensor Equations 1/36 Pervasive paradigms
More informationLeveraging sparsity and symmetry in parallel tensor contractions
Leveraging sparsity and symmetry in parallel tensor contractions Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CCQ Tensor Network Workshop Flatiron Institute,
More informationA Distributed Memory Library for Sparse Tensor Functions and Contractions
A Distributed Memory Library for Sparse Tensor Functions and Contractions Edgar Solomonik University of Illinois at Urbana-Champaign February 27, 2017 Cyclops Tensor Framework https://github.com/solomonik/ctf
More informationLow Rank Bilinear Algorithms for Symmetric Tensor Contractions
Low Rank Bilinear Algorithms for Symmetric Tensor Contractions Edgar Solomonik ETH Zurich SIAM PP, Paris, France April 14, 2016 1 / 14 Fast Algorithms for Symmetric Tensor Contractions Tensor contractions
More informationTowards an algebraic formalism for scalable numerical algorithms
Towards an algebraic formalism for scalable numerical algorithms Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June 22, 2017 Householder
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign Householder Symposium XX June
More informationProvably efficient algorithms for numerical tensor algebra
Provably efficient algorithms for numerical tensor algebra Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley Dissertation Defense Adviser: James Demmel Aug 27, 2014 Edgar Solomonik
More informationScalability and programming productivity for massively-parallel wavefunction methods
Scalability and programming productivity for massively-parallel wavefunction methods Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign American Chemical Society
More informationA communication-avoiding parallel algorithm for the symmetric eigenvalue problem
A communication-avoiding parallel algorithm for the symmetric eigenvalue problem Edgar Solomonik 1, Grey Ballard 2, James Demmel 3, Torsten Hoefler 4 (1) Department of Computer Science, University of Illinois
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationCyclops Tensor Framework
Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More informationStrassen-like algorithms for symmetric tensor contractions
Strassen-like algorithms for symmetric tensor contractions Edgar Solomonik Theory Seminar University of Illinois at Urbana-Champaign September 18, 2017 1 / 28 Fast symmetric tensor contractions Outline
More informationCommunication-avoiding parallel algorithms for dense linear algebra
Communication-avoiding parallel algorithms 1/ 49 Communication-avoiding parallel algorithms for dense linear algebra Edgar Solomonik Department of EECS, C Berkeley June 2013 Communication-avoiding parallel
More information2.5D algorithms for distributed-memory computing
ntroduction for distributed-memory computing C Berkeley July, 2012 1/ 62 ntroduction Outline ntroduction Strong scaling 2.5D factorization 2/ 62 ntroduction Strong scaling Solving science problems faster
More informationCommunication-Avoiding Parallel Algorithms for Dense Linear Algebra and Tensor Computations
Introduction Communication-Avoiding Parallel Algorithms for Dense Linear Algebra and Tensor Computations Edgar Solomonik Department of EECS, C Berkeley February, 2013 Edgar Solomonik Communication-avoiding
More informationScalable numerical algorithms for electronic structure calculations
Scalable numerical algorithms for electronic structure calculations Edgar Solomonik C Berkeley July, 2012 Edgar Solomonik Cyclops Tensor Framework 1/ 73 Outline Introduction Motivation: Coupled Cluster
More informationStrassen-like algorithms for symmetric tensor contractions
Strassen-lie algorithms for symmetric tensor contractions Edgar Solomoni University of Illinois at Urbana-Champaign Scientfic and Statistical Computing Seminar University of Chicago April 13, 2017 1 /
More informationTradeoffs between synchronization, communication, and work in parallel linear algebra computations
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel Department of EECS, UC Berkeley February,
More informationProvably Efficient Algorithms for Numerical Tensor Algebra
Provably Efficient Algorithms for Numerical Tensor Algebra Edgar Solomonik Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2014-170 http://www.eecs.berkeley.edu/pubs/techrpts/2014/eecs-2014-170.html
More informationHybrid static/dynamic scheduling for already optimized dense matrix factorization. Joint Laboratory for Petascale Computing, INRIA-UIUC
Hybrid static/dynamic scheduling for already optimized dense matrix factorization Simplice Donfack, Laura Grigori, INRIA, France Bill Gropp, Vivek Kale UIUC, USA Joint Laboratory for Petascale Computing,
More informationAvoiding Communication in Distributed-Memory Tridiagonalization
Avoiding Communication in Distributed-Memory Tridiagonalization SIAM CSE 15 Nicholas Knight University of California, Berkeley March 14, 2015 Joint work with: Grey Ballard (SNL) James Demmel (UCB) Laura
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik Devin Matthews Jeff Hammond James Demmel Electrical Engineering and Computer
More informationTradeoffs between synchronization, communication, and work in parallel linear algebra computations
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik Erin Carson Nicholas Knight James Demmel Electrical Engineering and Computer Sciences
More informationClassical Computer Science and Quantum Computing: High Performance Computing and Quantum Simulation
Classical Computer Science and Quantum Computing: High Performance Computing and Quantum Simulation Edgar Solomonik Department of Computer Science, University of Illinois at Urbana-Champaign NSF Workshop:
More informationContracting symmetric tensors using fewer multiplications
Research Collection Journal Article Contracting symmetric tensors using fewer multiplications Authors: Solomonik, Edgar; Demmel, James W. Publication Date: 2016 Permanent Link: https://doi.org/10.3929/ethz-a-010345741
More informationUsing BLIS for tensor computations in Q-Chem
Using BLIS for tensor computations in Q-Chem Evgeny Epifanovsky Q-Chem BLIS Retreat, September 19 20, 2016 Q-Chem is an integrated software suite for modeling the properties of molecular systems from first
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik Devin Matthews Jeff Hammond James Demmel Electrical Engineering and Computer
More informationSoftware implementation of correlated quantum chemistry methods. Exploiting advanced programming tools and new computer architectures
Software implementation of correlated quantum chemistry methods. Exploiting advanced programming tools and new computer architectures Evgeny Epifanovsky Q-Chem Septermber 29, 2015 Acknowledgments Many
More informationIntroduction to communication avoiding linear algebra algorithms in high performance computing
Introduction to communication avoiding linear algebra algorithms in high performance computing Laura Grigori Inria Rocquencourt/UPMC Contents 1 Introduction............................ 2 2 The need for
More informationLU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version
1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2
More informationCS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform
CS 598: Communication Cost Analysis of Algorithms Lecture 9: The Ideal Cache Model and the Discrete Fourier Transform Edgar Solomonik University of Illinois at Urbana-Champaign September 21, 2016 Fast
More informationA Computation- and Communication-Optimal Parallel Direct 3-body Algorithm
A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm Penporn Koanantakool and Katherine Yelick {penpornk, yelick}@cs.berkeley.edu Computer Science Division, University of California,
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationAvoiding Communication in Linear Algebra
Avoiding Communication in Linear Algebra Grey Ballard Microsoft Research Faculty Summit July 15, 2014 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation,
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 5 Eigenvalue Problems Section 5.1 Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationQR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment Emmanuel AGULLO (INRIA / LaBRI) Camille COTI (Iowa State University) Jack DONGARRA (University of Tennessee) Thomas HÉRAULT
More informationLU Factorization. LU Decomposition. LU Decomposition. LU Decomposition: Motivation A = LU
LU Factorization To further improve the efficiency of solving linear systems Factorizations of matrix A : LU and QR LU Factorization Methods: Using basic Gaussian Elimination (GE) Factorization of Tridiagonal
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationA preliminary analysis of Cyclops Tensor Framework
A preliminary analysis of Cyclops Tensor Framework Edgar Solomonik Jeff Hammond James Demmel Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2012-29
More informationStrassen s Algorithm for Tensor Contraction
Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,
More informationBindel, Fall 2016 Matrix Computations (CS 6210) Notes for
1 Logistics Notes for 2016-08-26 1. Our enrollment is at 50, and there are still a few students who want to get in. We only have 50 seats in the room, and I cannot increase the cap further. So if you are
More informationDense LU factorization and its error analysis
Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,
More informationA practical view on linear algebra tools
A practical view on linear algebra tools Evgeny Epifanovsky University of Southern California University of California, Berkeley Q-Chem September 25, 2014 What is Q-Chem? Established in 1993, first release
More informationPracticality of Large Scale Fast Matrix Multiplication
Practicality of Large Scale Fast Matrix Multiplication Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz UC Berkeley IWASEP June 5, 2012 Napa Valley, CA Research supported by
More informationCommunication-avoiding LU and QR factorizations for multicore architectures
Communication-avoiding LU and QR factorizations for multicore architectures DONFACK Simplice INRIA Saclay Joint work with Laura Grigori INRIA Saclay Alok Kumar Gupta BCCS,Norway-5075 16th April 2010 Communication-avoiding
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationBLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product
Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8
More informationCommunication Lower Bounds for Programs that Access Arrays
Communication Lower Bounds for Programs that Access Arrays Nicholas Knight, Michael Christ, James Demmel, Thomas Scanlon, Katherine Yelick UC-Berkeley Scientific Computing and Matrix Computations Seminar
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel Laura Grigori Mark Frederick Hoemmen Julien Langou Electrical Engineering and Computer Sciences University of California at
More informationPRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM
Proceedings of ALGORITMY 25 pp. 22 211 PRECONDITIONING IN THE PARALLEL BLOCK-JACOBI SVD ALGORITHM GABRIEL OKŠA AND MARIÁN VAJTERŠIC Abstract. One way, how to speed up the computation of the singular value
More informationParallel Eigensolver Performance on High Performance Computers
Parallel Eigensolver Performance on High Performance Computers Andrew Sunderland Advanced Research Computing Group STFC Daresbury Laboratory CUG 2008 Helsinki 1 Summary (Briefly) Introduce parallel diagonalization
More informationMinimizing Communication in Linear Algebra
inimizing Communication in Linear Algebra Grey Ballard James Demmel lga Holtz ded Schwartz Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2009-62
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationRank revealing factorizations, and low rank approximations
Rank revealing factorizations, and low rank approximations L. Grigori Inria Paris, UPMC January 2018 Plan Low rank matrix approximation Rank revealing QR factorization LU CRTP: Truncated LU factorization
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationSaving Energy in Sparse and Dense Linear Algebra Computations
Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain
More informationElectron Correlation - Methods beyond Hartree-Fock
Electron Correlation - Methods beyond Hartree-Fock how to approach chemical accuracy Alexander A. Auer Max-Planck-Institute for Chemical Energy Conversion, Mülheim September 4, 2014 MMER Summerschool 2014
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More informationFrom Matrix to Tensor. Charles F. Van Loan
From Matrix to Tensor Charles F. Van Loan Department of Computer Science January 28, 2016 From Matrix to Tensor From Tensor To Matrix 1 / 68 What is a Tensor? Instead of just A(i, j) it s A(i, j, k) or
More informationLazy matrices for contraction-based algorithms
Lazy matrices for contraction-based algorithms Michael F. Herbst michael.herbst@iwr.uni-heidelberg.de https://michael-herbst.com Interdisziplinäres Zentrum für wissenschaftliches Rechnen Ruprecht-Karls-Universität
More informationLU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version
LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationLower Bounds on Algorithm Energy Consumption: Current Work and Future Directions. March 1, 2013
Lower Bounds on Algorithm Energy Consumption: Current Work and Future Directions James Demmel, Andrew Gearhart, Benjamin Lipshitz and Oded Schwartz Electrical Engineering and Computer Sciences University
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationSparse linear solvers
Sparse linear solvers Laura Grigori ALPINES INRIA and LJLL, UPMC On sabbatical at UC Berkeley March 2015 Plan Sparse linear solvers Sparse matrices and graphs Classes of linear solvers Sparse Cholesky
More informationPerformance optimization of WEST and Qbox on Intel Knights Landing
Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University
More informationTensor Network Computations in Quantum Chemistry. Charles F. Van Loan Department of Computer Science Cornell University
Tensor Network Computations in Quantum Chemistry Charles F. Van Loan Department of Computer Science Cornell University Joint work with Garnet Chan, Department of Chemistry and Chemical Biology, Cornell
More informationCoupled Cluster Theory and Tensor Decomposition Techniques
Coupled Cluster Theory and Tensor Decomposition Techniques Alexander A. Auer Atomistic Modelling Group Interface Chemistry and Surface Engineering Hierarchy of quantum chemical methods Analysis of orbitals
More informationHomework 2 Foundations of Computational Math 2 Spring 2019
Homework 2 Foundations of Computational Math 2 Spring 2019 Problem 2.1 (2.1.a) Suppose (v 1,λ 1 )and(v 2,λ 2 ) are eigenpairs for a matrix A C n n. Show that if λ 1 λ 2 then v 1 and v 2 are linearly independent.
More informationNumerical Linear Algebra
Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation
More informationNumerical Methods I Solving Square Linear Systems: GEM and LU factorization
Numerical Methods I Solving Square Linear Systems: GEM and LU factorization Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 18th,
More informationEnhancing Scalability of Sparse Direct Methods
Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.
More informationProgramming models for quantum chemistry applications Jeff Hammond, James Dinan, Edgar Solomonik and Devin Matthews Argonne LCF and MCS, UC Berkeley a
Programming models for quantum chemistry applications, James Dinan, Edgar Solomonik and Devin Matthews Argonne LCF and MCS, UC Berkeley and UTexas 8 May 2012 Abstract (for posterity) Quantum chemistry
More informationI Don t Care about BLAS
I Don t Care about BLAS Devin Matthews Institute for Computational Engineering and Sciences, UT Austin #smallmoleculesmatter BLIS Retreat, Sept. 18, 217 1 From QC to DGEMM H ˆR i = E ˆR i Simple eigenproblem
More informationJ.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009
Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.
More informationScientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008
More informationLinear Algebra and Eigenproblems
Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details
More informationA communication-avoiding thick-restart Lanczos method on a distributed-memory system
A communication-avoiding thick-restart Lanczos method on a distributed-memory system Ichitaro Yamazaki and Kesheng Wu Lawrence Berkeley National Laboratory, Berkeley, CA, USA Abstract. The Thick-Restart
More informationQuantum Computing Lecture 2. Review of Linear Algebra
Quantum Computing Lecture 2 Review of Linear Algebra Maris Ozols Linear algebra States of a quantum system form a vector space and their transformations are described by linear operators Vector spaces
More informationSparse BLAS-3 Reduction
Sparse BLAS-3 Reduction to Banded Upper Triangular (Spar3Bnd) Gary Howell, HPC/OIT NC State University gary howell@ncsu.edu Sparse BLAS-3 Reduction p.1/27 Acknowledgements James Demmel, Gene Golub, Franc
More informationComputation of the mtx-vec product based on storage scheme on vector CPUs
BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix Computation of the mtx-vec product based on storage scheme on
More informationIndex. for generalized eigenvalue problem, butterfly form, 211
Index ad hoc shifts, 165 aggressive early deflation, 205 207 algebraic multiplicity, 35 algebraic Riccati equation, 100 Arnoldi process, 372 block, 418 Hamiltonian skew symmetric, 420 implicitly restarted,
More informationNumerical Linear and Multilinear Algebra in Quantum Tensor Networks
Numerical Linear and Multilinear Algebra in Quantum Tensor Networks Konrad Waldherr October 20, 2013 Joint work with Thomas Huckle QCCC 2013, Prien, October 20, 2013 1 Outline Numerical (Multi-) Linear
More informationNumerical Linear Algebra
Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationLecture 13 Stability of LU Factorization; Cholesky Factorization. Songting Luo. Department of Mathematics Iowa State University
Lecture 13 Stability of LU Factorization; Cholesky Factorization Songting Luo Department of Mathematics Iowa State University MATH 562 Numerical Analysis II ongting Luo ( Department of Mathematics Iowa
More informationAPPLIED NUMERICAL LINEAR ALGEBRA
APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation
More informationLU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b
AM 205: lecture 7 Last time: LU factorization Today s lecture: Cholesky factorization, timing, QR factorization Reminder: assignment 1 due at 5 PM on Friday September 22 LU Factorization LU factorization
More informationMolecular properties in quantum chemistry
Molecular properties in quantum chemistry Monika Musiał Department of Theoretical Chemistry Outline 1 Main computational schemes for correlated calculations 2 Developmentoftheabinitiomethodsforthe calculation
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationCpt S 223. School of EECS, WSU
Algorithm Analysis 1 Purpose Why bother analyzing code; isn t getting it to work enough? Estimate time and memory in the average case and worst case Identify bottlenecks, i.e., where to reduce time Compare
More information