Cyclops Tensor Framework

Cyclops Tensor Framework
Edgar Solomonik
Department of EECS, Computer Science Division, UC Berkeley
March 17, 2014

Definition of a tensor

A rank r tensor is an r-dimensional representation of a dataset. For example,
- a rank one tensor is a vector (e.g. a set of nodes V)
- a rank two tensor is a matrix (e.g. a set of edges E in a graph, E ⊆ V × V)

Graphically, a rank r tensor is a set of all possible paths P of length r through vertices V,

P ⊆ V × ... × V  (r times)

Alternatively, P may be thought of as a set of hypergraph edges with cardinality r. Programmatically, a rank r tensor is an r-dimensional array.

Tensor contractions

Given rank 4 tensors T, V, and W, we may perform a tensor contraction as

W_{abij} = Σ_{kl} T_{abkl} V_{klij}

It is common to use raised and lowered index notation, which is sometimes related to the physical meaning of the indices:

W^{ab}_{ij} = Σ_{kl} T^{ab}_{kl} V^{kl}_{ij}

Raised indices are usually meant to be contracted with lowered indices.
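For concreteness, here is a minimal serial sketch of this contraction written as explicit loops over flat arrays. The single mode length n and the row-major indexing helper are illustrative assumptions; this is not CTF code.

    #include <vector>

    int main() {
      const int n = 4; // all tensor modes of length n, for simplicity
      std::vector<double> T(n*n*n*n, 1.0), V(n*n*n*n, 1.0), W(n*n*n*n, 0.0);
      // row-major linearization of a 4-index coordinate
      auto idx = [n](int i, int j, int k, int l) { return ((i*n + j)*n + k)*n + l; };

      for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
          for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
              for (int k = 0; k < n; k++)   // contracted indices k, l
                for (int l = 0; l < n; l++)
                  W[idx(a,b,i,j)] += T[idx(a,b,k,l)] * V[idx(k,l,i,j)];
      return 0;
    }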

Tensor folding

Since a tensor is a representation of any data set, we may always switch representations:
- define a transformation δ^p_{ab} that maps a and b into a compound index p (δ^p_{ab} = 1 when p = a + b·n and 0 otherwise)
- graphically, folding corresponds to replacing edges with vertices (W = V × V)

The contraction W^{ab}_{ij} = Σ_{kl} T^{ab}_{kl} V^{kl}_{ij} may be folded into the matrix multiplication

W^p_q = Σ_r T^p_r V^r_q

where W^p_q = δ^p_{ab} W^{ab}_{ij} δ^{ij}_q, T^p_r = δ^p_{ab} T^{ab}_{kl} δ^{kl}_r, and V^r_q = δ^r_{kl} V^{kl}_{ij} δ^{ij}_q (with summation over the repeated tensor indices).
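A minimal sketch of the folded form: with compound indices p = a + b·n, r = k + l·n, and q = i + j·n, the contraction from the previous slide becomes one ordinary matrix product. The flat arrays and dimension n are illustrative assumptions.

    #include <vector>

    int main() {
      const int n = 4, N = n * n;   // N is the range of each compound index
      std::vector<double> T(N*N, 1.0), V(N*N, 1.0), W(N*N, 0.0);

      // W[p][q] = sum_r T[p][r] * V[r][q], i.e. plain matrix multiplication
      for (int p = 0; p < N; p++)
        for (int r = 0; r < N; r++)
          for (int q = 0; q < N; q++)
            W[p*N + q] += T[p*N + r] * V[r*N + q];
      return 0;
    }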

Reasons to use tensor representations

If all contractions can be folded into matrix multiplication, why use tensors of rank greater than two?
- permutational index symmetry: the tensors may express higher-dimensional structure
- expression of many different contractions with a single representation (each may require a different folding)
- finding low-rank tensor decompositions, such as the CP (CANDECOMP/PARAFAC) decomposition

T_{ijk} ≈ Σ_{r=1}^{R} v_{ir} w_{jr} z_{kr}

This talk will not address low-rank decompositions.

Application: Coupled Cluster

Coupled Cluster (CC) is a numerical approximation scheme to the time-independent many-body Schrödinger equation,

|Ψ⟩ = e^{T_1 + T_2 + T_3 + ...} |Φ_0⟩

where T_k is a rank 2k amplitude tensor which correlates sets of k electrons over sets of k basis functions (it captures k-electron excitations)
- the CCSD method is a truncation at T_1 + T_2
- the CCSDT method also includes T_3

The CC methods produce a set of nonlinear equations for the amplitude tensors, which are solved iteratively via tensor contractions. Given a system of n electrons, the methods require O(n^{2k}) memory and O(n^{2k+2}) operations.

Motivation and goals

Cyclops (cyclic-operations) Tensor Framework (CTF)
- provides primitives for distributed-memory tensor contractions
- takes advantage of thread (two-level) parallelism
- exposes a simple domain-specific language for contractions
- allows for efficient tensor redistribution and slicing
- exploits permutational tensor symmetry efficiently
- uses only MPI, BLAS, and OpenMP, and is a library

Define a parallel world

CTF relies on MPI (Message Passing Interface) for multiprocessor parallelism:
- a set of processors in MPI corresponds to a communicator (MPI_Comm)
- MPI_COMM_WORLD is the default communicator containing all processes
- CTF_World dw(comm) defines an instance of CTF on any MPI communicator
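A minimal sketch of setting up a CTF instance on all processes, assuming the CTF v1.x C++ interface used in this talk (header name and constructor as in the releases of that era); error handling is omitted.

    #include <mpi.h>
    #include <ctf.hpp>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      {
        CTF_World dw(MPI_COMM_WORLD); // CTF instance spanning all processes
        // ... define tensors on dw and contract them ...
      }                               // dw destructed before MPI_Finalize
      MPI_Finalize();
      return 0;
    }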

Define a tensor

Consider a rank four tensor T (in CC this is the 2-electron T_2 amplitude) T^{ab}_{ij}, where T is m × m × n × n, antisymmetric in ab and in ij:

CTF_Tensor T(4,{m,m,n,n},{AS,NS,AS,NS},dw);

- an AS dimension is antisymmetric with the next
- symmetric (SY) and symmetric-hollow (SH) are also possible
- the first dimension of the tensor is mapped linearly onto memory
- there are also obvious derived types for CTF_Tensor: CTF_Matrix, CTF_Vector, CTF_Scalar
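Continuing the sketch above (dw defined on MPI_COMM_WORLD), the same declaration with the lengths and symmetry labels spelled out as int arrays, which is how the v1.x interface takes them; m and n are illustrative sizes.

    int m = 40, n = 10;
    int lens[4] = {m, m, n, n};
    int sym[4]  = {AS, NS, AS, NS}; // antisymmetric in (a,b) and in (i,j)
    CTF_Tensor T(4, lens, sym, dw); // distributed tensor with packed symmetric storage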

Contract tensors

CTF can express a tensor contraction like

Z^{ab}_{ij} = Z^{ab}_{ij} + 2 P(a,b) Σ_k F^a_k T^{kb}_{ij}

where P(a,b) implies antisymmetrization of the index pair ab, as

Z["abij"] += 2.0*F["ak"]*T["kbij"];

- P(a,b) is applied implicitly if Z is antisymmetric in ab
- Z, F, T should all be defined on the same world, and all processes in the world must call the contraction bulk-synchronously
- the beginning of the end of all for loops...
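Continuing the sketch (T, lens, sym, m, and dw as above), a hedged sketch of this contraction; the CTF_Matrix constructor arguments are assumed from the v1.x interface.

    CTF_Tensor Z(4, lens, sym, dw); // same shape and symmetry as T
    CTF_Matrix F(m, m, NS, dw);     // F^a_k as a dense m-by-m matrix
    // the repeated index k is summed implicitly; P(a,b) is applied
    // implicitly because Z is declared antisymmetric in (a,b)
    Z["abij"] += 2.0*F["ak"]*T["kbij"];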

Access and write tensor data

CTF takes away your data pointer.

Access arbitrary sparse subsets of the tensor by global index (coordinate format):
- T.write(int* indices, double* data) (also possible to scale)
- T.read(int* indices, double* data) (also possible to scale)

Matlab submatrix notation, A[j:k, l:m]:
- T.slice(int* offsets, int* ends) returns the subtensor
- T.slice(int corner_off, int corner_end) does the same
- can also sum a subtensor of one tensor with a subtensor of another tensor
- different subworlds can read different subtensors simultaneously

Extract a subtensor of any permutation of the tensor:
- given mappings P, Q, permute() computes B[i,j] = A[P[i], Q[j]]
- P and Q may access only subsets of A (if B is smaller)
- B may be defined on subworlds of the world on which A is defined, and each subworld may specify different P and Q
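Continuing the sketch: bulk sparse access and slicing. The exact v1.x signatures are an assumption to check against the CTF headers (read/write also take a pair count, and indices are linearized global coordinates); all processes participate bulk-synchronously.

    int64_t idx[2]  = {0, 1};      // linearized global indices into T
    double  vals[2] = {0.5, -0.5};
    T.write(2, idx, vals);         // bulk sparse write of index-value pairs
    double out[2];
    T.read(2, idx, out);           // bulk sparse read of the same entries

    int offsets[4] = {0, 0, 0, 0}; // Matlab-style T[0:m/2, 0:m/2, 0:n, 0:n]
    int ends[4]    = {m/2, m/2, n, n};
    CTF_Tensor S = T.slice(offsets, ends);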

Extract from CCSD implementation

Extracted from Aquarius (Devin Matthews' code):

FMI["mi"] += 0.5*WMNEF["mnef"]*T(2)["efin"];
WMNIJ["mnij"] += 0.5*WMNEF["mnef"]*T(2)["efij"];
FAE["ae"] -= 0.5*WMNEF["mnef"]*T(2)["afmn"];
WAMEI["amei"] -= 0.5*WMNEF["mnef"]*T(2)["afin"];
Z(2)["abij"] = WMNEF["ijab"];
Z(2)["abij"] += FAE["af"]*T(2)["fbij"];
Z(2)["abij"] -= FMI["ni"]*T(2)["abnj"];
Z(2)["abij"] += 0.5*WABEF["abef"]*T(2)["efij"];
Z(2)["abij"] += 0.5*WMNIJ["mnij"]*T(2)["abmn"];
Z(2)["abij"] -= WAMEI["amei"]*T(2)["ebmj"];

Extract from CCSDT implementation

Extracted from Aquarius (Devin Matthews' code):

Z(1)["ai"] += 0.25*WMNEF["mnef"]*T(3)["aefimn"];
Z(2)["abij"] += 0.5*WAMEF["bmef"]*T(3)["aefijm"];
Z(2)["abij"] -= 0.5*WMNEJ["mnej"]*T(3)["abeinm"];
Z(2)["abij"] += FME["me"]*T(3)["abeijm"];
Z(3)["abcijk"] = WABEJ["bcek"]*T(2)["aeij"];
Z(3)["abcijk"] -= WAMIJ["bmjk"]*T(2)["acim"];
Z(3)["abcijk"] += FAE["ce"]*T(3)["abeijk"];
Z(3)["abcijk"] -= FMI["mk"]*T(3)["abcijm"];
Z(3)["abcijk"] += 0.5*WABEF["abef"]*T(3)["efcijk"];
Z(3)["abcijk"] += 0.5*WMNIJ["mnij"]*T(3)["abcmnk"];
Z(3)["abcijk"] -= WAMEI["amei"]*T(3)["ebcmjk"];

Coupled Cluster on IBM BlueGene/Q

CCSD up to 55 water molecules with cc-pVDZ; CCSDT up to 10 water molecules with cc-pVDZ.

[Figure: weak scaling on BlueGene/Q from 512 to 32768 nodes. Left panel: aggregate Teraflops for Aquarius-CTF CCSD and CCSDT; right panel: Gigaflops per node.]

Coupled Cluster on Cray XC30 (Edison)

CCSD up to 50 water molecules with cc-pVDZ; CCSDT up to 10 water molecules with cc-pVDZ.

[Figure: weak scaling on Edison from 32 to 4096 nodes. Left panel: aggregate Teraflops for Aquarius-CTF CCSD and CCSDT; right panel: Gigaflops per node.]

Comparison with NWChem

NWChem is a distributed-memory quantum chemistry method suite
- provides CCSD and CCSDT
- uses Global Arrays, a Partitioned Global Address Space (PGAS) backend, for tensor contractions
- contractions are derived automatically by the Tensor Contraction Engine (TCE)

CCSD performance on Edison (thanks to Jeff Hammond for building NWChem and collecting data):
- NWChem, 40 water molecules on 1024 nodes: 44 min
- CTF, 40 water molecules on 1024 nodes: 9 min

NWChem approach to contractions

A high-level description of NWChem's algorithm for tensor contractions:
- data layout is abstracted away by the Global Arrays framework
- Global Arrays uses one-sided communication for data movement
- packed tensors are stored in blocks
- for each contraction, each process does a subset of the block contractions
- each block is transposed and unpacked prior to contraction
- dynamic load balancing is employed among processors

CTF approach to contractions

A high-level description of CTF's algorithm for tensor contractions:
- packed tensors are decomposed cyclically among toroidal processor grids
- MPI collectives are used for all communication
- for each contraction, a distributed layout is selected based on internal performance models; the performance model considers all possible execution paths
- before a contraction, tensors are redistributed to a new layout
- if there is enough memory, the tensors are (partially) unpacked
- all preserved symmetries and non-symmetric indices are folded in preparation for matrix multiplication
- nested distributed matrix multiply algorithms are used to perform the contraction in a load-balanced manner

Input via sparse tensor reads and writes

In CTF, tensors are defined on a communicator (a subset or the full set of processors)
- the data pointer is hidden from the user
- the user can perform block-synchronous bulk writes and reads of index-value pairs
- to avoid communication, the user may read the current local pairs
- it is possible to perform overlapped writes (accumulate)

CTF internal implementation (all parts threaded), with step 1 sketched below:
1. bin keys by processor and redistribute
2. bin keys by virtual processor and then sort them
3. iterate over the dense tensor, reading or writing keys along the way
4. return keys to their originating location if it is a sparse read
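A minimal sketch of step 1 (an illustration, not CTF internals): binning index-value pairs by owner process under a 1-D cyclic distribution, where the global index i lives on process i mod p.

    #include <cstdint>
    #include <vector>

    struct Pair { int64_t key; double val; };

    std::vector<std::vector<Pair>> bin_by_process(const std::vector<Pair>& pairs, int p) {
      std::vector<std::vector<Pair>> bins(p);
      for (const Pair& kv : pairs)
        bins[kv.key % p].push_back(kv); // cyclic layout: owner = key mod p
      return bins;                      // next: exchange bins (e.g. MPI_Alltoallv)
    }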

Blocked vs block-cyclic vs cyclic decompositions

[Figure: a packed tensor laid out in blocked, block-cyclic, and cyclic fashion. Green denotes fill (unique values); red denotes padding / load imbalance.]
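A minimal sketch contrasting the three layouts for one tensor dimension of length n over p processes with block size b (illustrative helper functions, not CTF code). Under the cyclic layout every p-th element lands on the same process, which keeps the fill of packed symmetric tensors balanced across processes.

    #include <cstdio>

    int owner_blocked(int i, int n, int p)      { return i / ((n + p - 1) / p); }
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

    int main() {
      const int n = 8, p = 2, b = 2;
      for (int i = 0; i < n; i++)
        std::printf("i=%d  blocked->%d  cyclic->%d  block-cyclic->%d\n",
                    i, owner_blocked(i, n, p), owner_cyclic(i, p),
                    owner_block_cyclic(i, b, p));
      return 0;
    }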

Virtualization

[Figure: matrix multiplication A × B = C on a 2×3 processor grid. Red lines represent the virtualized part of the processor grid; elements are assigned to blocks by cyclic phase.]

3D tensor mapping

[Figure: a 3D tensor mapping.]

The mapping process

Do in parallel over all physical topologies (foldings of the original torus):
1. map the longest physical torus dimension to the longest tensor dimension, and repeat
2. select virtualization factors to preserve symmetry (as well as to match the algorithmic requirements)
3. calculate the necessary memory usage and communication cost of the algorithm
4. consider whether and what type of redistribution is necessary for the mapping
5. select the best mapping based on a performance model

Redistribution amongst different mappings

CTF must migrate tensors between different mappings between operations, as well as for slice()
- the key idea is to use the fact that the global ordering of the tensor values is preserved, to avoid formation of explicit key-value pairs

The redistribution kernel (threaded according to a global tensor partitioning), with the communication step sketched below:
1. iterate over the local sub-tensor in linear order and bin keys to determine counts
2. iterate over the local piece of the old tensor in global order and place keys into bins
3. MPI all-to-all-v
4. iterate over the local piece of the new tensor in global order and retrieve keys from bins
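A hedged sketch of the all-to-all-v exchange in step 3 (illustrative, not CTF internals): each process packs its bins contiguously, exchanges counts, then exchanges the binned values.

    #include <mpi.h>
    #include <numeric>
    #include <vector>

    void exchange(const std::vector<std::vector<double>>& bins, int p,
                  std::vector<double>& recvbuf, MPI_Comm comm) {
      std::vector<int> scnt(p), sdsp(p), rcnt(p), rdsp(p);
      std::vector<double> sendbuf;
      for (int i = 0; i < p; i++) {
        scnt[i] = (int)bins[i].size();
        sdsp[i] = (int)sendbuf.size(); // offset of bin i in the packed buffer
        sendbuf.insert(sendbuf.end(), bins[i].begin(), bins[i].end());
      }
      // everyone learns how much it will receive from each peer
      MPI_Alltoall(scnt.data(), 1, MPI_INT, rcnt.data(), 1, MPI_INT, comm);
      for (int i = 1; i < p; i++) rdsp[i] = rdsp[i-1] + rcnt[i-1];
      recvbuf.resize(std::accumulate(rcnt.begin(), rcnt.end(), 0));
      MPI_Alltoallv(sendbuf.data(), scnt.data(), sdsp.data(), MPI_DOUBLE,
                    recvbuf.data(), rcnt.data(), rdsp.data(), MPI_DOUBLE, comm);
    }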

Transposition of a tensor on a virtual processor grid

In some cases, it is necessary to change the assignment of tensor dimensions to virtual grid dimensions without changing the virtual processor grid itself
- in this case, CTF does not touch data within each block
- instead, the tensor is redistributed block by block, using MPI_Isend and MPI_Irecv for each sent and received block (see the sketch below)
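A minimal sketch of such a block exchange (assumptions, not CTF internals): block i is sent to dest[i] and replaced by a block received from src[i]; both sides are assumed to number blocks consistently and to know the block sizes.

    #include <mpi.h>
    #include <vector>

    void swap_blocks(std::vector<std::vector<double>>& blocks,
                     const std::vector<int>& dest, const std::vector<int>& src,
                     MPI_Comm comm) {
      int nb = (int)blocks.size();
      std::vector<std::vector<double>> incoming(nb);
      std::vector<MPI_Request> reqs;
      for (int i = 0; i < nb; i++) {
        incoming[i].resize(blocks[i].size()); // assumes uniform, known block sizes
        MPI_Request r;
        MPI_Irecv(incoming[i].data(), (int)incoming[i].size(), MPI_DOUBLE,
                  src[i], /*tag=*/i, comm, &r);
        reqs.push_back(r);
        MPI_Isend(blocks[i].data(), (int)blocks[i].size(), MPI_DOUBLE,
                  dest[i], /*tag=*/i, comm, &r);
        reqs.push_back(r);
      }
      MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
      blocks.swap(incoming); // adopt the received blocks
    }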

Local transposition

Once the data is redistributed into the new mapping, we fold the tensors locally within blocks:
- this turns all non-symmetric block contractions into matrix multiplication
- preserved symmetries may be folded into one dimension, but broken ones cannot
- dimensions whose symmetry cannot be folded into matrix multiplication are mapped to have the longest stride

Distributed contraction

Once the tensors are distributed accordingly, the contraction algorithm begins (a SUMMA step is sketched below):
1. replicate small tensors over some processor grid dimensions (2.5D/3D matrix multiplication algorithms)
2. nested distributed SUMMA (2D matrix multiplication algorithm)
3. call to iterate over virtual blocks
4. call to iterate over broken symmetric dimensions
5. call to DGEMM
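A hedged sketch of one 2-D SUMMA level (step 2), not CTF's actual implementation: C += A·B on a q×q process grid where each process owns an nb×nb block, with panels broadcast along process rows and columns; the inner triple loop stands in for the DGEMM call of step 5.

    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    void summa(const double* A, const double* B, double* C, int nb, int q,
               MPI_Comm comm) {
      int rank;
      MPI_Comm_rank(comm, &rank);
      int row = rank / q, col = rank % q;
      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(comm, row, col, &row_comm); // processes in my grid row
      MPI_Comm_split(comm, col, row, &col_comm); // processes in my grid column

      std::vector<double> Abuf(nb*nb), Bbuf(nb*nb);
      for (int k = 0; k < q; k++) {
        if (col == k) std::copy(A, A + nb*nb, Abuf.begin()); // owner of A panel k
        if (row == k) std::copy(B, B + nb*nb, Bbuf.begin()); // owner of B panel k
        MPI_Bcast(Abuf.data(), nb*nb, MPI_DOUBLE, k, row_comm);
        MPI_Bcast(Bbuf.data(), nb*nb, MPI_DOUBLE, k, col_comm);
        // local block product; CTF would call a (threaded) BLAS DGEMM here
        for (int i = 0; i < nb; i++)
          for (int l = 0; l < nb; l++)
            for (int j = 0; j < nb; j++)
              C[i*nb + j] += Abuf[i*nb + l] * Bbuf[l*nb + j];
      }
      MPI_Comm_free(&row_comm);
      MPI_Comm_free(&col_comm);
    }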

Rewind

Cyclops Tensor Framework is available at ctf.cs.berkeley.edu and github.com/solomonik/ctf
- latest version: v1.1 (March 2014); v1.0 was released in December 2013; development started in June 2011
- CCSD(T) and CCSDT(Q) methods are in development

Collaborators and acknowledgements

Collaborators:
- Devin Matthews, UT Austin (contributions to CTF, teaching me CC, and development of Aquarius on top of CTF)
- Jeff Hammond, Argonne National Laboratory (initiated the project, provides continuing advice, and runs NWChem for me when my patience runs out)
- James Demmel and Kathy Yelick, UC Berkeley (advising)

Grants:
- Krell DOE Computational Science Graduate Fellowship