Contracting symmetric tensors using fewer multiplications

Journal article by Edgar Solomonik and James W. Demmel, ETH Zurich Research Collection, 2016. License: In Copyright - Non-Commercial Use Permitted.

Contracting Symmetric Tensors Using Fewer Multiplications

Edgar Solomonik, Department of Computer Science, ETH Zürich, Switzerland
James Demmel, Department of EECS and Department of Mathematics, University of California, Berkeley, USA

Abstract

We present more computationally efficient algorithms for contracting symmetric tensors. Tensor contractions are reducible to matrix multiplication, but permutational symmetries of the data, which are expressed by the tensor representation, provide an opportunity for more efficient algorithms. Previously known methods have exploited only tensor symmetries that yield identical computations that are directly evident in the contraction expression. We present a new symmetry preserving algorithm that uses an algebraic reorganization in order to exploit considerably more symmetry in the computation of the contraction than the conventional approach. The new algorithm requires fewer multiplications but more additions per multiplication than previous approaches. The applications of this result include the capability to multiply a symmetric matrix by a vector, as well as to compute the rank-2 symmetric vector outer product, in half the number of scalar multiplications, albeit with more additions. The symmetry preserving algorithm can also be adapted to perform the complex versions of these operations, namely the product of a Hermitian matrix and a vector and the rank-2 Hermitian vector outer product, in 3/4 of the overall operations. Consequently, the number of operations needed for the direct algorithm to compute the eigenvalues of a Hermitian matrix is reduced by the same factor. Our symmetry preserving tensor contraction algorithm can also be adapted to the antisymmetric case and is therefore applicable to the tensor-contraction computations employed in quantum chemistry. For these applications, notably the coupled-cluster method, our algorithm yields the highest potential speed-ups, since in many higher-order contractions the reduction in the number of multiplications achieved by our algorithm enables an equivalent reduction in overall contraction cost. We highlight that for three typical coupled-cluster contractions taken from methods of three different orders, our algorithm achieves 2X, 4X, and 9X improvements in arithmetic cost over the standard approach.

Keywords: tensor contractions, symmetric tensors, matrix multiplication, symmetric matrices, coupled cluster

2010 MSC: 15A69, 15B57

1. Introduction

How many scalar multiplications are necessary to multiply an n-by-n symmetric matrix by a vector? Commonly, $n^2$ scalar multiplications are performed, since the symmetrically equivalent entries of the matrix are multiplied by different vector entries. This paper considers a method for such symmetric-matrix-by-vector multiplication that performs only $n^2/2$ scalar multiplications to leading order, at the overhead of extra addition operations. More generally, we give an algorithm for any fully symmetrized contraction of n-dimensional symmetric tensors of order $s+v$ and $v+t$, yielding a result of order $s+t$ with a total of $\omega = s+v+t$ indices, that uses $\binom{n}{\omega} := \binom{n+\omega-1}{\omega} \approx n^\omega/\omega!$ multiplications to leading order instead of the usual $\binom{n}{s}\binom{n}{t}\binom{n}{v} \approx n^\omega/(s!\,t!\,v!)$. This cost reduction arises from exploiting symmetry in all $\omega$ indices involved in the contraction, while the direct evaluation method exploits only the symmetry in the index groups which are preserved by the contraction (the three groups of $s$, $v$, and $t$ indices in the contraction equation).
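To make these counts concrete, the following snippet (an illustrative calculation only, not part of the paper's algorithms) evaluates both leading-order multiplication counts for a sample contraction:

```python
from math import comb

# Multiplication counts for a fully symmetric contraction with
# omega = s + t + v indices, each of range n (leading order):
#   symmetry preserving:  C(n + omega - 1, omega) ~ n^omega / omega!
#   direct evaluation:    C(n, s) C(n, t) C(n, v) ~ n^omega / (s! t! v!)
n, s, t, v = 100, 1, 1, 1
omega = s + t + v
print(comb(n + omega - 1, omega))             # 171700, about n^3/6
print(comb(n, s) * comb(n, t) * comb(n, v))   # 1000000, about n^3
```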
Our symmetry preserving algorithm forms different intermediate quantities, which all have symmetric structure, by accumulating some redundant terms, which are then cancelled out via low-order computations.

The technique we introduce is reminiscent of Gauss's trick, which reduces the number of scalar multiplications needed for complex multiplication from 4 to 3, but requires more additions. It is also similar to fast matrix multiplication algorithms such as Strassen's [1], and more closely to the simultaneous computation of the matrix products $AB$ and $CD$ discussed by Victor Pan [2]. What all of these methods have in common is the use of algebraic reorganizations of the stated problem which form intermediate quantities with a preferable computational structure. The symmetry preserving tensor contraction algorithm presented in this paper similarly forms intermediate tensors which include mathematically redundant subterms in order to maintain symmetric structure in the overall computation. Because the symmetry preserving algorithm performs fewer multiplications than additions, it is most beneficial when each multiplication costs more than an addition, which is the case when each tensor entry is a complex number or a matrix/tensor. This tensor contraction algorithm also has several special cases relevant to numerical matrix computations, including some of the Basic Linear Algebra Subroutines (BLAS) [3]. In particular, symv, symm, syr2, and syr2k can be done with half the multiplications. Furthermore, in complex arithmetic, the Hermitian matrix BLAS routines hemm and her2k, and LAPACK routines such as hetrd, can be done in 3/4 of the arithmetic operations overall. While numerically stable, the reformulated algorithms may incur somewhat larger numerical error for certain inputs and have slightly weaker error bounds.

We begin the paper by presenting the matrix-vector and matrix-matrix cases of the symmetry preserving tensor contraction algorithm in Section 2. These cases correspond to $s, t, v \le 1$, and are special cases of the symmetrized tensor contraction problem $C = A \odot^v B$, which sums permutations of the result of the nonsymmetric tensor contraction $C = A \cdot^v B$ so that $C$ is symmetric (these, along with other notation for tensors, are defined in Section 3). Throughout the paper, we focus on comparing two algorithms for the symmetrized tensor contraction problem: the direct evaluation algorithm $\Psi^{(s,t,v)}(A,B)$, given in general form in Section 4, and the new symmetry preserving algorithm $\Phi^{(s,t,v)}(A,B)$, which is given in general form and analyzed in Section 5. The matrix algorithms we present in Section 2 are special cases of the tensor algorithms $\Psi^{(s,t,v)}(A,B)$ and $\Phi^{(s,t,v)}(A,B)$. We present a proof of correctness of the new algorithm, a proof of numerical stability, and a cost analysis.

The symmetry preserving tensor contraction algorithm also generalizes to the case of antisymmetric tensors (when the value of the tensor changes sign with interchange of its indices). Further, for the antisymmetrized contraction $C = A \,\bar\odot^v B$, where the result $C$ is antisymmetric, we give an adaptation $\bar\Phi^{(s,t,v)}(A,B)$ of the symmetry preserving tensor contraction algorithm. Some combinations of symmetrized or antisymmetrized contractions of symmetric or antisymmetric tensors always yield a zero result for certain values of $s, t, v$, and another subset of the cases is incompatible with the symmetry preserving tensor contraction algorithm. We categorize all of the possible scenarios in Table 6.1 at the start of Section 6. Curiously, the cases for which the symmetry preserving tensor contraction algorithm may be used also seem to be the most useful for applications and extensions.
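To make the analogy with Gauss's trick concrete, here is a minimal sketch of the three-multiplication complex product (standard material, included only for illustration):

```python
def gauss_complex_mul(a, b, c, d):
    """Compute (a + bi)(c + di) with 3 real multiplications and 5 real
    additions/subtractions, instead of the usual 4 and 2."""
    t1 = c * (a + b)
    t2 = a * (d - c)
    t3 = b * (c + d)
    return t1 - t3, t1 + t2   # (real part, imaginary part)

assert gauss_complex_mul(1.0, 2.0, 3.0, 4.0) == (-5.0, 10.0)  # (1+2i)(3+4i)
```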
We employ all the combinations of antisymmetric/symmetric tensor contractions which are compatible with the symmetry preserving tensor contraction algorithm to give an algorithm for squaring a nonsymmetric matrix in two-thirds of the usual number of multiplications, as well as to provide a symmetry preserving contraction algorithm for Hermitian tensors.

The general form of the symmetry preserving algorithm is complicated, as are the proofs of correctness of the symmetric and antisymmetric adaptations. Therefore, we verified the correctness of our algorithms and the symmetric/antisymmetric structure of all intermediate tensors for an extensive range of $s, t, v$ values by implementing a prototype of the general algorithms using Cyclops Tensor Framework [4]. We leave high performance implementations as future work.

Lastly, we demonstrate that the direct evaluation and symmetry preserving contraction algorithms can be nested for tensors that are partially symmetric. In such partially symmetric cases, the reduction in the number of multiplications provided by the symmetry preserving algorithm yields a similar reduction in the number of total operations, since the scalar multiplications become tensor contractions while the scalar additions become tensor additions, and the former can generally be asymptotically more expensive than the latter. The resulting nested contraction algorithms are well-suited for tensor computations in quantum chemistry, especially coupled-cluster [5, 6]. We demonstrate that the symmetry preserving algorithm yields substantial speed-ups in the leading order number of operations for three sample contractions: 2X, 4X, and 9X for contractions selected from CCSD [7], CCSDT [8], and CCSDTQ [9], respectively.

The paper is organized as follows: Section 2 presents new algorithms for symmetric matrix and vector computations that lower the number of multiplications with respect to the direct evaluation approaches. Section 3 introduces our notation for arbitrary order index sets and defines the general form of the symmetric

tensor contraction problem. Section 4 gives the direct evaluation algorithms for nonsymmetric and symmetric tensor contractions of arbitrary order, $\Upsilon^{(s,t,v)}(A,B)$ and $\Psi^{(s,t,v)}(A,B)$, respectively. Section 5 presents the new symmetry preserving algorithm for symmetric tensor contractions, $\Phi^{(s,t,v)}(A,B)$, then proves its correctness, stability, and cost complexity. Section 6 illustrates how the algorithm $\Phi^{(s,t,v)}(A,B)$ can be adapted into $\bar\Phi^{(s,t,v)}(A,B)$ to handle antisymmetric tensors, and shows how the two can be used to compute Hermitian tensor contractions. Section 7 defines nested contractions of partially symmetric tensors, and Section 8 demonstrates how the new contraction algorithms can be nested, using nested forms to provide accelerated algorithms for a few sample coupled-cluster contractions. Our overall conclusion is that the symmetry preserving tensor contraction algorithm has potential for practical acceleration of several numerical algorithms, particularly those that involve high order partially symmetric contractions.

2. Symmetry Preserving Algorithms for Matrix and Vector Computations

To give an intuitive understanding of the general symmetric tensor contraction algorithm, as well as to cover some basic applications, we begin by giving a few special cases of it for matrix computations. We consider multiplication of a symmetric matrix with a vector ($s = 1$, $v = 1$, $t = 0$), the symmetric rank-2 outer product ($s = 1$, $v = 0$, $t = 1$), and the symmetrized multiplication of symmetric matrices, which is known as Jordan ring [10] symmetric matrix multiplication ($s = v = t = 1$).

In this section, we consider all vector/matrix/tensor entries to be elements of a possibly nonassociative ring over the set $R$ with operations $\varrho = (+, \cdot)$, the costs of which are $\nu_\varrho$ and $\mu_\varrho$, respectively. We denote vectors, matrices, and tensors in bold font (e.g. a, A), but not elements thereof (e.g. $a_i$, $A_{ij}$). In this section, vectors will usually be denoted using lowercase letters, but in later sections tensors of variable order will be denoted using uppercase letters. Throughout this section, we illustrate notation for the nonsymmetric contraction $\langle s,t,v\rangle[\cdot](A,B) \equiv A \cdot^v B$ and the symmetrized contraction $(s,t,v)[\cdot](A,B) \equiv A \odot^v B$, which are defined for arbitrary order tensors A of order $s+v$, B of order $v+t$, and C of order $s+t$ in Section 3. We also employ the notation of, and define special cases of, the nonsymmetric tensor contraction algorithm $\Upsilon^{(s,t,v)}(A,B)$ and the direct evaluation symmetric tensor contraction algorithm $\Psi^{(s,t,v)}(A,B)$, which are fully defined in Section 4 and correspond simply to evaluation of all unique multiplications in $A \cdot^v B$ and $A \odot^v B$, respectively. We also define special cases of the symmetry preserving contraction algorithm $\Phi^{(s,t,v)}(A,B)$, which is defined for arbitrary order tensors in Section 5. We use juxtaposition $AB$ to denote the vector inner product, vector outer product, matrix-vector multiplication, and matrix-matrix multiplication, as well as to scale a tensor by an integer or a fraction, e.g. $kA$ scales all elements of A by $k$.

2.1. Multiplication of a Symmetric Matrix by a Vector

Consider two vectors of $n$ elements, b and c, as well as an n-by-n symmetric matrix A. We seek to compute

$$c = (1,0,1)[\cdot](A, b) := A \odot^1 b := \Big(\forall i \in [1,n],\ c_i = \sum_{k=1}^n A_{ik} \cdot b_k\Big). \qquad (2.1)$$

The direct evaluation symmetric contraction algorithm $\Psi^{(1,0,1)}(A,b)$ simply treats A as nonsymmetric and computes all the multiplications in (2.1). In fact, since the result is not symmetrized, $A \odot^1 b = A \cdot^1 b$, and the direct evaluation symmetric algorithm is equivalent to the nonsymmetric algorithm, $\Psi^{(1,0,1)}(A,b) \equiv \Upsilon^{(1,0,1)}(A,b)$.
The cost of the direct evaluation algorithm is

$$F^\Psi_{\mathrm{symv}}(\varrho, n) = \mu_\varrho \cdot n^2 + \nu_\varrho \cdot n^2.$$

Our symmetry preserving algorithm for (2.1) employs a simple algebraic reorganization,

$$c = \Phi^{(1,0,1)}(A, b) := \begin{cases} \forall i \in [1,n],\ j \in [i,n]: & \hat Z_{ij} = A_{ij} \cdot (b_i + b_j), \\ \forall i \in [1,n]: & Z_i = \sum_{k=1}^{i} \hat Z_{ki} + \sum_{k=i+1}^{n} \hat Z_{ik}, \\ \forall i \in [1,n]: & A^{(1)}_i = \sum_{k=1}^{n} A_{ik}, \quad V_i = A^{(1)}_i \cdot b_i, \\ \forall i \in [1,n]: & c_i = Z_i - V_i. \end{cases} \qquad (2.2)$$
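A minimal sketch of (2.2) in Python (an illustration assuming ordinary real arithmetic; the loop structure is chosen to make the multiplication count explicit, not for performance):

```python
import numpy as np

def phi_symv(A, b):
    """Symmetry preserving symmetric matrix-vector product (2.2):
    one multiplication per unique (i, j) with i <= j, so ~n^2/2 products."""
    n = len(b)
    Z = np.zeros(n)
    for i in range(n):
        for j in range(i, n):
            zhat = A[i, j] * (b[i] + b[j])   # Zhat_ij = Zhat_ji: one product
            Z[i] += zhat
            if j != i:
                Z[j] += zhat                  # reuse; no second multiplication
    V = A.sum(axis=1) * b                     # V_i = A^(1)_i * b_i (n products)
    return Z - V

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A += A.T
b = rng.standard_normal(5)
assert np.allclose(phi_symv(A, b), A @ b)
```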

Since the elements $A_{ij}$ and $b_i + b_j$ are symmetric under permutation of $i$ with $j$, the elements $\hat Z_{ij} = A_{ij} \cdot (b_i + b_j)$ are also symmetric in $i, j$. Thus, $Z_i = \sum_{k=1}^n \hat Z_{ik}$ can be computed with $n^2/2$ multiplications to leading order. The formulation in (2.2) computes and accesses only the lower triangle of $\hat Z$; however, in later formulations within this paper, we will simply state that a given tensor is symmetric, compute the unique elements (e.g. the upper triangle of $\hat Z$), and access the entire tensor. The cost of $\Phi^{(1,0,1)}(A,b)$ is

$$F^\Phi_{\mathrm{symv}}(\varrho, n) = \mu_\varrho\Big(\tfrac12 n^2 + \tfrac32 n\Big) + \nu_\varrho\Big(\tfrac52 n^2 + O(n)\Big).$$

Now, if we consider multiplication of A by K vectors (equivalently, by an n-by-K matrix), the cost of computing $A^{(1)}$ is amortized over the K matrix-vector multiplications, yielding the overall cost

$$F^\Phi_{\mathrm{symm}}(\varrho, n, K) = \mu_\varrho\Big(\tfrac12 Kn^2 + O(Kn)\Big) + \nu_\varrho\Big(\tfrac32 Kn^2 + n^2 + O(Kn)\Big).$$

The symmetry preserving algorithm for the symm problem may be faster whenever $F^\Phi_{\mathrm{symm}}(\varrho,n,K) < K\,F^\Psi_{\mathrm{symv}}(\varrho,n)$, which is the case whenever $\mu_\varrho > \nu_\varrho$ and $n, K \gg 1$. That condition holds when the scalars are complex, since $\mu_\varrho \ge 3\nu_\varrho$ assuming multiplication of two real numbers costs at least as much as addition. There are two use-cases for the complex problem: one where A is complex and symmetric, e.g. the Discrete Fourier Transform (DFT) matrix, in which case the described algorithm applies without modification (apart from the real additions and multiplications becoming complex additions and multiplications); and the second, more common case where A is a Hermitian matrix. The latter case corresponds to the BLAS routine hemm, and our algorithm may be adapted to handle it as described in Section 6. In both cases, the symmetry preserving algorithm yields at least a 4/3 reduction in cost.

The symmetry preserving algorithm is also easily adaptable to the case when A is a sparse symmetric matrix with m non-zeros. Instead of the usual m multiplications and m additions, the symmetry preserving algorithm requires $\frac12 m$ multiplications and $\frac52 m$ additions to leading order (or $\frac32 m$ leading order additions when the computation of $A^{(1)}$ may be amortized). However, we note that for certain choices of A and b the symmetry preserving algorithm incurs more error (see Section 5.2).

2.2. Symmetrized Vector Outer Product

We now consider a rank-2 outer product of a column vector a and a row vector b, each of length n, forming an n-by-n symmetric matrix $C = ab + b^T a^T$, computing

$$C = (1,1,0)[\cdot](a, b) := a \odot^0 b := \Big(\forall i, j \in [1,n],\ C_{ij} = a_i \cdot b_j + a_j \cdot b_i\Big). \qquad (2.3)$$

$C = (1,1,0)[\cdot](a,b) = ab + b^T a^T$ so long as $\cdot$ is commutative. For floating point real numbers, (2.3) corresponds to the BLAS routine syr2. Usually, the syr2 routine is defined as $\mathrm{syr2}(a,b) = ab^T + b^T a$, which is equivalent to our definition, i.e. $C = \mathrm{syr2}(a, b^T)$, again so long as $\cdot$ is commutative. The direct evaluation algorithm $\Psi^{(1,1,0)}(a,b)$ computes the unique part of C using $n^2$ scalar multiplications and $n^2$ scalar additions, so $F^\Psi_{\mathrm{syr2}}(\varrho,n) = F^\Psi_{\mathrm{symv}}(\varrho,n)$.

Our symmetry preserving algorithm to compute (2.3) employs the following algebraic reorganization to compute the unique part of C,

$$C = \Phi^{(1,1,0)}(a,b) := \begin{cases} \forall i \in [1,n],\ j \in [i,n]: & Z_{ij} = \hat Z_{ij} = (a_i + a_j) \cdot (b_i + b_j), \\ \forall i \in [1,n]: & U^{(1)}_i = a_i \cdot b_i, \\ \forall i \in [1,n],\ j \in [i,n]: & W_{ij} = U^{(1)}_i + U^{(1)}_j, \quad C_{ij} = Z_{ij} - W_{ij}. \end{cases} \qquad (2.4)$$

The number of multiplications needed by (2.4) is the same as in the previous case of a symmetric matrix multiplied by a vector, since $\hat Z$ is again symmetric and can be computed via the $\approx n^2/2$ multiplications necessary to form its unique elements. However, the number of additions is slightly lower than in the symv case,

$$F^\Phi_{\mathrm{syr2}}(\varrho,n) = \mu_\varrho\Big(\tfrac12 n^2 + \tfrac32 n\Big) + \nu_\varrho\Big(2n^2 + 2n\Big).$$
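A matching sketch for (2.4), under the same assumptions as the symv sketch above:

```python
import numpy as np

def phi_syr2(a, b):
    """Symmetry preserving rank-2 symmetric outer product (2.4):
    C = a b^T + b a^T with ~n^2/2 multiplications."""
    n = len(a)
    C = np.zeros((n, n))
    u = a * b                                  # U^(1)_i = a_i b_i (n products)
    for i in range(n):
        for j in range(i, n):
            z = (a[i] + a[j]) * (b[i] + b[j])  # one product per unique (i, j)
            C[i, j] = C[j, i] = z - (u[i] + u[j])
    return C

a, b = np.arange(1.0, 5.0), np.linspace(0.0, 1.0, 4)
assert np.allclose(phi_syr2(a, b), np.outer(a, b) + np.outer(b, a))
```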
Now we consider the case of a rank-2K outer product where, instead of vectors a and b, we have an n-by-K matrix A and a K-by-n matrix B, which corresponds to the BLAS routine syr2k. In this case, we can compute W with low-order

cost, yielding a reduction in the number of additions needed. The resulting algorithm has the same leading order cost as symm,

$$F^\Phi_{\mathrm{syr2k}}(\varrho,n,K) = \mu_\varrho\Big(\tfrac12 Kn^2 + O(Kn)\Big) + \nu_\varrho\Big(\tfrac32 Kn^2 + n^2 + O(Kn)\Big).$$

Furthermore, to leading order, $F^\Phi_{\mathrm{syr2k}}$ has the same potential speedups over $F^\Psi_{\mathrm{syr2k}}$ as $F^\Phi_{\mathrm{symm}}$ had over $F^\Psi_{\mathrm{symm}}$. This algorithm may be adapted to the case of the Hermitian outer product $ab + b^* a^*$, where $*$ denotes the Hermitian adjoint (conjugate transpose) of the vector. The adaptation of the symmetry preserving algorithm to this operation, which also corresponds to the BLAS routine her2 (usually defined as $ab^* + ba^*$), is described in a general fashion in Section 6 and performs the same number of multiplication and addition operations of complex numbers as $\Phi^{(1,1,0)}(a,b)$. The extension to her2k has the same operation count as given by $F^\Phi_{\mathrm{syr2k}}$ and is therefore 4/3 cheaper in leading order cost than the direct evaluation algorithm for her2k. So, since we now know how to compute hemm and her2k in 3/4 of the operations to leading order, we can compute the reduction to tridiagonal form for the Hermitian eigensolver (LAPACK routine hetrd), which can be formulated so that the cost is dominated by these two subroutines, in 3/4 of the usual number of operations to leading order.

2.3. Jordan Ring Symmetric Matrix Multiplication

As a last example, we consider the case of multiplying two symmetric matrices and symmetrizing the product, $C = AB + B^T A^T$, which corresponds to $s = t = v = 1$, and in which all terms of the general form of $\Phi^{(1,1,1)}(A,B)$ take part. This symmetrized product corresponds to multiplication on a Jordan ring [10] of symmetric matrices. Given symmetric matrices A, B of dimension n, the problem is to compute

$$C = (1,1,1)[\cdot](A,B) := A \odot^1 B := \Big(\forall i,j \in [1,n],\ C_{ij} = \sum_{k=1}^n (A_{ik} \cdot B_{kj} + A_{jk} \cdot B_{ki})\Big). \qquad (2.5)$$

$C = (1,1,1)[\cdot](A,B) = AB + B^T A^T$ so long as $\cdot$ is commutative. The direct evaluation symmetric tensor contraction algorithm $\Psi^{(1,1,1)}(A,B)$ simply multiplies $\bar C = \Upsilon^{(1,1,1)}(A,B) := AB$ as nonsymmetric matrices and symmetrizes the product, $C = \bar C + \bar C^T$, for a total cost of

$$F^\Psi_{\mathrm{syrmm}}(\varrho,n) = \mu_\varrho \cdot n^3 + \nu_\varrho \cdot n^3 + O(\nu_\varrho n^2).$$

$\Phi^{(1,1,1)}(A,B)$ yields an algorithm for (2.5) by doing only the $\approx n^3/6$ leading order multiplications required to compute all unique values of the third order symmetric tensor $\hat Z$,

$$\forall i \in [1,n],\ j \in [i,n],\ k \in [j,n], \quad \hat Z_{ijk} = (A_{ij} + A_{ik} + A_{jk}) \cdot (B_{ij} + B_{ki} + B_{kj}). \qquad (2.6)$$

Subsequently, the rest of the work can be done in a low-order number of multiplications as follows (the formulas below access elements of $\hat Z$ that are not defined in (2.6), but which are all symmetrically equivalent to some element computed by (2.6) when A and B are symmetric):

$$\forall i \in [1,n],\ j \in [i,n]: \quad Z_{ij} = \sum_{k=1}^n \hat Z_{ijk}, \qquad A^{(1)}_i = \sum_{k=1}^n A_{ik}, \qquad B^{(1)}_i = \sum_{k=1}^n B_{ki},$$
$$V_{ij} = n\,A_{ij} \cdot B_{ij} + (A^{(1)}_i + A^{(1)}_j) \cdot B_{ij} + A_{ij} \cdot (B^{(1)}_i + B^{(1)}_j),$$
$$U^{(1)}_l = \sum_{k=1}^n A_{lk} \cdot B_{kl}, \qquad W_{ij} = U^{(1)}_i + U^{(1)}_j, \qquad C_{ij} = Z_{ij} - V_{ij} - W_{ij}. \qquad (2.7)$$

The third order tensor $\hat Z$ does not need to be stored all at once; instead its unique elements can be immediately accumulated into Z, or even C, so this symmetry preserving algorithm does not require extra storage. Due to the symmetry in the computation of $Z_{ij}$, and the fact that all other terms are low-order, the cost of $\Phi^{(1,1,1)}(A,B)$ is

$$F^\Phi_{\mathrm{syrmm}}(\varrho,n) = \mu_\varrho \cdot \tfrac16 n^3 + \nu_\varrho \cdot \tfrac76 n^3 + O\big((\mu_\varrho + \nu_\varrho)\, n^2\big),$$

so, to leading order, $\Phi^{(1,1,1)}(A,B)$ requires a factor of 6 fewer multiplications and a factor of 3/2 fewer total operations than $\Psi^{(1,1,1)}(A,B)$. A sketch of this special case is given below.
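The following Python sketch implements (2.6) and (2.7) directly (assuming real arithmetic; the triple loop is written for clarity, not speed):

```python
import numpy as np

def phi_jordan(A, B):
    """Symmetry preserving Jordan product C = AB + BA for symmetric A, B
    via (2.6)-(2.7): ~n^3/6 multiplications form the unique Zhat values."""
    n = A.shape[0]
    Z = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            for k in range(j, n):
                zh = (A[i,j] + A[i,k] + A[j,k]) * (B[i,j] + B[i,k] + B[j,k])
                # Zhat_ijk contributes to Z_pq for each distinct pair
                # {p, q} drawn from {i, j, k} (upper triangle only).
                for p, q in {(i, j), (i, k), (j, k)}:
                    Z[p, q] += zh
    A1, B1 = A.sum(axis=1), B.sum(axis=1)
    U = (A * B).sum(axis=1)                    # U^(1)_l = sum_k A_lk B_kl
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            V = n*A[i,j]*B[i,j] + (A1[i]+A1[j])*B[i,j] + A[i,j]*(B1[i]+B1[j])
            C[i, j] = C[j, i] = Z[i, j] - V - (U[i] + U[j])
    return C

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)); A += A.T
B = rng.standard_normal((6, 6)); B += B.T
assert np.allclose(phi_jordan(A, B), A @ B + B @ A)
```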

However, we note that for certain choices of A and B the symmetry preserving algorithm incurs more error (see Section 5.2).

A special case of Jordan ring symmetric matrix multiplication is squaring a symmetric matrix, $\tfrac12 (1,1,1)[\cdot](A,A) = AA = A^2$, where $\cdot$ need not be commutative as in the general case. For this case, a typical approach might exploit only the symmetry of the output (computing only its $n(n+1)/2$ unique entries), performing $n^3/2$ multiplications and additions to leading order, while the symmetry preserving algorithm, $A^2 = \tfrac12 \Phi^{(1,1,1)}(A,A)$, would need $n^3/6$ multiplications and $5n^3/6$ additions to leading order (since the $2n^3/6$ additions forming the B operand sums may now be avoided). While the total number of operations is the same, a factor of three fewer multiplications is needed by the symmetry preserving algorithm.

3. Tensor Notation and Definitions

We now introduce notation for arbitrary order tensor indices and their manipulations, and define the nonsymmetric and symmetric tensor contraction problems. Our notation is similar to that of Kolda and Bader [11], but with a number of additional constructs. Our definitions of tensor contractions are made in anticipation of the generality of the symmetry preserving contraction algorithms (Section 5) and for compatibility with nested contractions (Section 7).

Definition 3.1. We denote a d-tuple of positive integers as $i_d := (i_1, \ldots, i_d)$. These tuples will be used as tensor indices, and each entry will typically range from 1 to n, so $i_d \in [1,n]^d$. We concatenate tuples using the notation $i_d\, j_f := (i_1, \ldots, i_d, j_1, \ldots, j_f)$. We refer to the first $g < d$ elements of $i_d$ as $i_g$. We employ one-sized tuples $i_1 := (i_1)$, which are similarly concatenated, $i_d\, j_1 = (i_1, \ldots, i_d, j_1)$, as well as zero-sized tuples $j_0 = ()$, with $i_d\, j_0 = i_d$. Given an order d tensor A, we refer to its elements using the notation $A_{i_d} = A_{i_1 \ldots i_d}$. When summing over all tuples in $[1,n]^d$, rather than writing d summations, we write simply $\sum_{i_d} := \sum_{i_1=1}^n \cdots \sum_{i_d=1}^n$, in which the range is $i_d \in [1,n]^d$.

Definition 3.2. We refer to the space of increasing d-tuples with values between 1 and n as $[1,n]^d_\le$, meaning $\forall i_d \in [1,n]^d_\le$, $i_1 \le \ldots \le i_d$, and to the space of strictly increasing tuples as $[1,n]^d_<$, meaning $\forall i_d \in [1,n]^d_<$, $i_1 < \ldots < i_d$.

Definition 3.3. Let the set of all possible d-dimensional permutation functions be $\Pi_d$, where each $\pi \in \Pi_d$ is associated with a unique bijection $\hat\pi : [1,d] \to [1,d]$, so that $\pi(i_d) := (i_{\hat\pi(1)}, \ldots, i_{\hat\pi(d)})$. The number of such functions is $|\Pi_d| = d!$. We denote the collection of all permutations of a tuple $i_d$ as $\Pi(i_d) := [\pi(i_d) : \pi \in \Pi_d]$. We note that the collection of all permutations of a tuple with repeated values will include identical tuples, e.g. $\Pi((1,2,2)) = [(1,2,2), (1,2,2), (2,1,2), (2,1,2), (2,2,1), (2,2,1)]$. Therefore, $|\Pi(i_d)| = d!$ for any $i_d$.

Definition 3.4. We say an n-dimensional order-d tensor T is symmetric if for all $i_d \in [1,n]^d$ and for all $\pi \in \Pi_d$, $T_{i_d} = T_{\pi(i_d)}$.

Definition 3.5. We define the disjoint partition $\chi^p_q(k_r)$ as the collection of all pairs of tuples of size p and q which are disjoint subcollections of $k_r$ and preserve the ordering of elements in $k_r$; in other words, if $k_i$ and $k_j$ appear in the same tuple of a partition and $i < j$, then $k_i$ must appear before $k_j$.
For example, the possible ordered partitions of $k_3 = (k_1, k_2, k_3)$ into pairs of tuples of size one and two are the collection

$$\chi^1_2(k_3) = [\,((k_1), (k_2,k_3)),\ ((k_2), (k_1,k_3)),\ ((k_3), (k_1,k_2))\,].$$

We can formally define the collection $\chi^p_q(k_r)$ inductively. Base cases: for $r < p+q$, $\chi^p_q(k_r) := []$; for $r = 0$ and $p = q = 0$, $\chi^0_0(k_0) := [((), ())]$.

Inductive cases for $\chi^p_q(k_r)$: if $p > 0$ and $q > 0$,

$$\chi^p_q(k_r) := [\,(i_{p-1}\,k_r,\ j_q) : (i_{p-1}, j_q) \in \chi^{p-1}_q(k_{r-1})\,] \ \cup\ [\,(i_p,\ j_{q-1}\,k_r) : (i_p, j_{q-1}) \in \chi^p_{q-1}(k_{r-1})\,] \ \cup\ [\,(i_p, j_q) : (i_p, j_q) \in \chi^p_q(k_{r-1})\,];$$

if $p = 0$,

$$\chi^0_q(k_r) := [\,((),\ j_{q-1}\,k_r) : ((), j_{q-1}) \in \chi^0_{q-1}(k_{r-1})\,] \ \cup\ [\,((), j_q) : ((), j_q) \in \chi^0_q(k_{r-1})\,];$$

if $q = 0$,

$$\chi^p_0(k_r) := [\,(i_{p-1}\,k_r,\ ()) : (i_{p-1}, ()) \in \chi^{p-1}_0(k_{r-1})\,] \ \cup\ [\,(i_p, ()) : (i_p, ()) \in \chi^p_0(k_{r-1})\,].$$

We denote all possible ordered subcollections as $\chi^d(k_{d+f}) := [\,i_d : (i_d, j_f) \in \chi^d_f(k_{d+f})\,]$. When it is clear from context which partition is needed, we omit the superscript and subscript from $\chi$ completely; for example, $i_d \in \chi(k_{d+f})$ is equivalent to $i_d \in \chi^d(k_{d+f})$, while $(i_d, j_f) \in \chi(k_{d+f})$ is equivalent to $(i_d, j_f) \in \chi^d_f(k_{d+f})$. We note that for any $d, f \ge 0$, the union of all concatenations of all possible permutations of the ordered subcollections produced by $\chi^d_f$ is equal to all possible permutations of the partitioned tuple,

$$\Pi(k_{d+f}) = [\, i'_d\, j'_f : (i_d, j_f) \in \chi(k_{d+f}),\ i'_d \in \Pi(i_d),\ j'_f \in \Pi(j_f) \,].$$

Definition 3.6. For any $s, t, v \ge 0$ with $\omega := s+t+v$, we denote a tensor contraction between A of order $s+v$ with elements in an Abelian group $(R_A, +)$ and B of order $t+v$ with elements in an Abelian group $(R_B, +)$, into C of order $s+t$ with elements in an Abelian group $(R_C, +)$ (all with dimension/edge-length n), via any operator $\star : R_A \times R_B \to R_C$, as

$$C = \langle s,t,v\rangle[\star](A,B) := A \cdot^v_\star B := \Big(\forall j_s\, l_t \in [1,n]^{s+t},\ C_{j_s l_t} = \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t}\Big). \qquad (3.1)$$

In the case of $v = 0$, the sum over $k_v$ disappears (one multiplication is done instead of a sum of them). Until Section 8, we will focus on the case where $R = R_A = R_B = R_C$, so all the tensor elements are in some possibly nonassociative ring [12] $(R, +, \cdot)$. We will employ the notation $A \cdot^v B$ rather than $\langle s,t,v\rangle[\cdot](A,B)$, unless we want to specify a special multiplicative operator other than $\cdot$ (we start using special operators to define nested contractions in Section 7). We will also concentrate our analysis and proofs on cases where at least two of $s, t, v$ are nonzero, since otherwise one of the tensors is a scalar and the problem is trivial. The algorithms we present reduce to an equivalent form and have the same cost in the cases when only one of $s, t, v$ is nonzero. Throughout further contraction definitions and algorithms, we will always denote $\omega := s+t+v$ and assume $n \ge \omega$. We will also introduce algorithm-dependent restrictions on the properties of $\star$; in particular, distributivity will be needed for the symmetry preserving algorithm. We note for now that when $\star$ is distributive, so is $\langle s,t,v\rangle[\star]$.

Often, contractions are written with indices in a different ordering than the one ascribed in (3.1); however, it is always possible to define a new tensor of the same order and contents whose indices have the desired ordering (e.g. by transposing a matrix). We will not consider the overhead of transposing tensor indices in this paper, although the initial layout of a tensor is important to consider in practice, especially when computing sequences of contractions on distributed-memory systems [4].

Definition 3.7. For any $s, t, v \ge 0$, a symmetrized contraction is a contraction between tensors A and B into C where the result is symmetrized, i.e.

$$C = (s,t,v)[\star](A,B) := A \odot^v_\star B := \Big(\forall i_{s+t} \in [1,n]^{s+t},\ C_{i_{s+t}} = \sum_{j_s l_t \in \Pi(i_{s+t})} \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t}\Big). \qquad (3.2)$$

The resulting tensor C satisfying (3.2) is always symmetric. We note that the symmetrized contraction is distributive so long as the underlying scalar product is distributive, that is, $(s,t,v)[\star](A, B_1 + B_2) = (s,t,v)[\star](A, B_1) + (s,t,v)[\star](A, B_2)$.
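A brute-force rendition of Definition 3.7 is useful for cross-checking the algorithms that follow; a sketch (assuming real arithmetic, exponential cost in $\omega$, for testing only):

```python
import numpy as np
from itertools import permutations, product

def symmetrized_contraction(A, B, n, s, t, v):
    """Brute-force evaluation of (3.2): for each i_{s+t}, sum over all
    (j_s l_t) in Pi(i_{s+t}) (with repeats, as in Definition 3.3) of
    sum_{k_v} A_{j_s k_v} * B_{k_v l_t}."""
    C = np.zeros([n] * (s + t))
    for i in product(range(n), repeat=s + t):
        for pi in permutations(range(s + t)):       # all (s+t)! orderings
            j = tuple(i[pi[p]] for p in range(s))
            l = tuple(i[pi[p]] for p in range(s, s + t))
            C[i] += sum(A[j + k] * B[k + l]
                        for k in product(range(n), repeat=v))
    return C

# s = t = v = 1 with commutative multiplication: C = AB + B^T A^T
A = np.arange(9.0).reshape(3, 3)
B = np.ones((3, 3)) + np.eye(3)
assert np.allclose(symmetrized_contraction(A, B, 3, 1, 1, 1),
                   A @ B + B.T @ A.T)
```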

For $s = 1$, $t = 0$, $v = 1$, (3.2) becomes multiplication of a matrix A with a vector b, $A \odot^1 b := Ab$. For $s = 1$, $t = 1$, $v = 0$ and commutative $\cdot$, (3.2) becomes the rank-two vector outer product of a column vector a and a row vector b, $a \odot^0 b := ab + b^T a^T$ (our definition of tensors does not distinguish between row and column vectors, as our definition of contractions permits vectors to behave as either, e.g. $a \cdot^1 b$ is the inner product). For $s = 1$, $t = 1$, $v = 1$ and commutative $\cdot$, (3.2) becomes symmetrized multiplication of square matrices A and B, $A \odot^1 B := AB + B^T A^T$.

Definition 3.8. When A and B are symmetric, we call a symmetrized contraction between A and B, $A \odot^v B$, a fully symmetric contraction. We note that the fully symmetric contraction is commutative so long as the underlying scalar product is commutative, that is, $A \odot^v B = B \odot^v A$ (this is less evident in the more descriptive notation, $(s,t,v)[\cdot](A,B) = (t,s,v)[\cdot](B,A)$, but given v, the variables s and t are implied by the tensor orders of A and B). However, $(s,t,v)[\cdot]$ is not associative, and in order to nest $(s,t,v)[\star]$ over different symmetric sets of indices or a single index (e.g. by defining $(s_1,t_1,v_1)[(s_2,t_2,v_2)[\cdot]]$, see Section 7), we will not require $\star$ to be associative. We will return to and elaborate on this when we consider partially symmetric tensors in Section 7. An example of a fully symmetric contraction is the symmetrized multiplication of real symmetric n-by-n matrices, $C = A \odot^1 B = AB + BA$.

In the following sections, as we study algorithms for computing (3.2), we will perform summations over groups of indices of symmetric tensors. In all such summations the symmetry of the summed indices can be exploited, so we introduce the following special summation notation for brevity.

Definition 3.9. Let $\rho(k_v) = v! / \prod_{i=1}^{l} m_i!$, where $m_i$ is the multiplicity of the ith of the $1 \le l \le v$ unique values in $k_v$, and define

$$\hat\sum_{k_v} f(k_v) := \sum_{k_v \in [1,n]^v_\le} \rho(k_v)\, f(k_v),$$

where f is any function whose range is inside an Abelian group. By restricting f to yield only values of an Abelian group, we make it possible to sum the results as well as to scale by a positive integer $\rho$.

Lemma 3.1. Whenever f is symmetric under any permutation of $k_v$ (i.e. $\forall \pi \in \Pi_v$, $f(k_v) = f(\pi(k_v))$),

$$\hat\sum_{k_v} f(k_v) = \sum_{r_v} f(r_v).$$

Proof. By Definition 3.9, $\hat\sum_{k_v} f(k_v) := \sum_{k_v \in [1,n]^v_\le} \rho(k_v) f(k_v)$. For any tuple $k_v \in [1,n]^v_\le$, consider the collection $P(k_v)$ of all $q_v \in [1,n]^v$ such that $q_v = \pi(k_v)$ for some $\pi \in \Pi_v$. We have $|P(k_v)| = \rho(k_v) = v!/\prod_{i=1}^l m_i!$, where $m_i$ is the multiplicity of the ith of the $1 \le l \le v$ unique values in $k_v$. Since $\sum_{r_v}$ is over all $r_v \in [1,n]^v$, all tuples in $P(k_v)$ contribute to that sum for each $k_v \in [1,n]^v_\le$, and every $r_v \in [1,n]^v$ is in $P(k_v)$ for a unique $k_v \in [1,n]^v_\le$. Further, since f is symmetric, we have $\forall q_v \in P(k_v)$, $f(q_v) = f(k_v)$, and so scaling each $f(k_v)$ by $\rho(k_v)$ and summing over $k_v \in [1,n]^v_\le$ yields the same result as summing over all $r_v \in [1,n]^v$.

Throughout the rest of the paper, we will primarily study fully symmetric contractions and, in Section 5, introduce a new algorithm that evaluates them with fewer multiplications using an algebraic reorganization. After analyzing this new algorithm, we will come back in Section 6 to consider the evaluation of $A \odot^v B$ when A and B are antisymmetric, as well as to define $A \,\bar\odot^v B$, an antisymmetrized contraction of A and B. Using these, we will also define and formulate algorithms for Hermitian contractions.
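The multiplicity factor $\rho$ and Lemma 3.1 can be spot-checked numerically; a small sketch, assuming ordinary integer arithmetic:

```python
import math
from itertools import combinations_with_replacement, product

def rho(k):
    """rho(k_v) = v! / prod(m_i!): the number of distinct permutations of k."""
    r = math.factorial(len(k))
    for x in set(k):
        r //= math.factorial(k.count(x))
    return r

n, v = 4, 3
f = lambda k: math.prod(k)   # any function symmetric under permutation of k
ordered = sum(rho(k) * f(k)
              for k in combinations_with_replacement(range(1, n + 1), v))
full = sum(f(k) for k in product(range(1, n + 1), repeat=v))
assert ordered == full       # Lemma 3.1
```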
4. Direct Evaluation Tensor Contraction Algorithms

We now discuss algorithms for evaluating $A \cdot^v B$ (Definition 3.6) for nonsymmetric A and B, and $A \odot^v B$ for symmetric A and B (Definition 3.8), that follow directly from the algebraic definitions in the previous section. Each algorithm we give is a specification of the scalar multiplications which are computed, but not of the order of the

summations. Such a specification makes it possible to count the number of operations, but also to lower bound the communication cost required by any parallel schedule using any summation order to compute the algorithm [13] (we leave this as future work and consider only computation cost in this paper). For any pair of operators $\varrho = (\star, +)$, we associate the computation cost $\mu_\varrho$ with the distributive operator $\star$ on the elements of A and B (usually multiplication), and we associate the computation cost $\nu^C_\varrho$ with the Abelian group operator $+$ on the elements of C (usually addition). We ignore the costs associated with scaling values by a constant prefactor, which can in certain cases be significantly lower than those of the multiplication operator, and in most cases may be applied to either the output or the input tensors.

Many of the direct evaluation algorithms correspond to performing matrix multiplication of an $m \times k$ matrix and a $k \times n$ matrix, resulting in an $m \times n$ matrix. We assume that the cost of matrix multiplication is, to leading order,

$$F_{\mathrm{MM}}(\varrho, m, n, k) = (\nu^C_\varrho + \mu_\varrho)\, mnk - \nu^C_\varrho\, mn,$$

which corresponds to the cost of the classical matrix multiplication algorithm, and not a faster approach such as Strassen's algorithm [1]. The term $-\nu^C_\varrho\, mn$ is due to the fact that the first mn multiplications may be written to the output rather than accumulated.

4.1. Direct Evaluation of Nonsymmetric Tensor Contractions

We first consider the cost of the trivial algorithm, which contracts nonsymmetric tensors A and B by evaluating (3.1).

Algorithm 4.1 ($C = \Upsilon^{(s,t,v)}(A,B)$). For any contraction $C = \langle s,t,v\rangle[\star](A,B)$ with any operator $\star : R_A \times R_B \to R_C$, we define $\Upsilon^{(s,t,v)}(A,B)$ to evaluate the multiplications ascribed directly by (3.1) in Definition 3.6.

The correctness of the nonsymmetric contraction algorithm, namely that $\Upsilon^{(s,t,v)}(A,B) = A \cdot^v B$, is trivially true. We note that the formulation of $\Upsilon^{(s,t,v)}(A,B)$ requires direct evaluation of the multiplications in Definition 3.6 and therefore precludes reorganizations characteristic of fast matrix multiplication algorithms such as Strassen's algorithm [1] (although Strassen's algorithm can be used when each tensor entry is itself a matrix and $\star$ corresponds to matrix multiplication).

Theorem 4.1. The computation cost of $\Upsilon^{(s,t,v)}(A,B)$ is

$$F^\Upsilon(\varrho, n, s, t, v) = F_{\mathrm{MM}}(\varrho, n^s, n^t, n^v) = (\nu^C_\varrho + \mu_\varrho)\, n^\omega - \nu^C_\varrho\, n^{s+t}.$$

Proof. The algorithm $\Upsilon^{(s,t,v)}(A,B)$ is equivalent to a single matrix multiplication of matrices $\bar A$ and $\bar B$ of dimensions $n^s \times n^v$ and $n^v \times n^t$, respectively, which results in a matrix $\bar C$. The matrix multiplication $\bar C = \bar A \bar B$ computes

$$\bar C_{JL} = \sum_{K=1}^{n^v} \bar A_{JK} \star \bar B_{KL},$$

where the elements of the matrices are

$$\bar A_{JK} := A_{j_s k_v} \ \text{where}\ J = \sum_{i=1}^s j_i\, n^{i-1} \ \text{and}\ K = \sum_{i=1}^v k_i\, n^{i-1},$$
$$\bar B_{KL} := B_{k_v l_t} \ \text{where}\ K = \sum_{i=1}^v k_i\, n^{i-1} \ \text{and}\ L = \sum_{i=1}^t l_i\, n^{i-1},$$
$$\bar C_{JL} := C_{j_s l_t} \ \text{where}\ J = \sum_{i=1}^s j_i\, n^{i-1} \ \text{and}\ L = \sum_{i=1}^t l_i\, n^{i-1}.$$

The computation cost of $\Upsilon^{(s,t,v)}(A,B)$ is therefore equal to the cost of multiplying matrices of dimensions $n^s \times n^v$ and $n^v \times n^t$ into a matrix of dimensions $n^s \times n^t$, $F^\Upsilon(\varrho,n,s,t,v) = F_{\mathrm{MM}}(\varrho, n^s, n^t, n^v)$. Since $\omega = s+t+v$, this nonsymmetric tensor contraction requires $n^\omega - n^{s+t}$ additions and $n^\omega$ multiplications.
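The flattening used in this proof is easy to exercise; a sketch for one choice of orders (illustrative only, using numpy's row-major reshape):

```python
import numpy as np

n, s, t, v = 3, 2, 1, 2
A = np.random.rand(*([n] * (s + v)))      # order s+v tensor
B = np.random.rand(*([n] * (v + t)))      # order v+t tensor

# Upsilon^{(s,t,v)}: fold the free and contracted index groups into
# single axes of sizes n^s, n^v, n^t and do one matrix multiplication.
C = (A.reshape(n**s, n**v) @ B.reshape(n**v, n**t)).reshape([n] * (s + t))

# cross-check against an explicit index-wise contraction for these orders
assert np.allclose(C, np.einsum('abcd,cde->abe', A, B))
```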

4.2. Direct Evaluation of Fully Symmetric Contractions

The nonsymmetric algorithm may be used to compute (3.2), with only an additional step of symmetrizing the result of the multiplication between A and B. However, when A and B are symmetric, many of the scalar multiplications in (3.2) are equivalent. The following algorithm evaluates $A \odot^v B$ by computing only the unique multiplications and scaling them appropriately. In particular, since C is symmetric, it is no longer necessary to compute all possible orderings of the indices $i_{s+t} \in [1,n]^{s+t}$ in $A \odot^v B$, but only those in increasing order, $i_{s+t} \in [1,n]^{s+t}_\le$, as these include all unique values of C. Further, permutations of the contracted index group $k_v$ result in equivalent scalar multiplications due to the symmetry of A and of B. So, in the following algorithm we rewrite (3.2) to sum over only the ordered sets of these indices and scale by an appropriate prefactor.

Algorithm 4.2 ($C = \Psi^{(s,t,v)}(A,B)$ for symmetric A and B). For any fully symmetric contraction $C = (s,t,v)[\star](A,B)$ with symmetric A and B and any operator $\star : R_A \times R_B \to R_C$ (see Definitions 3.7 and 3.8), compute

$$C = \Psi^{(s,t,v)}(A,B) := \Big(\forall i_{s+t} \in [1,n]^{s+t}_\le,\ C_{i_{s+t}} = s!\,t! \sum_{(j_s, l_t) \in \chi(i_{s+t})} \hat\sum_{k_v} A_{j_s k_v} \star B_{k_v l_t}\Big). \qquad (4.1)$$

$\Psi^{(s,t,v)}(A,B)$ employs the summation notation from Definition 3.9 and assumes that elements of $R_C$ can be scaled by positive integers (this can always be done with repeated additions).

Theorem 4.2. When A and B are symmetric, $\Psi^{(s,t,v)}(A,B) = A \odot^v B$.

Proof. We first show that for $C = A \odot^v B$, which is symmetric since A and B are symmetric, the unique elements are given by

$$\forall i_{s+t} \in [1,n]^{s+t}_\le,\quad C_{i_{s+t}} = s!\,t! \sum_{(j_s, l_t) \in \chi(i_{s+t})} \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t}.$$

The reduced form above holds because each index $j_s l_t \in \Pi(i_{s+t})$ is equivalent under permutation of $j_s$ and $l_t$ to an index $(j'_s, l'_t) \in \chi(i_{s+t})$; that is, $\forall j_s l_t \in \Pi(i_{s+t})$, $\exists (j'_s, l'_t) \in \chi(i_{s+t})$ and $\pi_s \in \Pi_s$, $\pi_t \in \Pi_t$ such that $\pi_s(j'_s)\, \pi_t(l'_t) = j_s l_t$. The prefactor $s!\,t!$ arises from the fact that $|\Pi(i_{s+t})| = (s+t)!$ while $|\chi(i_{s+t})| = \binom{s+t}{s} = (s+t)!/(s!\,t!)$, where the difference corresponds to the $s!$ permutations of $j_s$ for each $j_s$ and the $t!$ permutations of $l_t$ for each $l_t$. We note that by Definition 3.3 of $\Pi(i_{s+t})$, the collection contains repeated index tuples when $i_{s+t}$ contains repeated indices. We now note that, due to the symmetry of A and B, $\forall j_s l_t \in [1,n]^{s+t}$, $A_{j_s k_v} \star B_{k_v l_t} = A_{j_s k'_v} \star B_{k'_v l_t}$ whenever there is $\pi_v \in \Pi_v$ such that $\pi_v(k_v) = k'_v$. Thus, applying Lemma 3.1, we conclude that $\forall j_s l_t \in [1,n]^{s+t}$,

$$\sum_{k_v} A_{j_s k_v} \star B_{k_v l_t} = \hat\sum_{k_v} A_{j_s k_v} \star B_{k_v l_t},$$

and further that $\Psi^{(s,t,v)}(A,B) = A \odot^v B$.

Theorem 4.3. The computation cost of the algorithm $\Psi^{(s,t,v)}(A,B)$ for symmetric A and B (ignoring the cost of applying constant scaling factors) is

$$F^\Psi(\varrho,n,s,t,v) = F_{\mathrm{MM}}\Big(\varrho, \binom{n}{s}, \binom{n}{t}, \binom{n}{v}\Big) + \nu^C_\varrho\Big[\binom{n}{s}\binom{n}{t} - \binom{n}{s+t}\Big] = (\nu^C_\varrho + \mu_\varrho)\binom{n}{s}\binom{n}{t}\binom{n}{v} - \nu^C_\varrho\binom{n}{s+t} \approx (\nu^C_\varrho + \mu_\varrho)\,\frac{n^\omega}{s!\,t!\,v!}.$$

Proof. After ignoring the scaling by $s!\,t!$ and by $\rho$ in (4.1) (the latter is implicit due to the use of $\hat\Sigma$, see Definition 3.9), we argue that the computation can be wholly performed via a matrix multiplication followed by a symmetrization step of the result. The scaling by $\rho$ can be applied prior to performing any multiplications to either A or B, and is non-trivial only when $v > 1$ (which yields a trivial result in the antisymmetric case, as discussed later) and only for elements of A or B with repeating indices (approximately a $1/n$ fraction of the tensor elements). So, for simplicity, we ignore this scaling cost. Assuming that A has been prescaled, $\forall j_s k_v,\ \bar A_{j_s k_v} := \rho(k_v)\, A_{j_s k_v}$ (it is also possible to prescale B instead, which is advantageous when $s > t$), the innermost summation in (4.1) corresponds to the computation

$$\forall j_s \in [1,n]^s_\le,\ l_t \in [1,n]^t_\le,\quad \hat C_{j_s l_t} := \sum_{k_v \in [1,n]^v_\le} \bar A_{j_s k_v} \star B_{k_v l_t}.$$

The tensor elements $\hat C_{j_s l_t}$ are partially symmetric: elements of $\hat C$ are symmetric under permutation of indices within $j_s$ and within $l_t$, but not across these two tuples. Therefore, $\hat C$ has $\binom{n}{s}\binom{n}{t}$ unique entries. The computation of $\hat C$ can be done via a matrix multiplication. We define the $\binom{n}{s} \times \binom{n}{t}$ matrix $\hat C$, the $\binom{n}{s} \times \binom{n}{v}$ matrix $\hat A$, and the $\binom{n}{v} \times \binom{n}{t}$ matrix $\hat B$ with elements

$$\hat A_{JK} := \bar A_{j_s k_v} \ \text{where}\ J = \sum_{i=1}^s \binom{j_i}{i} \ \text{and}\ K = \sum_{i=1}^v \binom{k_i}{i},$$
$$\hat B_{KL} := B_{k_v l_t} \ \text{where}\ K = \sum_{i=1}^v \binom{k_i}{i} \ \text{and}\ L = \sum_{i=1}^t \binom{l_i}{i},$$
$$\hat C_{JL} := \hat C_{j_s l_t} \ \text{where}\ J = \sum_{i=1}^s \binom{j_i}{i} \ \text{and}\ L = \sum_{i=1}^t \binom{l_i}{i},$$

so $\hat C$ may be obtained via the matrix multiplication $\hat C = \hat A \hat B$. To obtain the final result C, it suffices to symmetrize,

$$C_{i_{s+t}} = s!\,t! \sum_{(j_s, l_t) \in \chi(i_{s+t})} \hat C_{j_s l_t}.$$

Each element of $\hat C$ contributes to a unique element of C, so ignoring the cost of the scaling by $s!\,t!$, the symmetrization requires $\binom{n}{s}\binom{n}{t} - \binom{n}{s+t}$ additions. Therefore, the total cost of the algorithm is

$$F^\Psi(\varrho,n,s,t,v) = F_{\mathrm{MM}}\Big(\varrho, \binom{n}{s}, \binom{n}{t}, \binom{n}{v}\Big) + \nu^C_\varrho\Big[\binom{n}{s}\binom{n}{t} - \binom{n}{s+t}\Big].$$

Exploiting symmetry within the individual s-, t-, and v-sized index groups (performing $\Psi^{(s,t,v)}(A,B)$) has been common practice in quantum chemistry tensor contraction implementations for decades; for instance, it is explicitly discussed in [14], although the technique was probably employed in earlier coupled-cluster and Møller-Plesset perturbation-theory codes. The consideration and exploitation of such symmetry has also been automated for arbitrary-order tensors on distributed memory parallel computers by the Tensor Contraction Engine (TCE) [15] and Cyclops Tensor Framework [4].
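For reference, a direct (unoptimized) transcription of Algorithm 4.2 for small dense tensors, assuming real arithmetic; here $\chi$ is enumerated positionally via itertools.combinations, and $\rho$ is as in Definition 3.9:

```python
import math
import numpy as np
from itertools import combinations, combinations_with_replacement, permutations

def rho(k):
    r = math.factorial(len(k))
    for x in set(k):
        r //= math.factorial(k.count(x))
    return r

def psi(A, B, n, s, t, v):
    """Direct evaluation Psi^{(s,t,v)} for symmetric A, B: unique
    multiplications only, scaled by rho and s! t! as in (4.1)."""
    C = np.zeros([n] * (s + t))
    for i in combinations_with_replacement(range(n), s + t):  # ordered i_{s+t}
        acc = 0.0
        for jpos in combinations(range(s + t), s):            # (j_s, l_t) in chi(i)
            j = tuple(i[p] for p in jpos)
            l = tuple(i[p] for p in range(s + t) if p not in jpos)
            for k in combinations_with_replacement(range(n), v):  # ordered k_v
                acc += rho(k) * A[j + k] * B[k + l]
        val = math.factorial(s) * math.factorial(t) * acc
        for perm in set(permutations(i)):       # write all symmetric copies
            C[perm] = val
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A += A.T
B = rng.standard_normal((4, 4)); B += B.T
# s = t = v = 1: the result is AB + (AB)^T for symmetric A, B
assert np.allclose(psi(A, B, 4, 1, 1, 1), A @ B + (A @ B).T)
```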

5. Symmetry Preserving Symmetric Tensor Contraction Algorithm

Our main result is a symmetry preserving algorithm that performs any fully symmetric contraction using only $\binom{n+\omega-1}{\omega} \approx n^\omega/\omega!$ multiplications to leading order, which is a factor of $s!\,t!\,v!$ fewer than the direct evaluation method. However, the $\Phi^{(s,t,v)}(A,B)$ algorithm performs more additions per multiplication than the direct evaluation method. The algorithm is based on the idea of computing a symmetric intermediate tensor $\hat Z$ of order $\omega$, with each element corresponding to the result of a scalar multiplication. Due to its symmetry, this tensor has $\binom{n+\omega-1}{\omega}$ unique entries. The tensor $\hat Z$ is accumulated into an order $s+t$ tensor Z, which contains all the terms needed for the output C, but also some extra terms. However, all of these extra terms may be computed and cancelled out with low-order cost. In particular, as we show in the proof of correctness in Section 5.1, the V and W tensors computed in $\Phi^{(s,t,v)}(A,B)$ contain all the terms which separate Z from the C we seek to compute (3.2).

Algorithm 5.1 ($C = \Phi^{(s,t,v)}(A,B)$ for symmetric A and B). For any fully symmetric contraction $C = (s,t,v)[\star](A,B)$ with a distributive operator $\star : R_A \times R_B \to R_C$, compute the unique elements of the symmetric order $\omega$ tensor $\hat Z$,

$$\forall i_\omega \in [1,n]^\omega_\le,\quad \hat Z_{i_\omega} = \Big(\sum_{j_{s+v} \in \chi(i_\omega)} A_{j_{s+v}}\Big) \star \Big(\sum_{l_{t+v} \in \chi(i_\omega)} B_{l_{t+v}}\Big). \qquad (5.1)$$

Accumulate $\hat Z$ into the order $s+t$ symmetric tensor Z,

$$\forall i_{s+t} \in [1,n]^{s+t}_\le,\quad Z_{i_{s+t}} = \hat\sum_{k_v} \hat Z_{i_{s+t}\, k_v}. \qquad (5.2)$$

Define $A^{(0)} = A$ as well as $B^{(0)} = B$, and compute $A^{(p)}$ as well as $B^{(q)}$ for $p, q \in [1,v]$ as follows,

$$\forall p \in [1,v],\ i_{s+v-p} \in [1,n]^{s+v-p},\quad A^{(p)}_{i_{s+v-p}} = \sum_{e_p} A_{i_{s+v-p}\, e_p}, \qquad (5.3)$$
$$\forall q \in [1,v],\ i_{t+v-q} \in [1,n]^{t+v-q},\quad B^{(q)}_{i_{t+v-q}} = \sum_{f_q} B_{f_q\, i_{t+v-q}}. \qquad (5.4)$$

Compute the order $s+t$ symmetric tensor V from the above intermediates,

$$\forall i_{s+t} \in [1,n]^{s+t}_\le,\quad V_{i_{s+t}} = \sum_{r=0}^{v-1} \binom{v}{r} \sum_{p=\max(0,\,v-t-r)}^{v-r} \binom{v-r}{p} \sum_{q=\max(0,\,v-s-r)}^{v-r-p} \binom{v-r-p}{q}\, n^{v-p-q-r}\ \hat\sum_{k_r} \Big(\sum_{j_{s+v-p-r} \in \chi(i_{s+t})} A^{(p)}_{j_{s+v-p-r}\, k_r}\Big) \star \Big(\sum_{l_{t+v-q-r} \in \chi(i_{s+t})} B^{(q)}_{k_r\, l_{t+v-q-r}}\Big). \qquad (5.5)$$

Compute partially symmetric tensors $U^{(r)}$, $r \in [1,\min(s,t)]$, which are of order $s+t-r$ and symmetric within a group of r indices and within another group of $s+t-2r$ indices,

$$\forall r \in [1,\min(s,t)],\ m_r \in [1,n]^r_\le,\ i_{s+t-2r} \in [1,n]^{s+t-2r}_\le,\quad U^{(r)}_{m_r\, i_{s+t-2r}} = \sum_{(j_{s-r},\, l_{t-r}) \in \chi(i_{s+t-2r})} \hat\sum_{k_v} A_{m_r\, j_{s-r}\, k_v} \star B_{k_v\, l_{t-r}\, m_r}. \qquad (5.6)$$

Compute the symmetric order $s+t$ intermediate W from the $U^{(r)}$ tensors,

$$\forall i_{s+t} \in [1,n]^{s+t}_\le,\quad W_{i_{s+t}} = \sum_{r=1}^{\min(s,t)} \sum_{(m_r,\, h_{s+t-2r}) \in \chi(i_{s+t})} U^{(r)}_{m_r\, h_{s+t-2r}}. \qquad (5.7)$$

Lastly, compute the unique elements of the final output of the algorithm, C, via the summation

$$\forall i_{s+t} \in [1,n]^{s+t}_\le,\quad C_{i_{s+t}} = s!\,t!\,\big(Z_{i_{s+t}} - V_{i_{s+t}} - W_{i_{s+t}}\big). \qquad (5.8)$$

We now prove the correctness of the symmetry preserving algorithm, $\Phi^{(s,t,v)}(A,B) = A \odot^v B$, in Section 5.1 and its numerical stability in Section 5.2, and analyze its computation cost as well as its memory footprint in Section 5.3.
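To make the central step concrete, here is a sketch of (5.1) and (5.2) (positional $\chi$ and $\rho$ as in the earlier sketches; the low-order corrections V and W of (5.5)-(5.7) are omitted, so Z still contains the redundant terms):

```python
import math
import numpy as np
from itertools import combinations, combinations_with_replacement

def rho(k):
    r = math.factorial(len(k))
    for x in set(k):
        r //= math.factorial(k.count(x))
    return r

def zhat(A, B, i, s, t, v):
    """One element of Zhat (5.1): a single multiplication of two
    symmetric sums over chi(i_omega)."""
    w = s + t + v
    a = sum(A[tuple(i[p] for p in pos)] for pos in combinations(range(w), s + v))
    b = sum(B[tuple(i[p] for p in pos)] for pos in combinations(range(w), t + v))
    return a * b

def z_accumulate(A, B, n, s, t, v):
    """Z (5.2): accumulate Zhat over ordered k_v with multiplicity rho.
    Caching by sorted index exploits the symmetry of Zhat, so one product
    is done per unique i_omega: C(n + omega - 1, omega) in total."""
    cache = {}
    def zh(i):
        key = tuple(sorted(i))
        if key not in cache:
            cache[key] = zhat(A, B, key, s, t, v)
        return cache[key]
    return {i: sum(rho(k) * zh(i + k)
                   for k in combinations_with_replacement(range(n), v))
            for i in combinations_with_replacement(range(n), s + t)}

# For s = t = v = 1 this reproduces (2.6)-(2.7): C = Z - V - W = AB + BA.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A += A.T
B = rng.standard_normal((3, 3)); B += B.T
Z = z_accumulate(A, B, 3, 1, 1, 1)
A1, B1, U = A.sum(1), B.sum(1), (A * B).sum(1)
C = A @ B + B @ A
for (i, j), z in Z.items():
    V = 3*A[i,j]*B[i,j] + (A1[i]+A1[j])*B[i,j] + A[i,j]*(B1[i]+B1[j])
    assert np.isclose(z - V - (U[i] + U[j]), C[i, j])
```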

5.1. Proof of Correctness

Theorem 5.1. For any symmetric tensors A and B, as well as any distributive operator $\star : R_A \times R_B \to R_C$, $\Phi^{(s,t,v)}(A,B) = (s,t,v)[\star](A,B)$.

Proof. To show algebraic equivalence of the formulation in Algorithm 5.1 to (3.2), we show that V (5.5) and W (5.7) cancel the extra terms included in Z (5.2), leaving exactly the terms needed by (3.2). We first check that $\hat Z$ is symmetric, as asserted by the algorithm. Since A and B are both symmetric, we can show that each operand of the multiplication forming each $\hat Z_{i_\omega}$ stays the same under permutation of any pair of indices $i_p$ and $i_q$ in the $i_\omega$ index group. The terms in the summation forming each operand (we consider A, but the same holds for B) fall into three cases:

- Both indices $i_p$ and $i_q$ appear in the term, in which case the term stays the same, since the tensor is symmetric.

- One of the two indices appears in the term; without loss of generality, let $i_p$ appear and not $i_q$. After permutation we can write this term as $A_{j_{s+v-1}\, i_p}$, where $j_{s+v-1} \subset i_\omega$ and $j_{s+v-1}$ includes neither $i_p$ nor $i_q$. Now, we can assert there is another term in the summation of the form $A_{j_{s+v-1}\, i_q}$, since $\chi(i_\omega)$ yields all possible ordered subsets of $i_\omega$, which must include an ordered index set containing the distinct indices $j_{s+v-1}\, i_q$. Permutation of $i_p$ and $i_q$ leads to these two terms being swapped.

- If neither of the two indices $i_p$ and $i_q$ appears, the term stays the same trivially.

Thus $\hat Z$ is symmetric. Further, since the operands in the summations computing Z (5.2), as well as all $A^{(p)}$ (5.3) and $B^{(q)}$ (5.4), are symmetric, so are these tensors. Each $U^{(r)}$ (5.6) is partially symmetric, since any permutation of indices inside $m_r$ is reflected in both operands of each matrix multiplication, while the partition $(j_{s-r}, l_{t-r}) \in \chi(i_{s+t-2r})$ ensures that any permutation of indices within $i_{s+t-2r}$ either leads to a permutation of indices only within A or only within B, or swaps two terms (similar to what occurs for $\hat Z$, as explained in detail above). Also, by a similar argument as given for $\hat Z$, we can conclude that W and V are symmetric.

Having demonstrated the symmetry of the tensors, we now note that each summand of each $\hat\Sigma$ in $\Phi^{(s,t,v)}(A,B)$ is symmetric over the set of indices being summed, so, using Lemma 3.1, we replace each $\hat\Sigma$ by a regular $\Sigma$ in the rest of the correctness proof. This swap allows us to proceed paying no further attention to tensor symmetry and assuming the full tensors are being computed, since we have already shown that all the unique elements of the symmetric tensors are computed.

There are a total of $n^v \binom{\omega}{s+v}\binom{\omega}{t+v}$ products $A_{j_{s+v}} \star B_{l_{t+v}}$ contributing to each $Z_{i_{s+t}}$ in the expansion of (5.2) with $\hat\Sigma$ replaced by $\Sigma$. We partition these subterms constituting Z according to which of the $k_v$ indices (based on the subscript m of each $k_m$, rather than the value that $k_m$ takes on) show up in the $j_{s+v}$ indices of the A operand and the $l_{t+v}$ indices of the B operand:

- $x_a \in \chi(k_v)$: the a-sized subset of $k_v$ chosen by $j_{s+v} \in \chi(i_{s+t}\, k_v)$ to appear in the A operand,
- $y_b \in \chi(k_v)$: the b-sized subset of $k_v$ chosen by $l_{t+v} \in \chi(i_{s+t}\, k_v)$ to appear in the B operand.

Now we consider all possible cases of overlap of the indices appearing in $x_a$ and $y_b$:

- let $d_r \in \chi(k_v)$, for $r \le \min(a,b) \le v$, be the subset of $k_v$ that appears in both $x_a$ and $y_b$,
- let $e_p \in \chi(k_v)$, for $p = a - r$, be the subset of $k_v$ that appears exclusively in $x_a$ and not in $y_b$,
- let $f_q \in \chi(k_v)$, for $q = b - r$, be the subset of $k_v$ that appears exclusively in $y_b$ and not in $x_a$,
- let $g_{v-p-q-r} \in \chi(k_v)$ be the remaining subset of $k_v$, which appears in neither $x_a$ nor $y_b$.
Using the distributivity of $\star$, we now partition Z according to the above partitions of $k_v$: $\forall i_{s+t} \in [1,n]^{s+t}$,

$$Z_{i_{s+t}} = \sum_{k_v} \sum_{r=0}^{v} \sum_{(d_r,\, u_{v-r}) \in \chi(k_v)} \sum_{p=\max(0,\,v-t-r)}^{v-r} \sum_{(e_p,\, w_{v-p-r}) \in \chi(u_{v-r})} \sum_{q=\max(0,\,v-s-r)}^{v-p-r} \sum_{(f_q,\, g_{v-p-q-r}) \in \chi(w_{v-p-r})} \Big(\sum_{j_{s+v-p-r} \in \chi(i_{s+t})} A_{j_{s+v-p-r}\, d_r\, e_p}\Big) \star \Big(\sum_{l_{t+v-q-r} \in \chi(i_{s+t})} B_{f_q\, d_r\, l_{t+v-q-r}}\Big).$$

We note the fact that the index variables in $k_v$ are indistinguishable (i.e. they are dummy summation variables), which implies that for given r, p, q, no matter how $k_v$ is partitioned amongst $d_r$, $e_p$, $f_q$, and $g_{v-p-q-r}$, the result is the same. This allows us to replace $k_v$ and its partitions with new sets of variables and to multiply by prefactors corresponding to the number of selections possible for each partition: $\forall i_{s+t} \in [1,n]^{s+t}$,

$$Z_{i_{s+t}} = \sum_{r=0}^{v} \binom{v}{r} \sum_{p=\max(0,\,v-t-r)}^{v-r} \binom{v-r}{p} \sum_{q=\max(0,\,v-s-r)}^{v-r-p} \binom{v-r-p}{q} \sum_{d_r} \sum_{e_p} \sum_{f_q} \sum_{g_{v-p-q-r}} \Big(\sum_{j_{s+v-p-r} \in \chi(i_{s+t})} A_{j_{s+v-p-r}\, d_r\, e_p}\Big) \star \Big(\sum_{l_{t+v-q-r} \in \chi(i_{s+t})} B_{f_q\, d_r\, l_{t+v-q-r}}\Big).$$

Since B is independent of $e_p$ and A is independent of $f_q$, we can use distributivity to bring these summations inside the parentheses. Further, since both A and B are independent of the indices $g_{v-p-q-r}$, we can replace that summation with a prefactor of $n^{v-p-q-r}$ (the size of the range of $g_{v-p-q-r}$): $\forall i_{s+t} \in [1,n]^{s+t}$,

$$Z_{i_{s+t}} = \sum_{r=0}^{v} \binom{v}{r} \sum_{p=\max(0,\,v-t-r)}^{v-r} \binom{v-r}{p} \sum_{q=\max(0,\,v-s-r)}^{v-r-p} \binom{v-r-p}{q}\, n^{v-p-q-r} \sum_{d_r} \Big(\sum_{j_{s+v-p-r} \in \chi(i_{s+t})} \sum_{e_p} A_{j_{s+v-p-r}\, d_r\, e_p}\Big) \star \Big(\sum_{l_{t+v-q-r} \in \chi(i_{s+t})} \sum_{f_q} B_{f_q\, d_r\, l_{t+v-q-r}}\Big).$$

Isolating the term with $r = v$, $p = 0$, $q = 0$, and substituting in $A^{(p)}$ and $B^{(q)}$, allows us to extract the V term (5.5): $\forall i_{s+t} \in [1,n]^{s+t}$,

$$Z_{i_{s+t}} = V_{i_{s+t}} + \sum_{k_v} \Big(\sum_{j_s \in \chi(i_{s+t})} A_{j_s k_v}\Big) \star \Big(\sum_{l_t \in \chi(i_{s+t})} B_{k_v l_t}\Big).$$

We have now isolated, in the latter term of the equation above, the subterms where both A and B are contracted over the desired full set of $k_v$ indices. However, this latter term still contains subterms where the same outer (noncontracted) indices appear in both A and B, which we want to discard, since (3.2) contains only disjoint partitions of the $i_{s+t}$ index set among A and B. We obtain W (5.7) by enumerating all such remaining unwanted terms by the number of outer indices $r \in [1,\min(s,t)]$ which appear in both operands A and B, then summing over the disjoint partitions of the remaining $s+t-2r$ indices into $s-r$ indices which appear exclusively in the A term and $t-r$ indices which appear exclusively in the B term. This leaves r indices which appear in neither A nor B, which is the reason that the term W can be computed in low order via U (5.6). So now, $\forall i_{s+t} \in [1,n]^{s+t}$, we have

$$Z_{i_{s+t}} = V_{i_{s+t}} + \sum_{(j_s, l_t) \in \chi(i_{s+t})} \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t} + \sum_{r=1}^{\min(s,t)} \sum_{(m_r,\, h_{s+t-2r}) \in \chi(i_{s+t})} \sum_{(j_{s-r},\, l_{t-r}) \in \chi(h_{s+t-2r})} \sum_{k_v} A_{m_r\, j_{s-r}\, k_v} \star B_{k_v\, l_{t-r}\, m_r}$$
$$= V_{i_{s+t}} + \sum_{(j_s, l_t) \in \chi(i_{s+t})} \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t} + W_{i_{s+t}}.$$

Plugging this result into (5.8), we obtain: $\forall i_{s+t} \in [1,n]^{s+t}_\le$,

$$C_{i_{s+t}} = s!\,t!\,\big(Z_{i_{s+t}} - V_{i_{s+t}} - W_{i_{s+t}}\big) = s!\,t! \sum_{(j_s, l_t) \in \chi(i_{s+t})} \sum_{k_v} A_{j_s k_v} \star B_{k_v l_t}.$$

We can now simply recognize that the above formula for $C_{i_{s+t}}$ is equivalent to (4.1), i.e. to what is computed by $\Psi^{(s,t,v)}(A,B)$, which is the same as $A \odot^v B$ by Theorem 4.2. Therefore, $\Phi^{(s,t,v)}(A,B) = A \odot^v B$.

5.2. Numerical Stability Analysis

We derive error bounds in terms of $\gamma_n = \frac{n\epsilon}{1 - n\epsilon}$, where $\epsilon$ is the machine precision. We assume the errors made in scalar additions, subtractions, and multiplications to be bounded equivalently: $|\mathrm{fl}(a \cdot b) - a \cdot b| \le |a \cdot b|\,\epsilon$ and $|\mathrm{fl}(a \pm b) - (a \pm b)| \le |a \pm b|\,\epsilon$. We use the norm $\|X\|_{\max}$, which is the largest magnitude element of X. The forward error bound for the direct evaluation symmetric contraction algorithm $\Psi^{(s,t,v)}(A,B)$ arises directly from matrix multiplication, where $n^v/v! + O(n^{v-1})$ scalar intermediates contribute to each entry of the output, with an extra factor of $\binom{s+t}{s}$ incurred from the symmetrization of the partially symmetric intermediate $\hat C$ (the result of the matrix multiplication). The bound for $\|C\|_{\max}$ grows by the same factor as the error, yielding the overall error bound

$$\|\mathrm{fl}(\Psi^{(s,t,v)}(A,B)) - A \odot^v B\|_{\max} \le \gamma_m\, m\, \|A\|_{\max} \|B\|_{\max}, \quad \text{where}\ m = \frac{n^v}{v!}\binom{s+t}{s} + O(n^{v-1}).$$

To bound the error of the fast algorithm, we start with a lemma which simplifies the error analysis of multiple intermediate tensors.

Lemma 5.2. Consider a computation of a scalar from an l-by-m matrix A and a k-by-m matrix B of the form

$$c = \sum_{g=1}^m \Big(\sum_{i=1}^l A_{ig}\Big) \cdot \Big(\sum_{j=1}^k B_{jg}\Big),$$

where $\|A\|_{\max} \le \alpha$ and $\|B\|_{\max} \le \beta$. The floating point error of c is bounded by

$$|\mathrm{fl}(c) - c| \le \gamma_{m+l+k}\, mlk\,\alpha\beta + O(\epsilon^2).$$

Proof. The magnitude of the exact value $a_g = \sum_{i=1}^l A_{ig}$ is at most $l\alpha$, so the floating point error incurred in computing it is bounded by $\gamma_l\, l\alpha$. Similarly, for $b_g = \sum_{j=1}^k B_{jg}$, the floating point error is at most $\gamma_k\, k\beta$. Therefore, we can obtain a bound on the floating point error of c via

$$\mathrm{fl}(c) - c = \mathrm{fl}\Big(\sum_{g=1}^m (a_g + \delta^A_g\, l\alpha)\,(b_g + \delta^B_g\, k\beta)\Big) - c, \quad \text{where}\ |\delta^A_g| \le \gamma_l \ \text{and}\ |\delta^B_g| \le \gamma_k,$$
$$|\mathrm{fl}(c) - c| \le \gamma_m\, m\,l\alpha\,k\beta + m\,\gamma_l\, l\alpha\, k\beta + m\, l\alpha\, \gamma_k\, k\beta + O(\epsilon^2) \le \gamma_{m+l+k}\, mlk\,\alpha\beta + O(\epsilon^2).$$

The following theorem bounds the forward error of the computation done by the symmetry preserving tensor contraction algorithm. The bound for the symmetry preserving algorithm is higher than for the direct evaluation algorithm, due to errors accumulated in the computation of the intermediate terms. In the theorem and derivation, we employ the convention that $\binom{n'}{n} = 0$ for any $n > n'$.
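As an informal complement to these bounds, one can compare the two symv algorithms numerically; a hedged sketch (single precision inputs against a double precision reference, one random input; this illustrates typical behavior and does not verify the bounds):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n)); A = (A + A.T).astype(np.float32)
b = rng.standard_normal(n).astype(np.float32)

direct = A @ b                                       # Psi^{(1,0,1)}
Z = (A * (b[:, None] + b[None, :])).sum(axis=1)      # Zhat row sums (2.2)
phi = Z - A.sum(axis=1) * b                          # c = Z - V
ref = A.astype(np.float64) @ b.astype(np.float64)    # reference

print(np.abs(direct - ref).max(), np.abs(phi - ref).max())
```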


LINEAR ALGEBRA REVIEW

LINEAR ALGEBRA REVIEW LINEAR ALGEBRA REVIEW JC Stuff you should know for the exam. 1. Basics on vector spaces (1) F n is the set of all n-tuples (a 1,... a n ) with a i F. It forms a VS with the operations of + and scalar multiplication

More information

Chapter 2. Linear Algebra. rather simple and learning them will eventually allow us to explain the strange results of

Chapter 2. Linear Algebra. rather simple and learning them will eventually allow us to explain the strange results of Chapter 2 Linear Algebra In this chapter, we study the formal structure that provides the background for quantum mechanics. The basic ideas of the mathematical machinery, linear algebra, are rather simple

More information

Review of linear algebra

Review of linear algebra Review of linear algebra 1 Vectors and matrices We will just touch very briefly on certain aspects of linear algebra, most of which should be familiar. Recall that we deal with vectors, i.e. elements of

More information

The Dirac Field. Physics , Quantum Field Theory. October Michael Dine Department of Physics University of California, Santa Cruz

The Dirac Field. Physics , Quantum Field Theory. October Michael Dine Department of Physics University of California, Santa Cruz Michael Dine Department of Physics University of California, Santa Cruz October 2013 Lorentz Transformation Properties of the Dirac Field First, rotations. In ordinary quantum mechanics, ψ σ i ψ (1) is

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra The two principal problems in linear algebra are: Linear system Given an n n matrix A and an n-vector b, determine x IR n such that A x = b Eigenvalue problem Given an n n matrix

More information

Linear Algebra Tutorial for Math3315/CSE3365 Daniel R. Reynolds

Linear Algebra Tutorial for Math3315/CSE3365 Daniel R. Reynolds Linear Algebra Tutorial for Math3315/CSE3365 Daniel R. Reynolds These notes are meant to provide a brief introduction to the topics from Linear Algebra that will be useful in Math3315/CSE3365, Introduction

More information

Low Rank Bilinear Algorithms for Symmetric Tensor Contractions

Low Rank Bilinear Algorithms for Symmetric Tensor Contractions Low Rank Bilinear Algorithms for Symmetric Tensor Contractions Edgar Solomonik ETH Zurich SIAM PP, Paris, France April 14, 2016 1 / 14 Fast Algorithms for Symmetric Tensor Contractions Tensor contractions

More information

Notes on vectors and matrices

Notes on vectors and matrices Notes on vectors and matrices EE103 Winter Quarter 2001-02 L Vandenberghe 1 Terminology and notation Matrices, vectors, and scalars A matrix is a rectangular array of numbers (also called scalars), written

More information

Linear Algebra: Lecture notes from Kolman and Hill 9th edition.

Linear Algebra: Lecture notes from Kolman and Hill 9th edition. Linear Algebra: Lecture notes from Kolman and Hill 9th edition Taylan Şengül March 20, 2019 Please let me know of any mistakes in these notes Contents Week 1 1 11 Systems of Linear Equations 1 12 Matrices

More information

Algorithms as multilinear tensor equations

Algorithms as multilinear tensor equations Algorithms as multilinear tensor equations Edgar Solomonik Department of Computer Science ETH Zurich Technische Universität München 18.1.2016 Edgar Solomonik Algorithms as multilinear tensor equations

More information

A primer on matrices

A primer on matrices A primer on matrices Stephen Boyd August 4, 2007 These notes describe the notation of matrices, the mechanics of matrix manipulation, and how to use matrices to formulate and solve sets of simultaneous

More information

Review Questions REVIEW QUESTIONS 71

Review Questions REVIEW QUESTIONS 71 REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 3 Review of Matrix Algebra Vectors and matrices are essential for modern analysis of systems of equations algebrai, differential, functional, etc In this

More information

Appendix C Vector and matrix algebra

Appendix C Vector and matrix algebra Appendix C Vector and matrix algebra Concepts Scalars Vectors, rows and columns, matrices Adding and subtracting vectors and matrices Multiplying them by scalars Products of vectors and matrices, scalar

More information

I = i 0,

I = i 0, Special Types of Matrices Certain matrices, such as the identity matrix 0 0 0 0 0 0 I = 0 0 0, 0 0 0 have a special shape, which endows the matrix with helpful properties The identity matrix is an example

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

Determinants - Uniqueness and Properties

Determinants - Uniqueness and Properties Determinants - Uniqueness and Properties 2-2-2008 In order to show that there s only one determinant function on M(n, R), I m going to derive another formula for the determinant It involves permutations

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations Linear Algebra (part ) : Matrices and Systems of Linear Equations (by Evan Dummit, 207, v 260) Contents Matrices and Systems of Linear Equations Systems of Linear Equations Elimination, Matrix Formulation

More information

CS 246 Review of Linear Algebra 01/17/19

CS 246 Review of Linear Algebra 01/17/19 1 Linear algebra In this section we will discuss vectors and matrices. We denote the (i, j)th entry of a matrix A as A ij, and the ith entry of a vector as v i. 1.1 Vectors and vector operations A vector

More information

Basic Concepts in Linear Algebra

Basic Concepts in Linear Algebra Basic Concepts in Linear Algebra Grady B Wright Department of Mathematics Boise State University February 2, 2015 Grady B Wright Linear Algebra Basics February 2, 2015 1 / 39 Numerical Linear Algebra Linear

More information

Notes on Householder QR Factorization

Notes on Householder QR Factorization Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem

More information

. =. a i1 x 1 + a i2 x 2 + a in x n = b i. a 11 a 12 a 1n a 21 a 22 a 1n. i1 a i2 a in

. =. a i1 x 1 + a i2 x 2 + a in x n = b i. a 11 a 12 a 1n a 21 a 22 a 1n. i1 a i2 a in Vectors and Matrices Continued Remember that our goal is to write a system of algebraic equations as a matrix equation. Suppose we have the n linear algebraic equations a x + a 2 x 2 + a n x n = b a 2

More information

Review of Basic Concepts in Linear Algebra

Review of Basic Concepts in Linear Algebra Review of Basic Concepts in Linear Algebra Grady B Wright Department of Mathematics Boise State University September 7, 2017 Math 565 Linear Algebra Review September 7, 2017 1 / 40 Numerical Linear Algebra

More information

7 Matrix Operations. 7.0 Matrix Multiplication + 3 = 3 = 4

7 Matrix Operations. 7.0 Matrix Multiplication + 3 = 3 = 4 7 Matrix Operations Copyright 017, Gregory G. Smith 9 October 017 The product of two matrices is a sophisticated operations with a wide range of applications. In this chapter, we defined this binary operation,

More information

1 Matrices and Systems of Linear Equations

1 Matrices and Systems of Linear Equations March 3, 203 6-6. Systems of Linear Equations Matrices and Systems of Linear Equations An m n matrix is an array A = a ij of the form a a n a 2 a 2n... a m a mn where each a ij is a real or complex number.

More information

Linear Algebra Review

Linear Algebra Review Chapter 1 Linear Algebra Review It is assumed that you have had a beginning course in linear algebra, and are familiar with matrix multiplication, eigenvectors, etc I will review some of these terms here,

More information

Citation for published version (APA): Halbersma, R. S. (2002). Geometry of strings and branes. Groningen: s.n.

Citation for published version (APA): Halbersma, R. S. (2002). Geometry of strings and branes. Groningen: s.n. University of Groningen Geometry of strings and branes Halbersma, Reinder Simon IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please

More information

PHYS 705: Classical Mechanics. Rigid Body Motion Introduction + Math Review

PHYS 705: Classical Mechanics. Rigid Body Motion Introduction + Math Review 1 PHYS 705: Classical Mechanics Rigid Body Motion Introduction + Math Review 2 How to describe a rigid body? Rigid Body - a system of point particles fixed in space i r ij j subject to a holonomic constraint:

More information

William Stallings Copyright 2010

William Stallings Copyright 2010 A PPENDIX E B ASIC C ONCEPTS FROM L INEAR A LGEBRA William Stallings Copyright 2010 E.1 OPERATIONS ON VECTORS AND MATRICES...2 Arithmetic...2 Determinants...4 Inverse of a Matrix...5 E.2 LINEAR ALGEBRA

More information

Systems of Linear Equations and Matrices

Systems of Linear Equations and Matrices Chapter 1 Systems of Linear Equations and Matrices System of linear algebraic equations and their solution constitute one of the major topics studied in the course known as linear algebra. In the first

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Cartesian Tensors. e 2. e 1. General vector (formal definition to follow) denoted by components

Cartesian Tensors. e 2. e 1. General vector (formal definition to follow) denoted by components Cartesian Tensors Reference: Jeffreys Cartesian Tensors 1 Coordinates and Vectors z x 3 e 3 y x 2 e 2 e 1 x x 1 Coordinates x i, i 123,, Unit vectors: e i, i 123,, General vector (formal definition to

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Systems of Linear Equations and Matrices

Systems of Linear Equations and Matrices Chapter 1 Systems of Linear Equations and Matrices System of linear algebraic equations and their solution constitute one of the major topics studied in the course known as linear algebra. In the first

More information

Direct Methods for Solving Linear Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le

Direct Methods for Solving Linear Systems. Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le Direct Methods for Solving Linear Systems Simon Fraser University Surrey Campus MACM 316 Spring 2005 Instructor: Ha Le 1 Overview General Linear Systems Gaussian Elimination Triangular Systems The LU Factorization

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Second Online Version, December 1998 Comments to the author at krm@maths.uq.edu.au Contents 1 LINEAR EQUATIONS

More information

Vectors and matrices: matrices (Version 2) This is a very brief summary of my lecture notes.

Vectors and matrices: matrices (Version 2) This is a very brief summary of my lecture notes. Vectors and matrices: matrices (Version 2) This is a very brief summary of my lecture notes Matrices and linear equations A matrix is an m-by-n array of numbers A = a 11 a 12 a 13 a 1n a 21 a 22 a 23 a

More information

Review of Vectors and Matrices

Review of Vectors and Matrices A P P E N D I X D Review of Vectors and Matrices D. VECTORS D.. Definition of a Vector Let p, p, Á, p n be any n real numbers and P an ordered set of these real numbers that is, P = p, p, Á, p n Then P

More information

Linear Algebra. Linear Equations and Matrices. Copyright 2005, W.R. Winfrey

Linear Algebra. Linear Equations and Matrices. Copyright 2005, W.R. Winfrey Copyright 2005, W.R. Winfrey Topics Preliminaries Systems of Linear Equations Matrices Algebraic Properties of Matrix Operations Special Types of Matrices and Partitioned Matrices Matrix Transformations

More information

Math 123, Week 2: Matrix Operations, Inverses

Math 123, Week 2: Matrix Operations, Inverses Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices

More information

Notes on Mathematics

Notes on Mathematics Notes on Mathematics - 12 1 Peeyush Chandra, A. K. Lal, V. Raghavendra, G. Santhanam 1 Supported by a grant from MHRD 2 Contents I Linear Algebra 7 1 Matrices 9 1.1 Definition of a Matrix......................................

More information

Matrices. 1 a a2 1 b b 2 1 c c π

Matrices. 1 a a2 1 b b 2 1 c c π Matrices 2-3-207 A matrix is a rectangular array of numbers: 2 π 4 37 42 0 3 a a2 b b 2 c c 2 Actually, the entries can be more general than numbers, but you can think of the entries as numbers to start

More information

NOTES on LINEAR ALGEBRA 1

NOTES on LINEAR ALGEBRA 1 School of Economics, Management and Statistics University of Bologna Academic Year 207/8 NOTES on LINEAR ALGEBRA for the students of Stats and Maths This is a modified version of the notes by Prof Laura

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K. R. MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND Corrected Version, 7th April 013 Comments to the author at keithmatt@gmail.com Chapter 1 LINEAR EQUATIONS 1.1

More information

Elementary maths for GMT

Elementary maths for GMT Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1

More information

Chapter 1 Matrices and Systems of Equations

Chapter 1 Matrices and Systems of Equations Chapter 1 Matrices and Systems of Equations System of Linear Equations 1. A linear equation in n unknowns is an equation of the form n i=1 a i x i = b where a 1,..., a n, b R and x 1,..., x n are variables.

More information

Definition 2.3. We define addition and multiplication of matrices as follows.

Definition 2.3. We define addition and multiplication of matrices as follows. 14 Chapter 2 Matrices In this chapter, we review matrix algebra from Linear Algebra I, consider row and column operations on matrices, and define the rank of a matrix. Along the way prove that the row

More information

Introduction to Group Theory

Introduction to Group Theory Chapter 10 Introduction to Group Theory Since symmetries described by groups play such an important role in modern physics, we will take a little time to introduce the basic structure (as seen by a physicist)

More information

Foundations of Matrix Analysis

Foundations of Matrix Analysis 1 Foundations of Matrix Analysis In this chapter we recall the basic elements of linear algebra which will be employed in the remainder of the text For most of the proofs as well as for the details, the

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Logistics Notes for 2016-09-14 1. There was a goof in HW 2, problem 1 (now fixed) please re-download if you have already started looking at it. 2. CS colloquium (4:15 in Gates G01) this Thurs is Margaret

More information

Math Camp II. Basic Linear Algebra. Yiqing Xu. Aug 26, 2014 MIT

Math Camp II. Basic Linear Algebra. Yiqing Xu. Aug 26, 2014 MIT Math Camp II Basic Linear Algebra Yiqing Xu MIT Aug 26, 2014 1 Solving Systems of Linear Equations 2 Vectors and Vector Spaces 3 Matrices 4 Least Squares Systems of Linear Equations Definition A linear

More information

MTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics

MTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics MTH 309 Supplemental Lecture Notes Based on Robert Messer, Linear Algebra Gateway to Mathematics Ulrich Meierfrankenfeld Department of Mathematics Michigan State University East Lansing MI 48824 meier@math.msu.edu

More information

Chapter 4 - MATRIX ALGEBRA. ... a 2j... a 2n. a i1 a i2... a ij... a in

Chapter 4 - MATRIX ALGEBRA. ... a 2j... a 2n. a i1 a i2... a ij... a in Chapter 4 - MATRIX ALGEBRA 4.1. Matrix Operations A a 11 a 12... a 1j... a 1n a 21. a 22.... a 2j... a 2n. a i1 a i2... a ij... a in... a m1 a m2... a mj... a mn The entry in the ith row and the jth column

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

Chapter 2: Matrix Algebra

Chapter 2: Matrix Algebra Chapter 2: Matrix Algebra (Last Updated: October 12, 2016) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). Write A = 1. Matrix operations [a 1 a n. Then entry

More information

MTH 464: Computational Linear Algebra

MTH 464: Computational Linear Algebra MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University February 6, 2018 Linear Algebra (MTH

More information

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1 1 Rows first, columns second. Remember that. R then C. 1 A matrix is a set of real or complex numbers arranged in a rectangular array. They can be any size and shape (provided they are rectangular). A

More information

Matrices and Vectors

Matrices and Vectors Matrices and Vectors James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 11, 2013 Outline 1 Matrices and Vectors 2 Vector Details 3 Matrix

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

VECTOR AND TENSOR ALGEBRA

VECTOR AND TENSOR ALGEBRA PART I VECTOR AND TENSOR ALGEBRA Throughout this book: (i) Lightface Latin and Greek letters generally denote scalars. (ii) Boldface lowercase Latin and Greek letters generally denote vectors, but the

More information

Mathematical Foundations of Quantum Mechanics

Mathematical Foundations of Quantum Mechanics Mathematical Foundations of Quantum Mechanics 2016-17 Dr Judith A. McGovern Maths of Vector Spaces This section is designed to be read in conjunction with chapter 1 of Shankar s Principles of Quantum Mechanics,

More information

Lecture Notes in Linear Algebra

Lecture Notes in Linear Algebra Lecture Notes in Linear Algebra Dr. Abdullah Al-Azemi Mathematics Department Kuwait University February 4, 2017 Contents 1 Linear Equations and Matrices 1 1.2 Matrices............................................

More information

Vectors and Matrices Notes.

Vectors and Matrices Notes. Vectors and Matrices Notes Jonathan Coulthard JonathanCoulthard@physicsoxacuk 1 Index Notation Index notation may seem quite intimidating at first, but once you get used to it, it will allow us to prove

More information

Lecture Note 2: The Gaussian Elimination and LU Decomposition

Lecture Note 2: The Gaussian Elimination and LU Decomposition MATH 5330: Computational Methods of Linear Algebra Lecture Note 2: The Gaussian Elimination and LU Decomposition The Gaussian elimination Xianyi Zeng Department of Mathematical Sciences, UTEP The method

More information

Chem 3502/4502 Physical Chemistry II (Quantum Mechanics) 3 Credits Fall Semester 2006 Christopher J. Cramer. Lecture 5, January 27, 2006

Chem 3502/4502 Physical Chemistry II (Quantum Mechanics) 3 Credits Fall Semester 2006 Christopher J. Cramer. Lecture 5, January 27, 2006 Chem 3502/4502 Physical Chemistry II (Quantum Mechanics) 3 Credits Fall Semester 2006 Christopher J. Cramer Lecture 5, January 27, 2006 Solved Homework (Homework for grading is also due today) We are told

More information

Incompatibility Paradoxes

Incompatibility Paradoxes Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of

More information

Appendix A: Matrices

Appendix A: Matrices Appendix A: Matrices A matrix is a rectangular array of numbers Such arrays have rows and columns The numbers of rows and columns are referred to as the dimensions of a matrix A matrix with, say, 5 rows

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Final Project # 5 The Cartan matrix of a Root System

Final Project # 5 The Cartan matrix of a Root System 8.099 Final Project # 5 The Cartan matrix of a Root System Thomas R. Covert July 30, 00 A root system in a Euclidean space V with a symmetric positive definite inner product, is a finite set of elements

More information