Strassen-like algorithms for symmetric tensor contractions


1 Strassen-like algorithms for symmetric tensor contractions
Edgar Solomonik, University of Illinois at Urbana-Champaign
Scientific and Statistical Computing Seminar, University of Chicago, April 13
1 / 27 Fast symmetric tensor contractions

2 Outline
1. Introduction
2. Applications of tensor symmetry
3. Exploiting symmetry in matrix products
4. Exploiting symmetry in tensor contractions
5. Numerical error analysis
6. Exploiting partial-symmetry in tensor contractions
7. Communication cost analysis
8. Summary and conclusion

3 Terminology
A tensor $T \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ has
- order $d$ (i.e. $d$ modes / indices),
- dimensions $n_1$-by-$\cdots$-by-$n_d$ (in this talk, usually each $n_i = n$),
- elements $T_{i_1 \ldots i_d} = T_{\mathbf{i}}$ where $\mathbf{i} \in \{1, \ldots, n\}^d$.

We say a tensor is symmetric if for any pair of modes $j, k \in \{1, \ldots, d\}$
\[ T_{i_1 \ldots i_j \ldots i_k \ldots i_d} = T_{i_1 \ldots i_k \ldots i_j \ldots i_d}. \]
A tensor is antisymmetric (skew-symmetric) if for any $j, k \in \{1, \ldots, d\}$
\[ T_{i_1 \ldots i_j \ldots i_k \ldots i_d} = (-1)\, T_{i_1 \ldots i_k \ldots i_j \ldots i_d}. \]
A tensor is partially-symmetric if such index interchanges are restricted to be within subsets of the modes $\{1, \ldots, d\}$, e.g.
\[ T^{ij}_{kl} = T^{ji}_{kl} = T^{ji}_{lk} = T^{ij}_{lk}. \]

4 Tensor contractions
We work with contractions of tensors $A$ of order $s + v$ and $B$ of order $v + t$ into $C$ of order $s + t$, defined as
\[ C_{\mathbf{ij}} = \sum_{\mathbf{k} \in \{1, \ldots, n\}^v} A_{\mathbf{ik}} \, B_{\mathbf{kj}}. \]
This contraction
- requires $O(n^{\omega})$ multiplications and additions, where $\omega = s + t + v$,
- assumes an index ordering, but does not lose generality,
- works with any symmetries of $A$ and $B$,
- is extensible to symmetries of $C$ via symmetrization (sum over all permutations of modes in $C$, denoted $[C]_{\mathbf{ij}}$),
- generalizes simple matrix operations, e.g.
  - $(s, t, v) = (1, 0, 1)$: matrix-vector product,
  - $(s, t, v) = (1, 1, 0)$: vector outer product,
  - $(s, t, v) = (1, 1, 1)$: matrix-matrix product.
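The contraction definition on this slide can be spelled out as a minimal sketch for tiny sizes. The function name `contract` and the dict-of-index-tuples storage are assumptions made for illustration only; they are not taken from the talk, and the code ignores symmetry entirely.

```python
from itertools import product

def contract(A, B, n, s, t, v):
    """Nonsymmetric contraction C[i + j] = sum_k A[i + k] * B[k + j].

    Tensors are dicts keyed by index tuples (hypothetical storage choice):
    i has length s, j has length t, and k ranges over {0..n-1}^v.
    """
    C = {}
    for i in product(range(n), repeat=s):
        for j in product(range(n), repeat=t):
            # For v = 0 the inner product() yields a single empty tuple,
            # so the sum reduces to the one product A[i] * B[j].
            C[i + j] = sum(A[i + k] * B[k + j]
                           for k in product(range(n), repeat=v))
    return C

n = 3
A = {(i, k): i + 2 * k + 1 for i in range(n) for k in range(n)}  # order 2
b = {(k,): k + 1 for k in range(n)}                               # order 1

# (s, t, v) = (1, 0, 1): matrix-vector product.
c = contract(A, b, n, s=1, t=0, v=1)

# (s, t, v) = (1, 1, 0): vector outer product.
outer = contract(b, b, n, s=1, t=1, v=0)
```

The triple loop makes the $n^{s+t+v}$ multiplication count visible: one product per combination of $\mathbf{i}$, $\mathbf{j}$ and $\mathbf{k}$.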

5 Applications of symmetric tensor contractions
Symmetric and Hermitian matrix operations are part of the BLAS:
- matrix-vector products: symv (symm), hemv (hemm),
- symmetrized outer product: syr2 (syr2k), her2 (her2k),
- these operations dominate symmetric/Hermitian diagonalization.

Hankel matrices are order $2\log_2(n)$ partially-symmetric tensors,
\[ H = \begin{bmatrix} H_{11} & H_{21}^T \\ H_{21} & H_{22} \end{bmatrix}, \]
where $H_{11}$, $H_{21}$, $H_{22}$ are also Hankel.

In general, partially-symmetric tensors are nested symmetric tensors:
- a nonsymmetric matrix is a vector of vectors,
- $T^{ij}_{kl} = T^{ji}_{kl} = T^{ji}_{lk} = T^{ij}_{lk}$ is a symmetric matrix of symmetric matrices.

6 Applications of partially-symmetric tensor contractions
High-accuracy methods in computational quantum chemistry
- solve the multi-electron Schrödinger equation $H \Psi = E \Psi$, where $H$ is a linear operator, but $\Psi$ is a function of all electrons,
- use wavefunction ansätze like $\Psi \approx \Psi^{(k)} = e^{T^{(k)}} \Psi^{(k-1)}$, where $\Psi^{(0)}$ is a determinant function and $T^{(k)}$ is an order $2k$ tensor, acting as a multilinear excitation operator on the electrons,
- are most commonly versions of coupled-cluster methods, which use the above ansätze for $k \in \{2, 3, 4\}$ (CCSD, CCSDT, CCSDTQ),
- solve iteratively for $T^{(k)}$, where each iteration has cost $O(n^{2k+2})$, dominated by contractions of partially antisymmetric tensors.

For example, a dominant contraction in CCSD ($k = 2$) is
\[ Z^{ak}_{ic} = \sum_{b=1}^{n} \sum_{j=1}^{n} T^{ab}_{ij} \, V^{jk}_{bc}, \quad \text{where } T^{ab}_{ij} = -T^{ba}_{ij} = T^{ba}_{ji} = -T^{ab}_{ji}. \]
We'll show an algorithm that requires $n^6$ rather than $2n^6$ operations.

7 Symmetric matrix times vector
Let $b$ be a vector of length $n$ with elements in $\varrho$.
Let $A$ be an $n$-by-$n$ symmetric matrix with elements in $\varrho$, $A_{ij} = A_{ji}$.
We multiply matrix $A$ by $b$:
\[ c = A \cdot b, \qquad c_i = \sum_{j=1}^{n} \underbrace{A_{ij} \, b_j}_{\text{nonsymmetric}}. \]
This usual form has cost (ignoring low-order terms here and later)
\[ T_{\mathrm{symv}}(\varrho, n) = \mu_\varrho \, n^2 + \nu_\varrho \, n^2, \]
where $\mu_\varrho$ is the cost of multiplication and $\nu_\varrho$ of addition.

8 Fast symmetric matrix times vector
We can perform symv using fewer element-wise multiplications,
\[ c_i = \sum_{j=1}^{n} \underbrace{A_{ij} (b_i + b_j)}_{\text{symmetric}} - \underbrace{\Big( \sum_{j=1}^{n} A_{ij} \Big) b_i}_{\text{requires only } n \text{ mults}}. \]
- $A_{ij} (b_i + b_j)$ is symmetric and requires $\binom{n+1}{2}$ multiplications.
- $\big( \sum_{j=1}^{n} A_{ij} \big) b_i$ may be computed with $n$ multiplications.

The total cost of the new form is
\[ T'_{\mathrm{symv}}(\varrho, n) = \mu_\varrho \, \tfrac{1}{2} n^2 + \nu_\varrho \, \tfrac{5}{2} n^2. \]
This formulation is cheaper when $\mu_\varrho > 3 \nu_\varrho$.
For symm, the formulation is cheaper when $\mu_\varrho > \nu_\varrho$.
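A small sketch may help check the multiplication counts claimed for the fast symv form. The function names `symv_standard` and `symv_fast` and the test matrix are illustrative assumptions, not code from the talk; additions are not counted.

```python
def symv_standard(A, b):
    """c_i = sum_j A_ij * b_j, counting element-wise multiplications."""
    n = len(b)
    c, mults = [0] * n, 0
    for i in range(n):
        for j in range(n):
            c[i] += A[i][j] * b[j]
            mults += 1
    return c, mults

def symv_fast(A, b):
    """c_i = sum_j A_ij * (b_i + b_j) - (sum_j A_ij) * b_i."""
    n = len(b)
    c, mults = [0] * n, 0
    for i in range(n):
        for j in range(i, n):
            z = A[i][j] * (b[i] + b[j])   # symmetric: same value for (j, i)
            mults += 1
            c[i] += z
            if j != i:
                c[j] += z
    for i in range(n):
        c[i] -= sum(A[i]) * b[i]          # one multiplication per row
        mults += 1
    return c, mults

n = 6
A = [[(i + 1) * (j + 1) + abs(i - j) for j in range(n)] for i in range(n)]
b = list(range(1, n + 1))

c_std, m_std = symv_standard(A, b)
c_fast, m_fast = symv_fast(A, b)
```

For $n = 6$ the fast form uses $\binom{7}{2} + 6 = 27$ multiplications against $36$ for the usual form; the price is the extra additions reflected in the $\tfrac{5}{2} n^2$ term.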

9 Symmetric rank-2 update
Consider a rank-2 outer product of vectors $a$ and $b$ of length $n$ into a symmetric matrix $C$:
\[ C = a \cdot b^T + b \cdot a^T, \qquad C_{ij} = [a \cdot b^T]_{ij} = \underbrace{a_i \, b_j}_{\text{nonsymmetric}} + \underbrace{a_j \, b_i}_{\text{permutation}}. \]
Usually computed via the $n^2$ multiplications $a_i \, b_j$, with the cost
\[ T_{\mathrm{syr2}}(\varrho, n) = \mu_\varrho \, n^2 + \nu_\varrho \, n^2. \]

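10 Fast symmetric rank-2 update
We may compute the rank-2 update via a symmetric intermediate quantity,
\[ C_{ij} = \underbrace{(a_i + a_j)(b_i + b_j)}_{\text{symmetric}} - \underbrace{a_i b_i - a_j b_j}_{\text{requires only } n \text{ mults in total}}. \]
We can compute all $(a_i + a_j)(b_i + b_j)$ in $\binom{n+1}{2}$ multiplications.
The total cost is then given to leading order by
\[ T'_{\mathrm{syr2}}(\varrho, n) = \mu_\varrho \, \tfrac{1}{2} n^2 + \nu_\varrho \, \tfrac{5}{2} n^2. \]
- $T'_{\mathrm{syr2}}(\varrho, n)$ is smaller than $T_{\mathrm{syr2}}(\varrho, n)$ when $\mu_\varrho > 3 \nu_\varrho$.
- $T'_{\mathrm{syr2k}}(\varrho, n, K)$ is smaller than $T_{\mathrm{syr2k}}(\varrho, n, K)$ when $\mu_\varrho > \nu_\varrho$.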
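The same kind of check works for the rank-2 update. The following is a sketch under the assumption that only multiplications are counted; `syr2_fast` is a hypothetical name chosen for this example.

```python
def syr2_fast(a, b):
    """C_ij = (a_i + a_j) * (b_i + b_j) - a_i*b_i - a_j*b_j."""
    n = len(a)
    mults = 0
    d = []
    for i in range(n):
        d.append(a[i] * b[i])             # n diagonal products, reused below
        mults += 1
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            z = (a[i] + a[j]) * (b[i] + b[j])   # symmetric in (i, j)
            mults += 1
            C[i][j] = C[j][i] = z - d[i] - d[j]
    return C, mults

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
C, mults = syr2_fast(a, b)

# Reference: the usual form a_i*b_j + a_j*b_i.
C_ref = [[a[i] * b[j] + a[j] * b[i] for j in range(4)] for i in range(4)]
```

Here `mults` is $n + \binom{n+1}{2} = 14$ for $n = 4$, compared with $n^2 = 16$ products $a_i b_j$ in the usual form.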

11 Symmetrized product of symmetric matrices
Given symmetric matrices $A$, $B$ of dimension $n$ over a non-associative commutative ring $\varrho$, we seek to compute the anticommutator of $A$ and $B$:
\[ C = A \cdot B + B \cdot A, \qquad C_{ij} = [A \cdot B]_{ij} = \sum_{k=1}^{n} \underbrace{A_{ik} \, B_{jk}}_{\text{nonsymmetric}} + \sum_{k=1}^{n} \underbrace{A_{jk} \, B_{ik}}_{\text{permutation}}. \]
The above equations require $n^3$ multiplications and $n^3$ adds, for a total cost of
\[ T_{\mathrm{syrmm}}(\varrho, n) = \mu_\varrho \, n^3 + \nu_\varrho \, n^3. \]
Note that the symmetrized product defines a non-associative commutative ring (the Jordan ring) over the set of symmetric matrices.

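12 Fast symmetrized product of symmetric matrices
We can combine the ideas from the fast routines for symv and syr2:
\[
\begin{aligned}
C_{ij} ={}& \sum_{k} \underbrace{(A_{ij} + A_{ik} + A_{jk})(B_{ij} + B_{ik} + B_{jk})}_{\text{symmetric, requires } \binom{n+2}{3} \text{ mults}} \\
& - \underbrace{A_{ij} \Big( \sum_{k} (B_{ij} + B_{ik} + B_{jk}) \Big)}_{\text{requires } \binom{n+1}{2} \text{ mults}}
  - \underbrace{B_{ij} \Big( \sum_{k} (A_{ij} + A_{ik} + A_{jk}) \Big)}_{\text{requires } \binom{n+1}{2} \text{ mults}} \\
& - \underbrace{\sum_{k} A_{ik} B_{ik} - \sum_{k} A_{jk} B_{jk}}_{\text{requires } \binom{n+1}{2} \text{ mults}}.
\end{aligned}
\]
The reformulation requires $\binom{n}{3}$ multiplications to leading order,
\[ T'_{\mathrm{syrmm}}(\varrho, n) = \mu_\varrho \, \tfrac{1}{6} n^3 + \nu_\varrho \, \tfrac{5}{3} n^3, \]
which is faster than $T_{\mathrm{syrmm}}$ when $\mu_\varrho > (4/5) \nu_\varrho$.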
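The symmetric intermediate can be exercised on small matrices as follows. This is a sketch with hypothetical names (`sym_product_fast`, `sym_product_standard`), not the author's implementation. One assumption to flag: when every sum runs over all $k = 1, \ldots, n$, expanding the products shows that the terms displayed on the slide reproduce $C_{ij}$ only after adding back the low-order term $n \, A_{ij} B_{ij}$, so the sketch includes that correction. Scaling by the integer $n$ is treated as repeated addition and is not counted as a multiplication.

```python
from itertools import combinations_with_replacement as cwr

def sym_product_standard(A, B):
    """Reference: C = A*B + B*A via the usual n^3-multiplication form."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] + B[i][k] * A[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def sym_product_fast(A, B):
    n = len(A)
    mults = 0
    # Z_ijk = [A]_ijk * [B]_ijk is symmetric in (i, j, k): one product per
    # multiset of indices (stored under the sorted triple).
    Z = {}
    for i, j, k in cwr(range(n), 3):
        Z[(i, j, k)] = ((A[i][j] + A[i][k] + A[j][k])
                        * (B[i][j] + B[i][k] + B[j][k]))
        mults += 1
    a1 = [sum(row) for row in A]          # row sums of A
    b1 = [sum(row) for row in B]          # row sums of B
    # P_ij = A_ij * B_ij is symmetric; its row sums give sum_k A_ik * B_ik.
    P = [[0] * n for _ in range(n)]
    for i, j in cwr(range(n), 2):
        P[i][j] = P[j][i] = A[i][j] * B[i][j]
        mults += 1
    U = [sum(row) for row in P]
    C = [[0] * n for _ in range(n)]
    for i, j in cwr(range(n), 2):
        S = sum(Z[tuple(sorted((i, j, k)))] for k in range(n))
        sA = n * A[i][j] + a1[i] + a1[j]  # sum_k [A]_ijk
        sB = n * B[i][j] + b1[i] + b1[j]  # sum_k [B]_ijk
        t = A[i][j] * sB + B[i][j] * sA
        mults += 2
        # Slide terms plus the low-order correction n * A_ij * B_ij.
        C[i][j] = C[j][i] = S - t - U[i] - U[j] + n * P[i][j]
    return C, mults

n = 5
A = [[(i + 1) * (j + 1) + abs(i - j) for j in range(n)] for i in range(n)]
B = [[i + j + (i * j) % 3 for j in range(n)] for i in range(n)]
C_fast, mults = sym_product_fast(A, B)
C_ref = sym_product_standard(A, B)
```

For $n = 5$ the sketch performs $\binom{7}{3} + 3\binom{6}{2} = 80$ multiplications, against $n^3 = 125$ for the usual form; the $\binom{n+2}{3}$ term is the only one of cubic order.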

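13 Fast symmetrized product of symmetric matrices
We can rewrite that algorithm using symmetrization notation:
\[
\begin{aligned}
C_{ij} = [AB]_{ij} ={}& \sum_{k} \underbrace{(A_{ij} + A_{ik} + A_{jk})}_{[A]_{ijk}} \underbrace{(B_{ij} + B_{ik} + B_{jk})}_{[B]_{ijk}} \\
& - \underbrace{A_{ij} \Big( \sum_{k} (B_{ij} + B_{ik} + B_{jk}) \Big)}_{A_{ij} \sum_k [B]_{ijk}}
  - \underbrace{B_{ij} \Big( \sum_{k} (A_{ij} + A_{ik} + A_{jk}) \Big)}_{B_{ij} \sum_k [A]_{ijk}} \\
& - \underbrace{\Big( \sum_{k} A_{ik} B_{ik} + \sum_{k} A_{jk} B_{jk} \Big)}_{[A \otimes_1 B]_{ij}} \\
={}& \sum_{k} [A]_{ijk} [B]_{ijk} - A_{ij} \sum_{k} [B]_{ijk} - B_{ij} \sum_{k} [A]_{ijk} - [A \otimes_1 B]_{ij},
\end{aligned}
\]
where $(A \otimes_1 B)_i = \sum_{k} A_{ik} B_{ik}$.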

14 Comparison to Strassen's algorithm
[Figure: symmetry preserving algorithm vs. Strassen's algorithm for $s = t = v = \omega/3$. Horizontal axis: $n^s/s!$ (matrix dimension). Vertical axis: $[n^\omega/(s!\,t!\,v!)]$ / #multiplications (speed-up over the classical direct evaluation algorithm). Curves: Strassen's algorithm and the symmetry preserving algorithm, including $\omega = 6$.]

15 Fully symmetric tensor contractions
We can now state the general symmetric tensor contraction algorithms, given $A$ of order $s + v$ and $B$ of order $v + t$, contracted into $C$ of order $s + t$.
We define the (nonsymmetrized) contraction as
\[ C = A \odot_v B, \quad \text{where} \quad C_{\mathbf{ij}} = \sum_{\mathbf{k} \in \{1, \ldots, n\}^v} A_{\mathbf{ik}} \, B_{\mathbf{kj}}. \]
We can then define the symmetrized tensor contraction as
\[ C_{\mathbf{i}} = [A \odot_v B]_{\mathbf{i}}. \]
The usual method first computes $A \odot_v B$, with cost
\[ T_{\mathrm{syctr}}(\varrho, n, s, t, v) = \mu_\varrho \binom{n}{s} \binom{n}{t} \binom{n}{v} + \nu_\varrho \binom{n}{s} \binom{n}{t} \binom{n}{v}. \]

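16 Fast fully-symmetric contraction algorithm
The fast algorithm is defined as follows (using $\omega = s + t + v$):
\[
\begin{aligned}
C_{\mathbf{i}} ={}& \underbrace{\sum_{\mathbf{k} \in \{1, \ldots, n\}^v} [A]_{\mathbf{ik}} \, [B]_{\mathbf{ik}}}_{\text{symmetric, requires } \binom{n + \omega - 1}{\omega} \text{ multiplications}} \\
& - \underbrace{\sum_{p + q = 1}^{v} \; \sum_{\mathbf{k} \in \{1, \ldots, n\}^{v - p - q}} \Big( \sum_{\mathbf{p} \in \{1, \ldots, n\}^{p}} [A]_{\mathbf{ikp}} \Big) \Big( \sum_{\mathbf{q} \in \{1, \ldots, n\}^{q}} [B]_{\mathbf{ikq}} \Big)}_{\text{requires } O(n^{\omega - 1}) \text{ multiplications}} \\
& - \underbrace{\sum_{r = 1}^{\min(s, t)} [A \otimes_{v + r} B]_{\mathbf{i}}}_{\text{requires } O(n^{\omega - 1}) \text{ multiplications}}.
\end{aligned}
\]
The leading order cost is
\[ T'_{\mathrm{syctr}}(\varrho, n, s, t, v) = \mu_\varrho \binom{n}{\omega} + \nu_\varrho \binom{n}{\omega} \left[ \binom{\omega}{t} + \binom{\omega}{s} + \binom{\omega}{v} \right]. \]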
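To see when the fast algorithm pays off, the two leading-order cost expressions can be evaluated side by side. The sketch below assumes the standard cost is $(\mu_\varrho + \nu_\varrho) \binom{n}{s}\binom{n}{t}\binom{n}{v}$ and the fast cost is $\mu_\varrho \binom{n}{\omega} + \nu_\varrho \binom{n}{\omega} \big[ \binom{\omega}{t} + \binom{\omega}{s} + \binom{\omega}{v} \big]$; the function names and the sample values of `mu` and `nu` are made up for illustration.

```python
from math import comb

def cost_standard(n, s, t, v, mu=1.0, nu=1.0):
    """Leading-order cost of the usual method (mu per mult, nu per add)."""
    ops = comb(n, s) * comb(n, t) * comb(n, v)
    return mu * ops + nu * ops

def cost_fast(n, s, t, v, mu=1.0, nu=1.0):
    """Leading-order cost of the fast algorithm, with omega = s + t + v."""
    w = s + t + v
    adds_per_product = comb(w, t) + comb(w, s) + comb(w, v)
    return mu * comb(n, w) + nu * comb(n, w) * adds_per_product

n = 100
# symv-like case (s, t, v) = (1, 0, 1) with cheap multiplications ...
cheap_std = cost_standard(n, 1, 0, 1, mu=1.0, nu=1.0)
cheap_fast = cost_fast(n, 1, 0, 1, mu=1.0, nu=1.0)
# ... and with multiplications ten times as costly as additions.
dear_std = cost_standard(n, 1, 0, 1, mu=10.0, nu=1.0)
dear_fast = cost_fast(n, 1, 0, 1, mu=10.0, nu=1.0)
```

With $\mu_\varrho = \nu_\varrho$ the fast form is more expensive for this case, and with $\mu_\varrho = 10 \nu_\varrho$ it is cheaper, in line with the $\mu_\varrho > 3 \nu_\varrho$ threshold quoted for symv.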

17 Reduction in operation count of fast algorithm with respect to standard
[Figure, two panels: reduction factor in operation count versus $\omega$ for different entry types (entries are matrices, entries are complex, entries are real). Left panel: $s + t = \omega$. Right panel: $s + t + v = \omega$.]

The $(s, t, v)$ values for the left and right graphs are tabulated below.

| $\omega$ | Left graph | Right graph |
|---|---|---|
| 1 | (1, 0, 0) | (1, 0, 0) |
| 2 | (1, 1, 0) | (1, 1, 0) |
| 3 | (2, 1, 0) | (1, 1, 1) |
| 4 | (2, 2, 0) | (2, 1, 1) |
| 5 | (3, 2, 0) | (2, 2, 1) |
| 6 | (3, 3, 0) | (2, 2, 2) |

18 Theoretical error bounds
We express error bounds in terms of $\gamma_n = \frac{n \epsilon}{1 - n \epsilon}$, where $\epsilon$ is the machine precision.
Let $\Psi$ be the standard algorithm and $\Phi$ be the fast algorithm.
The error bound for the standard algorithm arises from matrix multiplication:
\[ \| \mathrm{fl}(\Psi(A, B)) - C \| \le \gamma_m \, \|A\| \, \|B\|, \quad \text{where } m = \binom{n}{v} \binom{\omega}{v}. \]
The following error bound holds for the fast algorithm:
\[ \| \mathrm{fl}(\Phi(A, B)) - C \| \le \gamma_m \, \|A\| \, \|B\|, \quad \text{where } m = 3 \binom{n}{v} \binom{\omega}{t} \binom{\omega}{s}. \]
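For a feel of the magnitudes involved, $\gamma_m$ can be evaluated for both bounds. This sketch assumes $m = \binom{n}{v}\binom{\omega}{v}$ for the standard algorithm and $m = 3\binom{n}{v}\binom{\omega}{t}\binom{\omega}{s}$ for the fast one, and takes double-precision unit roundoff $2^{-53}$ as $\epsilon$; the helper names are hypothetical.

```python
from math import comb

def gamma(m, eps=2.0 ** -53):
    """gamma_m = m*eps / (1 - m*eps); eps defaults to double precision."""
    return m * eps / (1.0 - m * eps)

def m_standard(n, s, t, v):
    return comb(n, v) * comb(s + t + v, v)

def m_fast(n, s, t, v):
    w = s + t + v
    return 3 * comb(n, v) * comb(w, t) * comb(w, s)

# Symmetrized matrix product, (s, t, v) = (1, 1, 1), with n = 100.
m_std = m_standard(100, 1, 1, 1)
m_fst = m_fast(100, 1, 1, 1)
ratio = gamma(m_fst) / gamma(m_std)
```

Under these assumptions the bound for the fast algorithm is larger by roughly the ratio of the two values of $m$ (about 9 in this case), independent of $n$ as long as $m \epsilon$ stays small.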

19 Stability of symmetry preserving algorithms

20 Nesting the fast algorithm
For partially-(anti)symmetric contractions we can nest the new algorithm over each group of symmetric modes:
- reduction in mults can translate to reduction in the number of operations,
- for Hankel matrices, this yields a sub-$O(n^2)$ algorithm, but not $O(n)$ or $O(n \log(n))$, as the reduction in mults is a factor of two only in leading order,
- but for coupled-cluster contractions, significant reductions in cost can be achieved:
  - CCSD: 1.3X on a typical system,
  - CCSDT: 2.1X on a typical system,
  - CCSDTQ: 5.7X on a typical system.

21 Communication cost of the standard algorithm
We consider communication bandwidth cost on a sequential machine with cache size $M$.
The intermediate formed by the standard algorithm may be computed via matrix multiplication with communication cost
\[ W(n, s, t, v, M) = \Theta\left( \frac{\binom{n}{s} \binom{n}{t} \binom{n}{v}}{\sqrt{M}} + \binom{n}{s + v} + \binom{n}{t + v} + \binom{n}{s + t} \right). \]
The cost of symmetrizing the resulting intermediate is low-order or the same.

22 Communication cost of the fast algorithm
We can lower bound the cost of the fast algorithm using the Hölder-Brascamp-Lieb inequality.
An algorithm that blocks $Z$ symmetrically nearly attains the cost
\[ W'(n, s, t, v, M) = O\left( \frac{\binom{n}{\omega}}{M^{\omega / (\omega - \min(s, t, v))}} \left[ \binom{\omega}{t} + \binom{\omega}{s} + \binom{\omega}{v} \right] + \binom{n}{s + v} + \binom{n}{t + v} + \binom{n}{s + t} \right), \]
which is not far from the lower bound and attains it when $s = t = v$.

23 Reduction in communication
[Figure: factor of reduction in communication volume ($W / W'$) versus $\omega$. Curves: fast algorithm communication reduction for $s + t + v = \omega$ and for $s + t = \omega$.]

24 Further communication lower bounds
We can use bilinear algorithm formulations to derive communication lower bounds:
- symmetry preserving tensor contraction algorithms have arbitrary order projections from an order $d$ set,
- bilinear algorithms [1] provide a more general framework,
- a bilinear algorithm is defined by matrices $F^{(A)}$, $F^{(B)}$, $F^{(C)}$,
\[ c = F^{(C)} \left[ (F^{(A)T} a) \circ (F^{(B)T} b) \right], \]
  where $\circ$ is the Hadamard (pointwise) product,
- communication lower bounds are derived based on matrix rank [2].

[1] Pan, Springer
[2] S., Hoefler, Demmel, in preparation
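As a concrete instance of the bilinear form $c = F^{(C)} [ (F^{(A)T} a) \circ (F^{(B)T} b) ]$, the fast rank-2 update for $n = 2$ can be written with three products. The matrices below were worked out for this illustration and are an assumption of the example, as are the helper names; the output vector is ordered as $(C_{11}, C_{12}, C_{22})$.

```python
def apply_transpose(F, x):
    """Return F^T x for a matrix F stored as a list of rows."""
    return [sum(F[i][r] * x[i] for i in range(len(x)))
            for r in range(len(F[0]))]

def bilinear(FA, FB, FC, a, b):
    """c = FC [ (FA^T a) o (FB^T b) ], with o the Hadamard product."""
    ta = apply_transpose(FA, a)
    tb = apply_transpose(FB, b)
    prods = [x * y for x, y in zip(ta, tb)]   # one multiplication per column
    return [sum(FC[i][r] * prods[r] for r in range(len(prods)))
            for i in range(len(FC))]

# Columns of FA and FB select a_1, a_2, a_1 + a_2 (and likewise for b),
# so the three products are a_1*b_1, a_2*b_2, (a_1 + a_2)*(b_1 + b_2).
FA = [[1, 0, 1],
      [0, 1, 1]]
FB = FA
# Rows of FC recombine the products into C_11, C_12, C_22.
FC = [[2, 0, 0],
      [-1, -1, 1],
      [0, 2, 0]]

a, b = [1, 2], [3, 4]
c = bilinear(FA, FB, FC, a, b)
```

The number of columns shared by the three matrices (here 3) is the number of element-wise products, which is the quantity the rank-based lower bounds reason about.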

25 Communication cost of symmetry preserving algorithms
For contraction of an order $s + v$ tensor with an order $v + t$ tensor [3]:
- the symmetry preserving algorithm requires a factor of $\frac{(s + v + t)!}{s! \, v! \, t!}$ fewer multiplies,
- matrix-vector-like algorithms ($\min(s, v, t) = 0$):
  - vertical communication is dominated by the largest tensor,
  - horizontal communication is asymptotically greater if only unique elements are stored and $s \ne v \ne t$,
- matrix-matrix-like algorithms ($\min(s, v, t) > 0$):
  - vertical and horizontal communication costs are asymptotically greater for the symmetry preserving algorithm when $s \ne v \ne t$.

[3] S., Hoefler, Demmel; Technical Report, ETH Zurich

26 Summary of results
The following table lists the leading order number of multiplications $F$ required by the standard algorithm and $F'$ by the fast algorithm for various cases of symmetric tensor contractions.

| $\omega$ | $s$ | $t$ | $v$ | $F$ | $F'$ | applications |
|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 0 | $n^2$ | $n^2/2$ | syr2, syr2k, her2, her2k |
| 2 | 1 | 0 | 1 | $n^2$ | $n^2/2$ | symv, symm, hemv, hemm |
| 3 | 1 | 1 | 1 | $n^3$ | $n^3/6$ | symmetrized matmul |
| $s + t + v$ | $s$ | $t$ | $v$ | $\binom{n}{s} \binom{n}{t} \binom{n}{v}$ | $\binom{n}{\omega}$ | any symmetric tensor contraction |

High-level conclusions:
The fast symmetric contraction algorithms provide interesting potential arithmetic cost improvements for complex BLAS routines and partially symmetric tensor contractions.
However, the new algorithms require more communication per flop, incur more numerical error, and are usually unable to exploit fused-multiply-add units or blocked matrix multiplication primitives.

27 Acknowledgements and references
Collaborators on various parts:
- James Demmel
- Torsten Hoefler
- Devin Matthews

S., Demmel; Technical Report, ETH Zurich

28 Backup slides

29 The fast algorithm for computing $C$ forms intermediates $Z_{\mathbf{i}}$, $V_{\mathbf{i}}$ and $W_{\mathbf{i}}$ with $\binom{n}{\omega}$ multiplications (where $\omega = s + t + v$). Each intermediate is built from products of the partial sums $\sum_{\mathbf{j} \in \chi(\mathbf{i})} A_{\mathbf{j}}$ and $\sum_{\mathbf{l} \in \chi(\mathbf{i})} B_{\mathbf{l}}$, and the output $C_{\mathbf{i}}$ is then assembled from $Z$, $V$ and $W$.
