Strassen's Algorithm for Tensor Contraction


Strassen's Algorithm for Tensor Contraction. Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn. The University of Texas at Austin. September 14-15, 2017, Tensor Computation Workshop, Flatiron Institute, 162 Fifth Ave, New York City.

Marry Strassen with Tensor Contraction

M0 := (A00 + A11)(B00 + B11);
M1 := (A10 + A11) B00;
M2 := A00 (B01 - B11);
M3 := A11 (B10 - B00);
M4 := (A00 + A01) B11;
M5 := (A10 - A00)(B00 + B01);
M6 := (A01 - A11)(B10 + B11);

C00 += M0 + M3 - M4 + M6;
C01 += M2 + M4;
C10 += M1 + M3;
C11 += M0 - M1 + M2 + M5;

Practical speedup? O(n^3) -> O(n^2.8)
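The seven products and eight accumulations above can be written down directly. A minimal sketch, in which scalars stand in for the four submatrices of each operand (in practice each entry is a matrix block and +, -, * are matrix operations):

```python
def strassen_2x2(A, B, C):
    """One-level Strassen: update C += A*B on a 2x2 partition using
    7 multiplications instead of 8. Layout: ((X00, X01), (X10, X11));
    scalars stand in for submatrices."""
    (A00, A01), (A10, A11) = A
    (B00, B01), (B10, B11) = B
    (C00, C01), (C10, C11) = C

    M0 = (A00 + A11) * (B00 + B11)
    M1 = (A10 + A11) * B00
    M2 = A00 * (B01 - B11)
    M3 = A11 * (B10 - B00)
    M4 = (A00 + A01) * B11
    M5 = (A10 - A00) * (B00 + B01)
    M6 = (A01 - A11) * (B10 + B11)

    return ((C00 + M0 + M3 - M4 + M6, C01 + M2 + M4),
            (C10 + M1 + M3,           C11 + M0 - M1 + M2 + M5))
```

For scalar 2x2 inputs this reproduces the ordinary product, e.g. strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8)), ((0, 0), (0, 0))) gives ((19, 22), (43, 50)).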

Outline
- Background: high-performance GEMM; high-performance Strassen; high-performance tensor contraction
- Strassen's Algorithm for Tensor Contraction
- Performance Model
- Experiments
- Conclusion

High-performance matrix multiplication (GEMM) 4

State-of-the-art GEMM in BLIS. BLAS-like Library Instantiation Software (BLIS) is a portable framework for instantiating BLAS-like dense linear algebra libraries. [Field Van Zee and Robert van de Geijn. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM TOMS 41.3 (2015): 14.] BLIS provides a refactoring of the GotoBLAS algorithm (the best-known approach on CPUs) for implementing GEMM. [Kazushige Goto and Robert van de Geijn. High-performance implementation of the level-3 BLAS. ACM TOMS 35.1 (2008): 4. Kazushige Goto and Robert van de Geijn. Anatomy of high-performance matrix multiplication. ACM TOMS 34.3 (2008): 12.] The GEMM implementation in BLIS has six layers of loops: the outer five loops are written in C, while the inner-most loop (the micro-kernel) is written in assembly for high performance.

GotoBLAS algorithm for GEMM in BLIS. Blocking for the memory hierarchy: a kC x nC panel of B resides in the L3 cache and an mC x kC block of A in the L2 cache; the micro-kernel multiplies an mR x kC micro-panel of A by a kC x nR micro-panel of B, accumulating an mR x nR micro-tile of C in registers. [*Field G. Van Zee and Tyler M. Smith. Implementing high-performance complex matrix multiplication via the 3m and 4m methods. ACM TOMS, accepted.]
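The loop structure just described can be caricatured in a few lines. This is an illustrative sketch only (toy blocking parameters, pure Python, no packing or assembly micro-kernel), not BLIS code:

```python
# Toy blocking parameters; in BLIS, nC/kC/mC are sized so panels fit the
# L3/L2/L1 caches and mR x nR fits the register file.
NC, KC, MC, NR, MR = 4, 4, 2, 2, 2

def blocked_gemm(A, B, C):
    """C += A*B with the GotoBLAS five-loop structure around a
    'micro-kernel' (here, the three innermost scalar loops).
    A, B, C are lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):              # 5th loop: nC-wide slabs of B, C
        for pc in range(0, k, KC):          # 4th loop: kC x nC panel of B (L3)
            for ic in range(0, m, MC):      # 3rd loop: mC x kC block of A (L2)
                for jr in range(jc, min(jc + NC, n), NR):      # 2nd loop: kC x nR micro-panel
                    for ir in range(ic, min(ic + MC, m), MR):  # 1st loop: mR x kC micro-panel
                        # micro-kernel: mR x nR tile of C stays "in registers"
                        for i in range(ir, min(ir + MR, m)):
                            for j in range(jr, min(jr + NR, n)):
                                for p in range(pc, min(pc + KC, k)):
                                    C[i][j] += A[i][p] * B[p][j]
    return C
```

The point is the ordering: each cache level holds one operand's block while the loops below reuse it many times.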

High-performance Strassen. [*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. Strassen's Algorithm Reloaded. In SC'16.]

Strassen's Algorithm Reloaded. Reordering the seven products so that each M is accumulated into C as soon as it is computed:

M0 := (A00 + A11)(B00 + B11); C00 += M0; C11 += M0;
M1 := (A10 + A11) B00;        C10 += M1; C11 -= M1;
M2 := A00 (B01 - B11);        C01 += M2; C11 += M2;
M3 := A11 (B10 - B00);        C00 += M3; C10 += M3;
M4 := (A00 + A01) B11;        C01 += M4; C00 -= M4;
M5 := (A10 - A00)(B00 + B01); C11 += M5;
M6 := (A01 - A11)(B10 + B11); C00 += M6;

Each step is an instance of M := (X+Y)(V+W); C += M; D += M. The general operation for one-level Strassen is

M := (X + dY)(V + eW); C += g0 M; D += g1 M; with g0, g1, d, e in {-1, 0, 1}.

[*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. Strassen's Algorithm Reloaded. In SC'16.]
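Because all seven products share the uniform form M := (X + dY)(V + eW); C += g0 M; D += g1 M, they can be driven from a single coefficient table. A small sketch (flat index 0..3 for the 00, 01, 10, 11 partitions; scalars stand in for submatrices; the table encoding is mine, not the authors'):

```python
# Each row: (iX, iY, d, iV, iW, e, iC, g0, iD, g1), coefficients in {-1, 0, 1}.
# A zero coefficient means that operand/update is unused.
STRASSEN_TABLE = [
    (0, 3, +1, 0, 3, +1, 0, +1, 3, +1),  # M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0
    (2, 3, +1, 0, 0,  0, 2, +1, 3, -1),  # M1 := (A10+A11) B00;      C10 += M1; C11 -= M1
    (0, 0,  0, 1, 3, -1, 1, +1, 3, +1),  # M2 := A00 (B01-B11);      C01 += M2; C11 += M2
    (3, 3,  0, 2, 0, -1, 0, +1, 2, +1),  # M3 := A11 (B10-B00);      C00 += M3; C10 += M3
    (0, 1, +1, 3, 3,  0, 1, +1, 0, -1),  # M4 := (A00+A01) B11;      C01 += M4; C00 -= M4
    (2, 0, -1, 0, 1, +1, 3, +1, 3,  0),  # M5 := (A10-A00)(B00+B01); C11 += M5
    (1, 3, -1, 2, 3, +1, 0, +1, 0,  0),  # M6 := (A01-A11)(B10+B11); C00 += M6
]

def one_level_strassen(A, B, C):
    """Apply the seven general operations. A, B, C are length-4 lists
    holding the 00, 01, 10, 11 partitions (scalars here)."""
    for iX, iY, d, iV, iW, e, iC, g0, iD, g1 in STRASSEN_TABLE:
        M = (A[iX] + d * A[iY]) * (B[iV] + e * B[iW])
        C[iC] += g0 * M
        C[iD] += g1 * M
    return C
```

For example, one_level_strassen([1, 2, 3, 4], [5, 6, 7, 8], [0, 0, 0, 0]) returns [19, 22, 43, 50], matching the classical product.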

M := (X+Y)(V+W); C += M; D += M; [*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. Strassen's Algorithm Reloaded. In SC'16.]

C += AB; M := (X+Y)(V+W); C += M; D += M; 10

C += AB; M := (X+Y)(V+W); C += M; D += M. The blocking of GEMM carries over: kC x nC panel in the L3 cache, mC x kC block in the L2 cache, mR x nR micro-tile in registers.

High-performance Tensor Contraction Devin A. Matthews. High-Performance Tensor Contraction without Transposition. Accepted in SISC. 12

Matrix vs. Tensor. Matrix multiplication: BLAS/BLIS. Tensor contraction: TBLIS/GETT. [Paul Springer and Paolo Bientinesi. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. arXiv:1607.00145 (2016). Devin A. Matthews. High-Performance Tensor Contraction without Transposition. Accepted in SISC.]

C := AB + C 14

Outline
- Background: high-performance GEMM; high-performance Strassen; high-performance tensor contraction
- Strassen's Algorithm for Tensor Contraction
- Performance Model
- Experiments
- Conclusion


Matrix vs. Tensor. Matrix multiplication: BLAS/BLIS. Tensor contraction: TBLIS/GETT.

Tensors As Matrices: Block-Scatter-Matrix View. Tensor of dimensions 8x2x4: the d dimension is stride-1, while the other dimensions have increasing strides (8 and 16). [Figure: the 64 tensor entries laid out at memory offsets 0-63 along the c, a, and d directions.]

Tensors As Matrices: Block-Scatter-Matrix View. The same tensor viewed as an 8x8 matrix: the column dimension merges a and c (with strides 16 = 8x2 for a and 8 for c), while the row dimension d is stride-1 (i.e., A is row-major). Scatter vectors (rscat_A, cscat_A) store the memory offset of each row and column position; block-scatter vectors (rbs_A, cbs_A) store the stride within each block of rows or columns, or zero for irregular blocks. This allows vector load/store instructions for stride-1 blocks and vector gather/scatter instructions for stride-n blocks. [Devin A. Matthews. High-Performance Tensor Contraction without Transposition. Accepted in SISC.]
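A hedged sketch of how scatter and block-scatter vectors might be computed for the 8x2x4 example (function names and conventions are illustrative, not TBLIS's actual API):

```python
def scatter(dims, strides):
    """Memory offsets along a merged index; the first (dim, stride) pair
    varies fastest. For the columns ac of the example: a (dim 4,
    stride 16) varies fastest, then c (dim 2, stride 8)."""
    offsets = [0]
    for dim, stride in zip(dims, strides):
        offsets = [off + i * stride for i in range(dim) for off in offsets]
    return offsets

def block_scatter(scat, bsize):
    """Per-block stride: s if a block's offsets are evenly spaced by s,
    else 0 (irregular block, handled with gather/scatter instructions)."""
    result = []
    for j in range(0, len(scat), bsize):
        block = scat[j:j + bsize]
        steps = {block[i + 1] - block[i] for i in range(len(block) - 1)}
        result.append(steps.pop() if len(steps) == 1 else 0)
    return result
```

For the example, scatter((8,), (1,)) gives row offsets 0..7 and scatter((4, 2), (16, 8)) gives column offsets [0, 16, 32, 48, 8, 24, 40, 56]; with 4-wide blocks both column blocks are regular with stride 16, so vectorized access applies everywhere here.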

Strassen's Algorithm for Tensor Contraction. Running example: C_abc += A_dca B_db. Through the block-scatter-matrix view, A (via rscat_A/cscat_A), B (via rscat_B/cscat_B), and C (via rscat_C/cscat_C) are treated as matrices indexed by d and the merged ac and b dimensions, so Strassen's algorithm can operate directly on submatrices of the tensors without transposition. [Figure: the tensors and their scatter and block-scatter vectors.] [Jianyu Huang, Devin A. Matthews, and Robert A. van de Geijn. Strassen's Algorithm for Tensor Contraction. arXiv:1704.03092 (2017).]

Modifications to GEMM (illustrated for M0 := (A00 + A11)(B00 + B11); C00 += M0; C11 += M0):
- Packing routines: implicit tensor-to-matrix permutations, and addition of the submatrices of A and B during packing.
- Micro-kernel: implicit matrix-to-tensor transformations, and scatter updates of the submatrices of C.
As a result, no additional workspace is needed for transposition (tensor contraction) or for summation (Strassen).
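The key fusion in the packing routine can be sketched as follows. This is my own illustration of the idea (gathering two submatrices through their scatter vectors and adding them while packing), not the authors' actual implementation:

```python
def pack_sum(T, rscat_x, cscat_x, rscat_y, cscat_y, d=1):
    """Pack the sum X + d*Y into a contiguous column-major buffer,
    where X and Y are submatrix views of the flat tensor storage T,
    each defined by row/column scatter vectors. The addition happens
    while packing, so it costs no extra pass over memory."""
    return [T[rx + cx] + d * T[ry + cy]
            for cx, cy in zip(cscat_x, cscat_y)   # over packed columns
            for rx, ry in zip(rscat_x, rscat_y)]  # over packed rows
```

With T = list(range(64)), packing X (rows [0..3], columns [0, 16, 32, 48]) plus Y (rows [4..7], columns [8, 24, 40, 56]) yields a 16-element buffer whose first column is [12, 14, 16, 18], i.e. T[0]+T[12], T[1]+T[13], and so on.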



Variations on a theme:
- Naive Strassen: a traditional implementation with temporary buffers for the sums of submatrices of A and B and for the intermediate M.
- AB Strassen: the additions of submatrices of A and B are fused into the packing routines; a temporary buffer is still used for M.
- ABC Strassen: additions of A and B are fused into packing and the updates of C into the micro-kernel, requiring no extra workspace beyond the packing buffers.

Outline
- Background: high-performance GEMM; high-performance Strassen; high-performance tensor contraction
- Strassen's Algorithm for Tensor Contraction
- Performance Model
- Experiments
- Conclusion

Performance Model
- Performance metric
- Total time breakdown
- Arithmetic operations
- Memory operations
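As an illustration of the arithmetic-operations part of such a model (a back-of-the-envelope count of my own, not the paper's full model, which also accounts for memory operations): classical GEMM performs 2mnk flops, while L levels of Strassen perform 7^L half-sized multiplications plus extra submatrix additions.

```python
def gemm_flops(m, n, k):
    """Classical C += A*B: one multiply and one add per (i, j, p) triple."""
    return 2 * m * n * k

def strassen_flops(m, n, k, levels=1):
    """Per level of Strassen: 7 half-sized multiplies, plus 5 additions
    each on submatrices of A and B and 12 updates of submatrices of C
    (counting the accumulations in the one-level scheme above)."""
    if levels == 0:
        return gemm_flops(m, n, k)
    m2, n2, k2 = m // 2, n // 2, k // 2
    adds = 5 * m2 * k2 + 5 * k2 * n2 + 12 * m2 * n2
    return 7 * strassen_flops(m2, n2, k2, levels - 1) + adds
```

For large square problems the multiplication term dominates and each level approaches the 7/8 ratio, which is why the observed speedup per level is bounded well below the asymptotic promise and why the extra additions make Strassen unprofitable for small blocks.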


Outline
- Background: high-performance GEMM; high-performance Strassen; high-performance tensor contraction
- Strassen's Algorithm for Tensor Contraction
- Performance Model
- Experiments
- Conclusion

Synthetic Tensor Contractions Intel Xeon E5-2680 v2 (Ivy Bridge, 10 core/socket) 30


Real-world Benchmark. Intel Xeon E5-2680 v2 (Ivy Bridge, 10 core/socket). Compared against: Paul Springer and Paolo Bientinesi. Design of a High-Performance GEMM-like Tensor-Tensor Multiplication. arXiv:1607.00145 (2016).

Distributed Memory Experiments. Weak-scalability results of the various implementations for a 4-D tensor contraction from a CCSD application on distributed memory. CTF shows the performance of the Cyclops Tensor Framework (linked with Intel MKL). [Solomonik, Edgar, Devin Matthews, Jeff Hammond, and James Demmel. Cyclops Tensor Framework: Reducing Communication and Eliminating Load Imbalance in Massively Parallel Contractions. IPDPS 2013, pp. 813-824.]

Outline
- Background: high-performance GEMM; high-performance Strassen; high-performance tensor contraction
- Strassen's Algorithm for Tensor Contraction
- Performance Model
- Experiments
- Conclusion

Conclusion
- First work to apply Strassen's algorithm to tensor contraction.
- Matrix summations and tensor permutations are fused with the packing and micro-kernel operations inside GEMM.
- Explicit transpositions and extra workspace are avoided, reducing the overhead of memory movement.
- Achieves roughly 1.3x speedup on single-core, multicore, and distributed-memory parallel architectures.

Acknowledgements
- NSF grants ACI-1550493 and CCF-1714091.
- A gift from the Qualcomm Foundation.
- Intel Corporation, through an Intel Parallel Computing Center (IPCC).
- Access to the Maverick and Stampede supercomputers administered by TACC.
We thank Martin Schatz for his help with the distributed memory implementations, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

The source code can be downloaded from: https://github.com/flame/tblis-strassen 40