Tensor Contractions with Extended BLAS Kernels on CPU and GPU
1 Tensor Contractions with Extended BLAS Kernels on CPU and GPU
Cris Cecka, Senior Research Scientist, NVIDIA Research, Santa Clara, California
Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar, Electrical Engineering and Computer Science, University of California, Irvine
Workshop on Batched, Reproducible, and Reduced Precision BLAS, February 25, 2017
2-5 Tensor Contraction - Motivation
Modern data is inherently multi-dimensional.
[Figure: examples of multi-dimensional data, including a neural network with input, two hidden layers (Hidden 1, Hidden 2), and output.]
6-9 Tensor Contraction - Motivation
What is tensor contraction?
[Figure: a 4x2x2 tensor A contracted with a 2x1 matrix B yields a 4x2x1 tensor C; each slice A(:,1,:), A(:,2,:) is multiplied by B.]
e.g. C_mnp = A_mnk B_kp (summed over the shared index k)

Why do we need tensor contraction?
1. Core primitive of multilinear algebra.
2. BLAS Level 3: unbounded compute intensity.
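This particular contraction needs no data movement at all: fusing m and n into a single index of size M*N turns it into one GEMM, C_(mn)p = A_(mn)k B_kp. A minimal reference sketch (not from the slides), assuming packed column-major storage with the first index fastest:

#include <cstddef>
#include <vector>

// Reference: C_mnp = sum_k A_mnk * B_kp, with A of extents (M,N,K), B (K,P), C (M,N,P).
// The fused index i = m + n*M runs over the first two modes, so the loop nest is
// literally a GEMM on an (M*N) x K by K x P problem.
void contract_ref(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int N, int K, int P) {
  for (int p = 0; p < P; ++p)
    for (int i = 0; i < M * N; ++i) {
      float acc = 0.f;
      for (int k = 0; k < K; ++k)
        acc += A[i + (std::size_t)k * M * N] * B[k + (std::size_t)p * K];
      C[i + (std::size_t)p * M * N] = acc;
    }
}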
10-12 Tensor Contraction - Motivation
Lots of hot applications at the moment: machine learning, deep learning.
e.g. Learning a latent variable model with tensor decomposition: topic model [1]
- h: PDF of topics in a document.
- A: topic-word matrix, A_ij = P(x_m = i | y_m = j).
Form the third-order tensor M_3 = E(x ⊗ x ⊗ x) = Σ_i h_i a_i ⊗ a_i ⊗ a_i.

[1] Tensor Decompositions for Learning Latent Variable Models, Anima Anandkumar, Rong Ge, Daniel Hsu, et al.
13-14 Tensor Contraction - Motivation
Distributed FFT
15-17 Tensor Contraction - Motivation
What do we have?
Tensor computation libraries - arbitrary/restricted tensor operations of any order and dimension:
1. Tensor Toolbox (MATLAB)
2. FTensor (C++)
3. Cyclops (C++)
4. BTAS (C++)
5. All the Python...
Efficient computing frameworks:
1. Static analysis solutions: PPCG [ISL] (polyhedral), TCE (DSL)
2. Parallel and distributed primitives: BLAS, cuBLAS, BLIS, BLASX, cublasXt
18-20 Tensor Contraction - Motivation: Libraries
Explicit permutation dominates. Consider C_mnp = A_km B_pkn; a typical library computes it as
1. A_km → A_mk
2. B_pkn → B_kpn
3. C_mnp → C_mpn
4. C_m(pn) = A_mk B_k(pn)
5. C_mpn → C_mnp
[Figure: fraction of time spent in copies/transpositions vs. n, CPU (top) and GPU (bottom); lines shown for 1, 2, 3, and 6 transpositions.]
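For scale, a minimal sketch (not from the slides) of just one of the permutation steps above, B_pkn → B_kpn: every such pass streams the entire tensor through memory and performs no floating-point work, which is why the permutations dominate at small and moderate sizes.

#include <cstddef>

// Permute B_pkn -> B_kpn for packed column-major storage (first index fastest).
// Pure data movement: 2*P*K*N element accesses, zero flops.
void permute_pkn_to_kpn(const float* B, float* Bt, int P, int K, int N) {
  for (int n = 0; n < N; ++n)
    for (int k = 0; k < K; ++k)
      for (int p = 0; p < P; ++p)
        Bt[k + (std::size_t)p * K + (std::size_t)n * K * P] =
            B[p + (std::size_t)k * P + (std::size_t)n * P * K];
}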
21 Existing Primitives
GEMM: suboptimal for many small matrices.
Pointer-to-Pointer BatchedGEMM: available in MKL 11.3β and cuBLAS 4.1.
C[p] = α op(A[p]) op(B[p]) + β C[p]

cublas<T>gemmBatched(cublasHandle_t handle,
                     cublasOperation_t transa, cublasOperation_t transb,
                     int M, int N, int K,
                     const T* alpha,
                     const T** A, int lda,
                     const T** B, int ldb,
                     const T* beta,
                     T** C, int ldc,
                     int batchCount)
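For contrast with the strided interface introduced later, a hedged sketch of what the pointer-to-pointer interface requires when the batch entries are really just slices of contiguous tensors at a fixed stride: arrays of pointers must be built on the host and copied to the device before the call. The function and pointer names here are illustrative, not from the slides.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Batch GEMMs over the last mode of contiguous tensors dB and dC;
// dA is a single matrix reused by every batch entry. All pointers are
// device pointers; strides are element counts.
void batched_via_pointer_arrays(cublasHandle_t handle,
                                const float* dA, int lda,
                                const float* dB, int ldb, long long strideB,
                                float* dC, int ldc, long long strideC,
                                int M, int N, int K, int batch) {
  std::vector<const float*> hA(batch), hB(batch);
  std::vector<float*> hC(batch);
  for (int p = 0; p < batch; ++p) {
    hA[p] = dA;                  // same A for every batch entry
    hB[p] = dB + p * strideB;    // p-th slice of B
    hC[p] = dC + p * strideC;    // p-th slice of C
  }
  const float **dAarr, **dBarr; float **dCarr;
  cudaMalloc((void**)&dAarr, batch * sizeof(float*));
  cudaMalloc((void**)&dBarr, batch * sizeof(float*));
  cudaMalloc((void**)&dCarr, batch * sizeof(float*));
  cudaMemcpy(dAarr, hA.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dBarr, hB.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  cudaMemcpy(dCarr, hC.data(), batch * sizeof(float*), cudaMemcpyHostToDevice);
  const float alpha = 1.f, beta = 0.f;
  cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                     &alpha, dAarr, lda, dBarr, ldb,
                     &beta, dCarr, ldc, batch);
  cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
}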
22-23 Existing Primitives: Pointer-to-Pointer BatchedGEMM
Except, actually... Solution: StridedBatchedGEMM.
24-25 StridedBatchedGEMM
Exists!... Still no documentation?!? Documentation as of last Tuesday!
In cuBLAS 8.0:

$ grep StridedBatched -A 17 /usr/local/cuda/include/cublas_api.h
2320:CUBLASAPI cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle,
         cublasOperation_t transa, cublasOperation_t transb,
         int m, int n, int k,
         const float *alpha,          // host or device pointer
         const float *A, int lda,
         long long int strideA,       // purposely signed
         const float *B, int ldb,
         long long int strideB,
         const float *beta,           // host or device pointer
         float *C, int ldc,
         long long int strideC,
         int batchCount);
...
26 StridedBatchedGEMM

cublas<T>gemmStridedBatched(cublasHandle_t handle,
                            cublasOperation_t transa, cublasOperation_t transb,
                            int M, int N, int K,
                            const T* alpha,
                            const T* A, int lda1, int strideA,
                            const T* B, int ldb1, int strideB,
                            const T* beta,
                            T* C, int ldc1, int strideC,
                            int batchCount)

- Covers the common use case of pointer-to-pointer BatchedGEMM.
- No pointer-to-pointer data structure or overhead.
- Performance on par with pure GEMM (P100 and beyond).
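To make the ld/stride parameters concrete, a minimal sketch (an illustrative helper, not part of cuBLAS): for a packed column-major tensor T of extents (X, Y, Z), element (x, y, z) lives at offset x + y*X + z*X*Y, so the two natural batched matrix views are:

// Leading dimension and batch stride of a packed column-major X x Y x Z tensor,
// depending on which mode plays the batch role.
struct MatView { long long ld, stride; };
// Batch over the last mode z: matrices T(:,:,z) are X x Y.
MatView batch_last(long long X, long long Y)   { return { X, X * Y }; }
// Batch over the middle mode y: matrices T(:,y,:) are X x Z.
MatView batch_middle(long long X, long long Y) { return { X * Y, X }; }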
27 Tensor Contraction with Extended BLAS Primitives
C_mnp = A · B, where C_mnp is stored as C[m + n*ldc1 + p*ldc2].
Brackets [ ] mark the mode that is batched over (it becomes the batchCount of a StridedBatchedGEMM); parentheses ( ) mark modes flattened into a single matrix dimension.

Case | Contraction | Kernel 1                | Kernel 2
1.1  | A_mk B_knp  | C_m(np) = A_mk B_k(np)  | C_mn[p] = A_mk B_kn[p]
1.2  | A_mk B_kpn  | C_mn[p] = A_mk B_k[p]n  | C_m[n]p = A_mk B_kp[n]
1.3  | A_mk B_nkp  | C_mn[p] = A_mk B_nk[p]  |
1.4  | A_mk B_pkn  | C_m[n]p = A_mk B_pk[n]  |
1.5  | A_mk B_npk  | C_m(np) = A_mk B_(np)k  | C_mn[p] = A_mk B_n[p]k
1.6  | A_mk B_pnk  | C_m[n]p = A_mk B_p[n]k  |
2.1  | A_km B_knp  | C_m(np) = A_km B_k(np)  | C_mn[p] = A_km B_kn[p]
2.2  | A_km B_kpn  | C_mn[p] = A_km B_k[p]n  | C_m[n]p = A_km B_kp[n]
2.3  | A_km B_nkp  | C_mn[p] = A_km B_nk[p]  |
2.4  | A_km B_pkn  | C_m[n]p = A_km B_pk[n]  |
2.5  | A_km B_npk  | C_m(np) = A_km B_(np)k  | C_mn[p] = A_km B_n[p]k
2.6  | A_km B_pnk  | C_m[n]p = A_km B_p[n]k  |
3.1  | A_nk B_kmp  | C_mn[p] = B_km[p] A_nk  |
3.2  | A_nk B_kpm  | C_mn[p] = B_k[p]m A_nk  |
3.3  | A_nk B_mkp  | C_mn[p] = B_mk[p] A_nk  |
3.4  | A_nk B_pkm  |                         |
3.5  | A_nk B_mpk  | C_mn[p] = B_m[p]k A_nk  |
3.6  | A_nk B_pmk  |                         |
4.1  | A_kn B_kmp  | C_mn[p] = B_km[p] A_kn  |
4.2  | A_kn B_kpm  | C_mn[p] = B_k[p]m A_kn  |
4.3  | A_kn B_mkp  | C_mn[p] = B_mk[p] A_kn  |
4.4  | A_kn B_pkm  |                         |
4.5  | A_kn B_mpk  | C_mn[p] = B_m[p]k A_kn  |
4.6  | A_kn B_pmk  |                         |
5.1  | A_pk B_kmn  | C_(mn)p = B_k(mn) A_pk  | C_m[n]p = B_km[n] A_pk
5.2  | A_pk B_knm  | C_m[n]p = B_k[n]m A_pk  |
5.3  | A_pk B_mkn  | C_m[n]p = B_mk[n] A_pk  |
5.4  | A_pk B_nkm  |                         |
5.5  | A_pk B_mnk  | C_(mn)p = B_(mn)k A_pk  | C_m[n]p = B_m[n]k A_pk
5.6  | A_pk B_nmk  |                         |
6.1  | A_kp B_kmn  | C_(mn)p = B_k(mn) A_kp  | C_m[n]p = B_km[n] A_kp
6.2  | A_kp B_knm  | C_m[n]p = B_k[n]m A_kp  |
6.3  | A_kp B_mkn  | C_m[n]p = B_mk[n] A_kp  |
6.4  | A_kp B_nkm  |                         |
6.5  | A_kp B_mnk  | C_(mn)p = B_(mn)k A_kp  | C_m[n]p = B_m[n]k A_kp
6.6  | A_kp B_nmk  |                         |
28 Tensor Contraction with Extended BLAS Primitives

Case | Contraction | Kernel 1               | Kernel 2               | Kernel 3
1.1  | A_mk B_knp  | C_m(np) = A_mk B_k(np) | C_mn[p] = A_mk B_kn[p] | C_m[n]p = A_mk B_k[n]p
6.1  | A_kp B_kmn  | C_(mn)p = B_k(mn) A_kp | C_m[n]p = B_km[n] A_kp |

Example: mapping to Level 3 BLAS routines.
Case 1.1, Kernel 2: C_mn[p] = A_mk B_kn[p]

cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, N, K,
                          1.0, A, lda1, 0,
                               B, ldb1, ldb2,
                          0.0, C, ldc1, ldc2,
                          P)
29 Tensor Contraction with Extended BLAS Primitives
Example: mapping to Level 3 BLAS routines.
Case 6.1, Kernel 2: C_m[n]p = B_km[n] A_kp

cublasDgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                          M, P, K,
                          1.0, B, ldb1, ldb2,
                               A, lda1, 0,
                          0.0, C, ldc2, ldc1,
                          N)
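For reference, a hedged, compilable form of the Case 1.1, Kernel 2 call, with the leading dimensions and strides written out for packed column-major tensors A (M x K), B (K x N x P), C (M x N x P). The real cuBLAS API takes alpha and beta by pointer; the device pointer names dA, dB, dC are illustrative.

const double alpha = 1.0, beta = 0.0;
cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, N, K,
                          &alpha,
                          dA, /*lda=*/M, /*strideA=*/0,                 // same A for every batch entry
                          dB, /*ldb=*/K, /*strideB=*/(long long)K * N,  // B(:,:,p)
                          &beta,
                          dC, /*ldc=*/M, /*strideC=*/(long long)M * N,  // C(:,:,p)
                          /*batchCount=*/P);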
30 Performance: Flattening vs. StridedBatchedGEMM
[Figure: flattening speedup (batched / flat) vs. n for Case 1.1 [n], Case 1.1 [p], Case 1.5 [p], and Case 6.1 [n].]
Prefer flattening to pure GEMM.
31 Performance: Batching in the last mode versus the middle mode
[Figure: last-mode speedup ([n] / [p]) vs. n, including Case 1.1.]
On CPU: prefer batching in the last mode.
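To make the comparison concrete, a hedged sketch of the two batching choices for Case 1.1 (C_mnp = A_mk B_knp) with packed column-major tensors, written with the cuBLAS interface for concreteness; the device pointer names are illustrative. Both calls perform the same M*N*K*P multiply-adds and differ only in how the slices are addressed.

const float alpha = 1.f, beta = 0.f;
// Kernel 2 - batch over the last mode p:   C_mn[p] = A_mk B_kn[p]
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                          &alpha, dA, M, 0,
                                  dB, K, (long long)K * N,
                          &beta,  dC, M, (long long)M * N, P);
// Kernel 3 - batch over the middle mode n: C_m[n]p = A_mk B_k[n]p
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, P, K,
                          &alpha, dA, M, 0,
                                  dB, K * N, K,
                          &beta,  dC, M * N, M, N);

In the middle-mode version the columns of each slice are K*N (input) and M*N (output) elements apart, so every GEMM touches memory with a much larger leading dimension; that layout difference is what drives the measured gap.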
32 Performance: Mixed-mode batching
[Figure: last-output-mode speedup ([n] / [p]) vs. n, including Case 1.2.]
On CPU: the batching mode of the output tensor matters more than the batching mode of the input tensor.
33 Analysis
Exceptional cases: cannot be computed by StridedBatchedGEMM.

Case | Contraction
3.4  | C_mnp = A_nk B_pkm
6.4  | C_mnp = A_kp B_nkm
     | C_mnp = A_mkp B_mkn
     | C_mnp = A_pkm B_nkp

Examples of exceptional cases.
These cases are precisely the interleaved GEMMs: when the batching index is the major (fastest-varying) index of an argument, that argument is interpreted as interleaved matrices. This may apply to one or both inputs and/or the output.
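A minimal reference sketch (not from the slides) of Case 3.4 makes the obstruction visible: within each p-slice of B, neither remaining index has unit stride (the unit-stride direction is the batch index p itself), so no leading-dimension/stride pair can describe the operand for a standard GEMM.

#include <cstddef>

// Case 3.4: C_mnp = sum_k A_nk * B_pkm, packed column-major (first index fastest).
// B is stored P x K x M, so for a fixed p the matrix B_p(k, m) has element
// addresses p + k*P + m*P*K -- strides P and P*K, never 1.
void case_3_4_ref(const float* A,   // N x K
                  const float* B,   // P x K x M
                  float* C,         // M x N x P
                  int M, int N, int K, int P) {
  for (int p = 0; p < P; ++p)
    for (int n = 0; n < N; ++n)
      for (int m = 0; m < M; ++m) {
        float acc = 0.f;
        for (int k = 0; k < K; ++k)
          acc += A[n + (std::size_t)k * N] *
                 B[p + (std::size_t)k * P + (std::size_t)m * P * K];
        C[m + (std::size_t)n * M + (std::size_t)p * M * N] = acc;
      }
}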
34 3D Tiled GEMM
Implement GEMM with a 3D tile:
- Transpositions are performed on the way to smem/registers.
- Keep the canonical GEMM core.
Considers three modes rather than two:
- Major mode: A_mnkpqr
- Reduction mode: A_mnkpqr
- Aux (batch, row, col) mode: A_mnkpqr
(Optional) The third tile dimension interpolates between pure GEMM and interleaved GEMM.
A nested loop over the remaining modes performs the full contraction.
35 3D Tiled GEMM
Tile-size tuning with PPCG for exceptional cases:
[Figure: case 6.5 tile-size tuning; time [µs] over (m, n) and (p, k) blockings from (1,1) to (128,1); PPCG compared with BatchedGEMV, BatchedGEMM, and GEAM as a function of n.]
36 3D Tiled GEMM
- C_mnp = A_mkp B_nkp: increasing BLK_P decreases the effective tile size.
- C_pmn = A_pmk B_pnk: increasing BLK_P increases cache-line utilization.
e.g. BLK_P = 1, 2, 4, 8.
BLK_P = 1 is equivalent to BLIS (strides in both row and column).
37 3D Tiled GEMM
Interface? Extend the StridedBatchedGEMM transpose parameters?

C_mnp = A_pmk B_pkn : EX_N, EX_N
C_mpn = A_pmk B_pnk : EX_N, EX_T
C_pmn = A_pkm B_pkn : EX_T, EX_N
C_pmn = A_pkm B_pnk : EX_T, EX_T
38 contract

contract(cublas::par, alpha,
         A, {M,P,K}, _<m,p,k>,
         B, {K,N,P}, _<k,n,p>,
         beta,
         C, _<m,n,p>);

Kernel dispatch on the role of each index (ROWIDX, COLIDX, BATIDX, REDIDX): dot, scal, ger, gemv, dgmm, gemm, and batched gemm cover most combinations, with exceptional (XXX) entries such as c_p = a_p b_p, c_p = a_pk b_pk, interleaved batched gemm, and one combination that might map to batched gemm with k = 1 or to a batched ger.
39-40 Applications: Tucker Decomposition
T_mnp = G_ijk A_mi B_nj C_pk
[Figure: Tucker decomposition of the tensor T_mnp into a core tensor G_ijk and factor matrices A_mi, B_nj, C_pk.]

Main steps in the algorithm (t is the iteration index):
Y_mjk = T_mnp B_nj^t C_pk^t
Y_ink = T_mnp A_mi^{t+1} C_pk^t
Y_ijp = T_mnp B_nj^{t+1} A_mi^{t+1}
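A hedged sketch of how the first update above could map onto the extended primitives: assume packed column-major tensors T (M x N x P), B (N x J), C (P x K), a scratch tensor Z (M x J x P), and illustrative device pointer names. The contraction over n is a StridedBatchedGEMM over p; the contraction over p then flattens the fused (m, j) mode into a single GEMM.

const float one = 1.f, zero = 0.f;
// Stage 1: Z_mj[p] = T_mn[p] * B_nj   (batched over p)
cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          M, J, N, &one,
                          dT, M, (long long)M * N,   // T(:,:,p) is M x N
                          dB, N, 0,                  // B reused for every p
                          &zero,
                          dZ, M, (long long)M * J,   // Z(:,:,p) is M x J
                          P);
// Stage 2: Y_(mj)k = Z_(mj)p * C_pk   (single GEMM over the fused (m, j) index)
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            M * J, K, P, &one,
            dZ, M * J,   // Z viewed as an (M*J) x P matrix
            dC, P,       // C is P x K
            &zero,
            dY, M * J);  // Y viewed as an (M*J) x K matrix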
41 Applications: Tucker Decomposition
Performance on Tucker decomposition:
[Figure: time (sec) vs. n for TensorToolbox, BTAS, Cyclops, CPU Batched, and GPU Batched.]
42 Applications: FFT
Low-communication FFT for multiple GPUs.
- StridedBatchedGEMM composes 75%+ of the runtime and is essential to the performance.
- The two custom kernels are precisely interleaved GEMMs.
- 2 P100 GPUs: 1.3x over cufftXt.
- 8 P100 GPUs: 2.1x over cufftXt.
43-44 Conclusion
- StridedBatchedGEMM in cuBLAS for generalized tensor contractions.
- Avoids explicit transpositions or permutations.
- 10x (GPU) and 2x (CPU) speedup on small/moderate-sized tensors.
- Available in cuBLAS 8.0.

Future work:
- Exceptional-case kernels/performance/interface??
- Library optimizations:
  - Matrix stride zero: Persistent Matrix StridedBatchedGEMM
  - Staged RNNs: Staged Persistent Matrix StridedBatchedGEMM
45 Thank you! Questions?