Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Size: px
Start display at page:

Download "Tensor Contractions with Extended BLAS Kernels on CPU and GPU"

Transcription

1 Tensor Contractions with Extended BLAS Kernels on CPU and GPU Cris Cecka Senior Research Scientist NVIDIA Research, Santa Clara, California Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar Electrical Engineering and Computer Science University of California, Irvine Workshop on Batched, Reproducible, and Reduced Precision BLAS February 25, 2017 Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

2 Tensor Contraction-Motivation Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

3 Tensor Contraction-Motivation Modern data is inherently multi-dimensional Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

4 Tensor Contraction-Motivation Modern data is inherently multi-dimensional Input Hidden 1 Hidden 2 Output Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

5 Tensor Contraction-Motivation Modern data is inherently multi-dimensional Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

6 Tensor Contraction-Motivation What is tensor contraction? Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

7 Tensor Contraction-Motivation What is tensor contraction? C C = A A B B Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

8 Tensor Contraction-Motivation What is tensor contraction? C C = A A B B A 422 = B 21 = A(:,1,:) C 421 A(:,2,:) e.g. C mnp = A mnk B kp Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

9 Tensor Contraction-Motivation What is tensor contraction? C C = A A B B A 422 = A(:,1,:) A(:,2,:) B 21 = C 421 e.g. C mnp = A mnk B kp Why do we need tensor contraction? 1 Core primitive of multilinear algebra. 2 BLAS Level 3: Unbounded compute intensity. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

10 Tensor Contraction Motivation Lots of hot applications at the moment: Machine learning Deep learning e.g. Learning latent variable model with tensor decomposition: Topic model 1 Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

11 Tensor Contraction Motivation Lots of hot applications at the moment: Machine learning Deep learning e.g. Learning latent variable model with tensor decomposition: Topic model 1 h: PDF of topics in a document. A: Topic-word matrix. A ij = P(x m = i y m = j) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

12 Tensor Contraction Motivation Lots of hot applications at the moment: Machine learning Deep learning e.g. Learning latent variable model with tensor decomposition: Topic model 1 h: PDF of topics in a document. A: Topic-word matrix. A ij = P(x m = i y m = j) Form third-order tensor M 3 = E(x x x) = i h ia i a i a i 1 Tensor Decompositions for Learning Latent Variable Models, Anima Anandkumar, Rong Ge, Daniel Hsu et. al. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

13 Tensor Contraction Motivation Distributed FFT Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

14 Tensor Contraction Motivation Distributed FFT Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

15 Tensor Contraction-Motivation What do we have? Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

16 Tensor Contraction-Motivation What do we have? Tensor computation libraries 1 Arbitrary/restricted tensor operation of any order and dimension 1 Tensortoolbox (Matlab) 2 FTensor (C++) 3 Cyclops (C++) 4 BTAS (C++) 5 All the Python... Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

17 Tensor Contraction-Motivation What do we have? Tensor computation libraries 1 Arbitrary/restricted tensor operation of any order and dimension 1 Tensortoolbox (Matlab) 2 FTensor (C++) 3 Cyclops (C++) 4 BTAS (C++) 5 All the Python... Efficient computing frame 1 Static analysis solutions 1 PPCG [ISL] (polyhedral) 2 TCE (DSL) 2 Parallel and distributed primitives 1 BLAS, cublas 2 BLIS, BLASX, cublasxt Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

18 Tensor Contraction-Motivation Libraries Explicit permutation dominates. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

19 Tensor Contraction-Motivation Libraries Explicit permutation dominates. Consider C mnp = A km B pkn. 1 A km A mk 2 B pkn B kpn 3 C mnp C mpn 4 C m(pn) = A mk B k(pn) 5 C mpn C mnp Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

20 Tensor Contraction-Motivation Libraries Explicit permutation dominates. Consider C mnp = A km B pkn. 1 A km A mk 2 B pkn B kpn 3 C mnp C mpn 4 C m(pn) = A mk B k(pn) 5 C mpn C mnp (Top) CPU. (Bottom) GPU. The fraction of time spent in copies/transpositions. Lines are shown with 1, 2, 3, and 6 transpositions. n Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

21 Existing Primitives GEMM Suboptimal for many small matrices. Pointer-to-Pointer BatchedGEMM Available in MKL 11.3β and cublas 4.1 C[p] = α op(a[p]) op(b[p]) + β C[p] cublas<t>gemmbatched(cublashandle_t handle, cublasoperation_t transa, cublasoperation_t transb, int M, int N, int K, const T* alpha, const T** A, int lda, const T** B, int ldb, const T* beta, T** C, int ldc, int batchcount) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

22 Existing Primitives Pointer-to-Pointer BatchedGEMM Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

23 Existing Primitives Pointer-to-Pointer BatchedGEMM Except actually... Solution: StridedBatchedGEMM Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

24 StridedBatchedGEMM Exists!... Still no documentation?!? Documentation as of last Tuesday! Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

25 StridedBatchedGEMM Exists!... Still no documentation?!? Documentation as of last Tuesday! In cublas 8.0: $$ grep StridedBatched -A 17 /usr/local/cuda/include/cublas_api.h 2320:CUBLASAPI cublasstatus_t cublassgemmstridedbatched (cublashandle_t handle, cublasoperation_t transa, cublasoperation_t transb, int m, int n, int k, const float *alpha, // host or device pointer const float *A, int lda, long long int stridea, // purposely signed const float *B, int ldb, long long int strideb, const float *beta, // host or device pointer float *C, int ldc, long long int stridec, int batchcount);... Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

26 StridedBatchedGEMM cublas<t>gemmstridedbatched(cublashandle_t handle, cublasoperation_t transa, cublasoperation_t transb, int M, int N, int K, const T* alpha, const T* A, int lda1, int stridea, const T* B, int ldb1, int strideb, const T* beta, T* C, int ldc1, int stridec, int batchcount) Common use case for Pointer-to-pointer BatchedGEMM. No Pointer-to-pointer data structure or overhead. Performance on par with pure GEMM (P100 and beyond). Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

27 Tensor Contraction with Extended BLAS Primitives C mnp = A B C mnp C[m + n ldc1 + p ldc2] Case Contraction Kernel1 Kernel2 Case Contraction Kernel1 Kernel2 1.1 AmkBknp C m(np) = AmkB k(np) C mn[p] = AmkB kn[p] 4.1 AknBkmp C mn[p] = B km[p] Akn 1.2 AmkBkpn C mn[p] = AmkB k[p]n C m[n]p = AmkB kp[n] 4.2 AknBkpm C mn[p] = B k[p]m Akn 1.3 AmkBnkp C mn[p] = AmkB nk[p] 4.3 AknBmkp C mn[p] = B mk[p] Akn 1.4 AmkBpkn C m[n]p = AmkB pk[n] 4.4 AknBpkm 1.5 AmkBnpk C m(np) = AmkB (np)k C mn[p] = AmkB n[p]k 4.5 AknBmpk C mn[p] = B m[p]k Akn 1.6 AmkBpnk C m[n]p = AmkB p[n]k 4.6 AknBpmk 2.1 AkmBknp C m(np) = A km B k(np) C mn[p] = A km B kn[p] 5.1 ApkBkmn C (mn)p = B k(mn) A pk C m[n]p = B km[n] A pk 2.2 AkmBkpn C mn[p] = A km B k[p]n C m[n]p = A km B kp[n] 5.2 ApkBknm C m[n]p = B k[n]m A pk 2.3 AkmBnkp C mn[p] = A km B nk[p] 5.3 ApkBmkn C m[n]p = B mk[n] A pk 2.4 AkmBpkn C m[n]p = A km B pk[n] 5.4 ApkBnkm 2.5 AkmBnpk C m(np) = A km B (np)k C mn[p] = A km B n[p]k 5.5 ApkBmnk C (mn)p = B (mn)k A pk C m[n]p = B m[n]k A pk 2.6 AkmBpnk C m[n]p = A km B p[n]k 5.6 ApkBnmk 3.1 AnkBkmp C mn[p] = B km[p] A nk 6.1 AkpBkmn C (mn)p = B k(mn) Akp C m[n]p = B km[n] Akp 3.2 AnkBkpm C mn[p] = B k[p]m A nk 6.2 AkpBknm C m[n]p = B k[n]m Akp 3.3 AnkBmkp C mn[p] = B mk[p] A nk 6.3 AkpBmkn C m[n]p = B mk[n] Akp 3.4 AnkBpkm 6.4 AkpBnkm 3.5 AnkBmpk C mn[p] = B m[p]k A nk 6.5 AkpBmnk C (mn)p = B (mn)k Akp C m[n]p = B m[n]k Akp 3.6 AnkBpmk 6.6 AkpBnmk Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

28 Tensor Contraction with Extended BLAS Primitives Case Contraction Kernel1 Kernel2 Kernel3 1.1 A mk B knp C m(np) = A mk B k(np) C mn[p] = A mk B kn[p] C m[n]p = A mk B k[n]p 6.1 A kp B kmn C (mn)p = Bk(mn) A kp C m[n]p = Bkm[n] A kp Example: Mappings to Level 3 BLAS routines Case 1.1, Kernel2: C mn[p] = A mk B kn[p] cublasdgemmstridedbatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K, 1.0, A, lda1, 0, B, ldb1, ldb2, 0.0, C, ldc1, ldc2, P) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

29 Tensor Contraction with Extended BLAS Primitives Case Contraction Kernel1 Kernel2 Kernel3 1.1 A mk B knp C m(np) = A mk B k(np) C mn[p] = A mk B kn[p] C m[n]p = A mk B k[n]p 6.1 A kp B kmn C (mn)p = Bk(mn) A kp C m[n]p = Bkm[n] A kp Example: Mappings to Level 3 BLAS routines Case 6.1, Kernel2: C m[n]p = B km[n] A kp cublasdgemmstridedbatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, M, P, K, 1.0, B, ldb1, ldb2, A, lda1, 0, 0.0, C, ldc2, ldc1, N) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

30 Performance Flatten V.S. SBGEMM Flattening Speedup (Batch / Flat) Case 1.1 [n] Case 1.1 [p] Case 1.5 [p] Case 6.1 [n] n n Prefer flattening to pure GEMM. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

31 Performance Batching in last mode versus middle mode Last Mode Speedup ([n] / [p]) n Case 1.1 Case n On CPU: Prefer batching in the last mode. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

32 Performance Mixed mode batching Last Output Mode Speedup ([n] / [p]) n Case 1.2 Case n On CPU: mode of the output tensor is more important than the batching mode of the input tensor. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

33 Analysis Exceptional Cases: Cannot be computed by StridedBatchedGEMM. Case Contraction 3.4 C mnp = A nk B pkm 6.4 C mnp = A kp B nkm C mnp = A mkp B mkn C mnp = A pkm B nkp Example of exceptional cases. These cases are precisely the interleaved GEMMs. When batching index is the major index in an argument: That argument is interpreted as interleaved matrices. May be one or both inputs and/or output. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

34 3D Tiled GEMM Implement GEMM with a 3D tile: Transpositions performed on the way to smem/reg. Keep canonical GEMM core. Considers three modes rather than two: Major mode: A mnkpqr Reduction mode: A mnkpqr Aux (batch,row,col) mode: A mnkpqr (Optional) Third tile dimension interpolates between pure GEMM and interleaved GEMM. Nested loop over remaining modes performs full contraction. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

35 3D Tiled GEMM Tilesize tuning with PPCG for exceptional cases: 6.5 (128,1) Blocking (p, k) (64,1) (32,1) (16,1) (8,1) (4,1) (2,1) (1,1) (1,1) (2,1) (4,1) (8,1) (16,1) (32,1) (64,1) (128,1) Blocking (m, n) Time [µs] PPCG BATCHEDGEMV BATCHEDGEMM GEAM n Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

36 3D Tiled GEMM C mnp = A mkp B nkp : Increasing BLK P decreases effective tile size. C pmn = A pmk B pnk : Increasing BLK P increases cache line utilization. e.g. BLK P= 1, 2, 4, 8 BLK P = 1 equivalent to BLIS (strides in row and column) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

37 3D Tiled GEMM Interface? Extend the StridedBatchedGEMM transpose parameters? C mnp A pmk B pkn EX N EX N C mpn = A pmk B pnk EX N EX T C pmn A pkm B pkn EX T EX N A pkm B pnk EX T EX T Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

38 contract contract(cublas::par, alpha, A, {M,P,K}, _< m, p, k >, B, {K,N,P}, _< k, n, p >, beta, C, _< m, n, p >); ROWIDX COLIDX BATIDX REDIDX Kernel * dot XXX (c p = a p b p) XXX (c p = a pk b pk ) scal gemv dgmm batch gemm (XXX: exceptional) ger gemm XXX (batch gemm[k = 1]? batch ger?) batch gemm (XXX: exceptional) Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

39 Applications: Tucker Decomposition T mnp = G ijk A mi B nj C pk Cpk T G mnp ijk A mi B nj Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

40 Applications: Tucker Decomposition T mnp = G ijk A mi B nj C pk T G mnp ijk A Cpk mi B nj Main steps in the algorithm Y mjk = T mnp Bnj t C pk t Y ink = T mnp A t+1 mi Cpk t Y ijp = T mnp B t+1 nj A t+1 mi Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

41 Applications: Tucker Decomposition Performance on Tucker decomposition: Time (sec) TensorToolbox BTAS Cyclops CPU Batched GPU Batched n Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

42 Applications: FFT Low-Communication FFT for multiple GPUs. StridedBatchedGEMM composes 75%+ of the runtime. Essential to the performance. Two custom kernels are precisely interleaved GEMMs. 2 P100 GPUs: 1.3x over cufftxt. 8 P100 GPUs: 2.1x over cufftxt. Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

43 Conclusion StridedBatchedGEMM in cublas for generalized tensor contractions. Avoid explicit transpositions or permutations. 10x(GPU) and 2x(CPU) speedup on small/moderate sized tensors. Available in cublas 8.0 Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

44 Conclusion StridedBatchedGEMM in cublas for generalized tensor contractions. Avoid explicit transpositions or permutations. 10x(GPU) and 2x(CPU) speedup on small/moderate sized tensors. Available in cublas 8.0 Future work: Exceptional case kernels/performance/interface?? Library Optimizations Matrix stride zero Persistent Matrix Strided Batched GEMM Staged RNNs: Staged Persistent Matrix Strided Batched GEMM Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

45 Thank you! Questions? Cris Cecka (NVIDIA) Tensor Contractions cublas February 25, / 30

TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS. Cris Cecka Senior Research Scientist, NVIDIA GTC 2018

TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS. Cris Cecka Senior Research Scientist, NVIDIA GTC 2018 TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS Cris Cecka Senior Research Scientist, NVIDIA GTC 2018 Tensors Computations and the GPU AGENDA Tensor Networks and Decompositions Tensor Layers in

More information

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries

Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou (ctheodos@grid.auth.gr) User and Application Support Scientific Computing Centre @ AUTH Presentation Outline

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

Strassen s Algorithm for Tensor Contraction

Strassen s Algorithm for Tensor Contraction Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,

More information

arxiv: v1 [cs.lg] 26 Jul 2017

arxiv: v1 [cs.lg] 26 Jul 2017 Tensor Regression Networks Jean Kossaifi Amazon AI Imperial College London jean.kossaifi@imperial.ac.uk Zachary Lipton Amazon AI University of California, San Diego zlipton@cs.ucsd.edu arxiv:1707.08308v1

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

Performance Prediction for Tensor Contractions

Performance Prediction for Tensor Contractions 1 / 18 Performance Prediction for Tensor Contractions Paolo Bientinesi, Edoardo Di Napoli, Diego Fabregat, Elmar Peise AICES, RWTH Aachen pauldj@aices.rwth-aachen.de June 3rd, 214 PASCConference 14 Zürich,

More information

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO TIM DAVIS, PROFESSOR, CSE, TEXAS

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

Using BLIS for tensor computations in Q-Chem

Using BLIS for tensor computations in Q-Chem Using BLIS for tensor computations in Q-Chem Evgeny Epifanovsky Q-Chem BLIS Retreat, September 19 20, 2016 Q-Chem is an integrated software suite for modeling the properties of molecular systems from first

More information

Level-3 BLAS on a GPU

Level-3 BLAS on a GPU Level-3 BLAS on a GPU Picking the Low Hanging Fruit Francisco Igual 1 Gregorio Quintana-Ortí 1 Robert A. van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón

More information

arxiv: v3 [cs.ms] 7 Nov 2017

arxiv: v3 [cs.ms] 7 Nov 2017 A Design of a High-Performance GEMM-like Tensor-Tensor Multiplication Paul Springer, AICES, RWTH Aachen Paolo Bientinesi, AICES, RWTH Aachen arxiv:1607.00145v3 [cs.ms] 7 Nov 2017 We present GEMM-like Tensor-Tensor

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:

Implementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint: Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 212 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University

More information

A Design of a High-Performance GEMM-like Tensor-Tensor Multiplication

A Design of a High-Performance GEMM-like Tensor-Tensor Multiplication A Design of a High-Performance GEMM-like Tensor-Tensor Multiplication Paul Springer, AICES, RWTH Aachen Paolo Bientinesi, AICES, RWTH Aachen We present GEMM-like Tensor-Tensor multiplication (GETT), a

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Dictionary Learning Using Tensor Methods

Dictionary Learning Using Tensor Methods Dictionary Learning Using Tensor Methods Anima Anandkumar U.C. Irvine Joint work with Rong Ge, Majid Janzamin and Furong Huang. Feature learning as cornerstone of ML ML Practice Feature learning as cornerstone

More information

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs

MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs MagmaDNN High-Performance Data Analytics for Manycore GPUs and CPUs Lucien Ng The Chinese University of Hong Kong Kwai Wong The Joint Institute for Computational Sciences (JICS), UTK and ORNL Azzam Haidar,

More information

arxiv: v1 [cs.ce] 31 Oct 2016

arxiv: v1 [cs.ce] 31 Oct 2016 Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods arxiv:1611.00606v1 [cs.ce] 31 Oct 2016 Diego Fabregat-Traver, Davor Davidović, Markus Höhnerbach, and Edoardo di Napoli

More information

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia

Tensor Decompositions for Machine Learning. G. Roeder 1. UBC Machine Learning Reading Group, June University of British Columbia Network Feature s Decompositions for Machine Learning 1 1 Department of Computer Science University of British Columbia UBC Machine Learning Group, June 15 2016 1/30 Contact information Network Feature

More information

CS 542G: Conditioning, BLAS, LU Factorization

CS 542G: Conditioning, BLAS, LU Factorization CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles

More information

AMLD Deep Learning in PyTorch. 2. PyTorch tensors

AMLD Deep Learning in PyTorch. 2. PyTorch tensors AMLD Deep Learning in PyTorch 2. PyTorch tensors François Fleuret http://fleuret.org/amld/ February 10, 2018 ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE PyTorch s tensors François Fleuret AMLD Deep Learning

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Non-convex Robust PCA: Provable Bounds

Non-convex Robust PCA: Provable Bounds Non-convex Robust PCA: Provable Bounds Anima Anandkumar U.C. Irvine Joint work with Praneeth Netrapalli, U.N. Niranjan, Prateek Jain and Sujay Sanghavi. Learning with Big Data High Dimensional Regime Missing

More information

Empowering Scientists with Domain Specific Languages

Empowering Scientists with Domain Specific Languages Empowering Scientists with Domain Specific Languages Julian Kunkel, Nabeeh Jum ah Scientific Computing Department of Informatics University of Hamburg SciCADE2017 2017-09-13 Outline 1 Developing Scientific

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Oreste Villa, Antonino Tumeo Pacific Northwest Na/onal Laboratory (PNNL) 1 Introduction! PNNL Laboratory Directed Research &

More information

arxiv: v2 [cs.lg] 9 May 2018

arxiv: v2 [cs.lg] 9 May 2018 TensorLy: Tensor Learning in Python arxiv:1610.09555v2 [cs.lg] 9 May 2018 Jean Kossaifi 1 jean.kossaifi@imperial.ac.uk Yannis Panagakis 1,2 i.panagakis@imperial.ac.uk Anima Anandkumar 3,4 anima@amazon.com

More information

Tensor Network Computations in Quantum Chemistry. Charles F. Van Loan Department of Computer Science Cornell University

Tensor Network Computations in Quantum Chemistry. Charles F. Van Loan Department of Computer Science Cornell University Tensor Network Computations in Quantum Chemistry Charles F. Van Loan Department of Computer Science Cornell University Joint work with Garnet Chan, Department of Chemistry and Chemical Biology, Cornell

More information

Lecture 4. Tensor-Related Singular Value Decompositions. Charles F. Van Loan

Lecture 4. Tensor-Related Singular Value Decompositions. Charles F. Van Loan From Matrix to Tensor: The Transition to Numerical Multilinear Algebra Lecture 4. Tensor-Related Singular Value Decompositions Charles F. Van Loan Cornell University The Gene Golub SIAM Summer School 2010

More information

INITIAL INTEGRATION AND EVALUATION

INITIAL INTEGRATION AND EVALUATION INITIAL INTEGRATION AND EVALUATION OF SLATE PARALLEL BLAS IN LATTE Marc Cawkwell, Danny Perez, Arthur Voter Asim YarKhan, Gerald Ragghianti, Jack Dongarra, Introduction The aim of the joint milestone STMS10-52

More information

Practical Combustion Kinetics with CUDA

Practical Combustion Kinetics with CUDA Funded by: U.S. Department of Energy Vehicle Technologies Program Program Manager: Gurpreet Singh & Leo Breton Practical Combustion Kinetics with CUDA GPU Technology Conference March 20, 2015 Russell Whitesides

More information

Towards Fast, Accurate and Reproducible LU Factorization

Towards Fast, Accurate and Reproducible LU Factorization Towards Fast, Accurate and Reproducible LU Factorization Roman Iakymchuk 1, David Defour 2, and Stef Graillat 3 1 KTH Royal Institute of Technology, CSC, CST/PDC 2 Université de Perpignan, DALI LIRMM 3

More information

Orthogonal tensor decomposition

Orthogonal tensor decomposition Orthogonal tensor decomposition Daniel Hsu Columbia University Largely based on 2012 arxiv report Tensor decompositions for learning latent variable models, with Anandkumar, Ge, Kakade, and Telgarsky.

More information

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU April 4-7, 2016 Silicon Valley HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU Minmin Sun, NVIDIA minmins@nvidia.com April 5th Brief Introduction of CTC AGENDA Alpha/Beta Matrix

More information

Tensor networks and deep learning

Tensor networks and deep learning Tensor networks and deep learning I. Oseledets, A. Cichocki Skoltech, Moscow 26 July 2017 What is a tensor Tensor is d-dimensional array: A(i 1,, i d ) Why tensors Many objects in machine learning can

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Learning Deep Architectures for AI. Part II - Vijay Chakilam

Learning Deep Architectures for AI. Part II - Vijay Chakilam Learning Deep Architectures for AI - Yoshua Bengio Part II - Vijay Chakilam Limitations of Perceptron x1 W, b 0,1 1,1 y x2 weight plane output =1 output =0 There is no value for W and b such that the model

More information

The Solution of a FEM Equation in Frequency Domain Using a Parallel Computing with CUBLAS

The Solution of a FEM Equation in Frequency Domain Using a Parallel Computing with CUBLAS The Solution of a FEM Equation in Frequency Domain Using a Parallel Computing with CUBLAS R. Dominguez 1, A. Medina 1, and A. Ramos-Paz 1 1 Facultad de Ingeniería Eléctrica, División de Estudios de Posgrado,

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Parallel Transposition of Sparse Data Structures

Parallel Transposition of Sparse Data Structures Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing

More information

22 : Hilbert Space Embeddings of Distributions

22 : Hilbert Space Embeddings of Distributions 10-708: Probabilistic Graphical Models 10-708, Spring 2014 22 : Hilbert Space Embeddings of Distributions Lecturer: Eric P. Xing Scribes: Sujay Kumar Jauhar and Zhiguang Huo 1 Introduction and Motivation

More information

Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing

Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing Solving PDEs: the Poisson problem TMA4280 Introduction to Supercomputing Based on 2016v slides by Eivind Fonn NTNU, IMF February 27. 2017 1 The Poisson problem The Poisson equation is an elliptic partial

More information

ECE521 W17 Tutorial 1. Renjie Liao & Min Bai

ECE521 W17 Tutorial 1. Renjie Liao & Min Bai ECE521 W17 Tutorial 1 Renjie Liao & Min Bai Schedule Linear Algebra Review Matrices, vectors Basic operations Introduction to TensorFlow NumPy Computational Graphs Basic Examples Linear Algebra Review

More information

ECE 598: Representation Learning: Algorithms and Models Fall 2017

ECE 598: Representation Learning: Algorithms and Models Fall 2017 ECE 598: Representation Learning: Algorithms and Models Fall 2017 Lecture 1: Tensor Methods in Machine Learning Lecturer: Pramod Viswanathan Scribe: Bharath V Raghavan, Oct 3, 2017 11 Introduction Tensors

More information

Tensor Methods for Feature Learning

Tensor Methods for Feature Learning Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to

More information

Lecture: Numerical Linear Algebra Background

Lecture: Numerical Linear Algebra Background 1/36 Lecture: Numerical Linear Algebra Background http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s and Michael Grant s lecture notes

More information

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs

Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Enhancing Performance of Tall-Skinny QR Factorization using FPGAs Abid Rafique Imperial College London August 31, 212 Enhancing Performance of Tall-Skinny QR Factorization using FPGAs 1/18 Our Claim Common

More information

Efficient CP-ALS and Reconstruction From CP

Efficient CP-ALS and Reconstruction From CP Efficient CP-ALS and Reconstruction From CP Jed A. Duersch & Tamara G. Kolda Sandia National Laboratories Livermore, CA Sandia National Laboratories is a multimission laboratory managed and operated by

More information

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints

Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Anima Anandkumar U.C. Irvine Joint work with Daniel Hsu, Majid Janzamin Adel Javanmard and Sham Kakade.

More information

Matrix-Matrix Multiplication

Matrix-Matrix Multiplication Week5 Matrix-Matrix Multiplication 51 Opening Remarks 511 Composing Rotations Homework 5111 Which of the following statements are true: cosρ + σ + τ cosτ sinτ cosρ + σ sinρ + σ + τ sinτ cosτ sinρ + σ cosρ

More information

RWTH Aachen University

RWTH Aachen University IPCC @ RWTH Aachen University Optimization of multibody and long-range solvers in LAMMPS Rodrigo Canales William McDoniel Markus Höhnerbach Ahmed E. Ismail Paolo Bientinesi IPCC Showcase November 2016

More information

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS Alan Gray, Developer Technology Engineer, NVIDIA ECMWF 18th Workshop on high performance computing in meteorology, 28 th September 2018 ESCAPE NVIDIA s

More information

Linear Algebra (Review) Volker Tresp 2017

Linear Algebra (Review) Volker Tresp 2017 Linear Algebra (Review) Volker Tresp 2017 1 Vectors k is a scalar (a number) c is a column vector. Thus in two dimensions, c = ( c1 c 2 ) (Advanced: More precisely, a vector is defined in a vector space.

More information

Ilya A. Kaliman* and Anna I. Krylov. Introduction

Ilya A. Kaliman* and Anna I. Krylov. Introduction SOFTWARE NEWS AND UPDATES WWW.C-CHEM.ORG New Algorithm for Tensor Contractions on Multi-Core CPUs, GPUs, and Accelerators Enables CCSD and EOM-CCSD Calculations with over 1000 Basis Functions on a Single

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

CVPR A New Tensor Algebra - Tutorial. July 26, 2017

CVPR A New Tensor Algebra - Tutorial. July 26, 2017 CVPR 2017 A New Tensor Algebra - Tutorial Lior Horesh lhoresh@us.ibm.com Misha Kilmer misha.kilmer@tufts.edu July 26, 2017 Outline Motivation Background and notation New t-product and associated algebraic

More information

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

CME 335 Project Final Report: A nested dissection implementation with updating

CME 335 Project Final Report: A nested dissection implementation with updating CME 335 Project Final Report: A nested dissection implementation with updating Victor Minden March 18, 2015 1 Introduction Many partial differential equations (PDEs) arising from mathematical models of

More information

FEAST at Play: Feature ExtrAction using Score function Tensors

FEAST at Play: Feature ExtrAction using Score function Tensors JMLR: Workshop and Conference Proceedings 44 (2015) 130-144 NIPS 2015 The 1st International Workshop Feature Extraction: Modern Questions and Challenges FEAST at Play: Feature ExtrAction using Score function

More information

Linear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4

Linear Algebra Section 2.6 : LU Decomposition Section 2.7 : Permutations and transposes Wednesday, February 13th Math 301 Week #4 Linear Algebra Section. : LU Decomposition Section. : Permutations and transposes Wednesday, February 1th Math 01 Week # 1 The LU Decomposition We learned last time that we can factor a invertible matrix

More information

Spectral Learning of Hidden Markov Models with Group Persistence

Spectral Learning of Hidden Markov Models with Group Persistence 000 00 00 00 00 00 00 007 008 009 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 Abstract In this paper, we develop a general Method of Moments

More information

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8

More information

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with

More information

I Don t Care about BLAS

I Don t Care about BLAS I Don t Care about BLAS Devin Matthews Institute for Computational Engineering and Sciences, UT Austin #smallmoleculesmatter BLIS Retreat, Sept. 18, 217 1 From QC to DGEMM H ˆR i = E ˆR i Simple eigenproblem

More information

Dynamic Scheduling within MAGMA

Dynamic Scheduling within MAGMA Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem

Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem Mark Gates 1, Azzam Haidar 1, and Jack Dongarra 1,2,3 1 University of Tennessee, Knoxville, TN, USA 2 Oak Ridge National

More information

Appendix A. Proof to Theorem 1

Appendix A. Proof to Theorem 1 Appendix A Proof to Theorem In this section, we prove the sample complexity bound given in Theorem The proof consists of three main parts In Appendix A, we prove perturbation lemmas that bound the estimation

More information

BOOLEAN TENSOR FACTORIZATIONS. Pauli Miettinen 14 December 2011

BOOLEAN TENSOR FACTORIZATIONS. Pauli Miettinen 14 December 2011 BOOLEAN TENSOR FACTORIZATIONS Pauli Miettinen 14 December 2011 BACKGROUND: TENSORS AND TENSOR FACTORIZATIONS X BACKGROUND: TENSORS AND TENSOR FACTORIZATIONS C X A B BACKGROUND: BOOLEAN MATRIX FACTORIZATIONS

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

The Convolution Operation

The Convolution Operation The Convolution Operation Sargur Srihari srihari@buffalo.edu 1 Topics in Convolutional Networks Overview 1. The Convolution Operation 2. Motivation 3. Pooling 4. Convolution and Pooling as an Infinitely

More information

Matrix-Matrix Multiplication

Matrix-Matrix Multiplication Chapter Matrix-Matrix Multiplication In this chapter, we discuss matrix-matrix multiplication We start by motivating its definition Next, we discuss why its implementation inherently allows high performance

More information

Balanced Dense Polynomial Multiplication on Multicores

Balanced Dense Polynomial Multiplication on Multicores Balanced Dense Polynomial Multiplication on Multicores Yuzhen Xie SuperTech Group, CSAIL MIT joint work with Marc Moreno Maza ORCCA, UWO ACA09, Montreal, June 26, 2009 Introduction Motivation: Multicore-enabling

More information

Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL

Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL Bachelor-thesis: GPU-Acceleration of Linear Algebra using OpenCL Andreas Falkenstrøm Mieritz s093065 September 13, 2012 Supervisors: Allan Ensig-Peter Karup Bernd Dammann IMM-B.Sc.-2012-30 Contents 1 Problem

More information

Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition

Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition Discovering Fast Matrix Multiplication Algorithms via Tensor Decomposition Grey Ballard SIAM onference on omputational Science & Engineering March, 207 ollaborators This is joint work with... Austin Benson

More information

Performance optimization of WEST and Qbox on Intel Knights Landing

Performance optimization of WEST and Qbox on Intel Knights Landing Performance optimization of WEST and Qbox on Intel Knights Landing Huihuo Zheng 1, Christopher Knight 1, Giulia Galli 1,2, Marco Govoni 1,2, and Francois Gygi 3 1 Argonne National Laboratory 2 University

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Symmetric rank-2k update on GPUs and/or multi-cores

Symmetric rank-2k update on GPUs and/or multi-cores Symmetric rank-2k update on GPUs and/or multi-cores Assignment 2, 5DV050, Spring 2012 Due on May 4 (soft) or May 11 (hard) at 16.00 Version 1.0 1 Background and motivation Quoting from Beresford N. Parlett's

More information

Scalar and Tensor Parameters for Importing Tensor Index Notation including Einstein Summation Notation

Scalar and Tensor Parameters for Importing Tensor Index Notation including Einstein Summation Notation 1 Scalar and Tensor Parameters for Importing Tensor Index Notation including Einstein Summation Notation SATOSHI EGI, Rakuten Institute of Technology In this paper, we propose a method for importing tensor

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points Michael Griebel Christian Rieger Peter Zaspel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität

More information

Section Partitioned Matrices and LU Factorization

Section Partitioned Matrices and LU Factorization Section 2.4 2.5 Partitioned Matrices and LU Factorization Gexin Yu gyu@wm.edu College of William and Mary partition matrices into blocks In real world problems, systems can have huge numbers of equations

More information

On the design of parallel linear solvers for large scale problems

On the design of parallel linear solvers for large scale problems On the design of parallel linear solvers for large scale problems ICIAM - August 2015 - Mini-Symposium on Recent advances in matrix computations for extreme-scale computers M. Faverge, X. Lacoste, G. Pichon,

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs

Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Performance of Random Sampling for Computing Low-rank Approximations of a Dense Matrix on GPUs Théo Mary, Ichitaro Yamazaki, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, Jack Dongarra presenter 1 Low-Rank

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

ICS 6N Computational Linear Algebra Matrix Algebra

ICS 6N Computational Linear Algebra Matrix Algebra ICS 6N Computational Linear Algebra Matrix Algebra Xiaohui Xie University of California, Irvine xhx@uci.edu February 2, 2017 Xiaohui Xie (UCI) ICS 6N February 2, 2017 1 / 24 Matrix Consider an m n matrix

More information

Note Set 3 Numerical Linear Algebra

Note Set 3 Numerical Linear Algebra Note Set 3 Numerical Linear Algebra 3.1 Overview Numerical linear algebra includes a class of problems which involves multiplying vectors and matrices, solving linear equations, computing the inverse of

More information

EE-559 Deep learning 1.4. Tensor basics and linear regression. François Fleuret Mon Mar 18 20:04:34 UTC 2019

EE-559 Deep learning 1.4. Tensor basics and linear regression. François Fleuret   Mon Mar 18 20:04:34 UTC 2019 EE-559 Deep learning 1.4. Tensor basics and linear regression François Fleuret https://fleuret.org/ee559/ Mon Mar 18 20:04:34 UTC 2019 A tensor is a generalized matrix, a finite table of numerical values

More information

Accelerating and Scaling Lanczos Diagonalization with GPGPU

Accelerating and Scaling Lanczos Diagonalization with GPGPU Accelerating and Scaling Lanczos Diagonalization with GPGPU Bill Brouwer, Filippo Spiga, Pierre-Yves Taunay, Sreejith GJ Nvidia GTC 2013 Outline Introduction Applications QE FQHE Theory Diagonalization

More information

Optimizing Time Integration of Chemical-Kinetic Networks for Speed and Accuracy

Optimizing Time Integration of Chemical-Kinetic Networks for Speed and Accuracy Paper # 070RK-0363 Topic: Reaction Kinetics 8 th U. S. National Combustion Meeting Organized by the Western States Section of the Combustion Institute and hosted by the University of Utah May 19-22, 2013

More information

Algorithms and Methods for Fast Model Predictive Control

Algorithms and Methods for Fast Model Predictive Control Algorithms and Methods for Fast Model Predictive Control Technical University of Denmark Department of Applied Mathematics and Computer Science 13 April 2016 Background: Model Predictive Control Model

More information