Parallel Sparse Tensor Decompositions using HiCOO Format
|
|
- Emerald Violet Bennett
- 5 years ago
- Views:
Transcription
1 Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, SIAM ALA 8
2 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs
3 Tensors Social Networks Quantum Chemistry NOT ENOUGH Deep Learning
4 Applications of Tensor Methods 4 Data analytics and compression Natural language processing, healthcare analytics, social network analytics, brain signal processing, and deep learning Scientific computing Quantum chemistry, computational physics, quantum mechanics Demographic Procedure Diagnosis Medication EHR Clinical+ notes Lab+Tests Phenotyping Medical+ Concepts+ (phenotypes)
5 Tensors & Decompositions 5 Tensors, multi-way arrays, provide a natural way to represent multidimensional data. Special cases: matrices - D tensors, vectors - D tensors. Tensor decomposition: an extension of matrix factorization to analyze tensor features. Singular)Value)Decomposition)(SVD) Tensor mode or order: tensor dimension. A SPARSE tensor, a tensor consisting mostly of zero entries, widely exists in real applications. CANDECOMP/PARAFAC CP#Decomposition i =,,I j =,,J k =,,K
6 CP-ALS and CP-APR 6 CP-ALS - alternating least squares For general data Describe input data as Gaussian distribution Fitting function: least squares. Core operation: MTTKRP CP-APR alternating Poisson regression For non-negative data, e.g. count data. Describe input data as Poisson distribution Fitting function:poisson likelihood fitting algorithm Core operation: element-wise division.
7 CP-ALS 7 Bottleneck: MTTKRP
8 MTTKRP Operation 8 Matriced Tensor Times Khatri-Rao Product (MTTKRP) Tensor Matricization Khatri-Rao Product Updated Factor Matrix Factor Matrix
9 MTTKRP Operation 9 Matriced Tensor Times Khatri-Rao Product (MTTKRP) Tensor Matricization Khatri-Rao Product Khatri-Rao Product D C. B Matrix Updated Factor Matrix Factor Matrix R R R I J IJ C B D
10 CP-APR Bottleneck: element-wise product
11 Bottleneck of CP-APR Consider CP-APR MU Stored in sparse format, size nnz * R
12 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs
13 Mode Orientation Tensor decomposition Kernel in Mode- Mode Oriented Mode- oriented (CSF) Neutral Mode Orientation Coordinate (COO) Efficient Inefficient Kernel in Mode-?? Kernel in Mode-??
14 Mode Oriented Formats 4 Three CSF/F-COO representations are required/preferred for the three MTTKRPs. CP Decomposition MTTKRP in Mode- i j CSF- k val More storage MTTKRP in Mode- j i k CSF- val MTTKRP in Mode- k i CSF- j val
15 Mode Oriented Formats 5 Three CSF/F-COO representations are required/preferred for the three MTTKRPs. CP Decomposition MTTKRP in Mode- i j CSF- k val MTTKRP in Mode- Performance payoff MTTKRP in Mode-
16 Current Sparse Tensor Formats 6 COO: coordinate formats [Bader et al., 6] CSF: Compressed Sparse Fibers, extension of CSR. [Smith et al. 5] F-COO: Flagged COO format [Liu et al., 7] 4 i =,,I j =,,J 4 k =,,K
17 Current Sparse Tensor Formats 7 COO: coordinate formats [Bader et al., 6] CSF: Compressed Sparse Fibers, extension of CSR. [Smith et al. 5] F-COO: Flagged COO format [Liu et al., 7] 4 i =,,I j =,,J 4 k =,,K i j k val i j k val sf[]= sf[]= bf j k val (a) COO (b) CSF (c) F-COO
18 Current Sparse Tensor Formats 8 COO: coordinate formats [Bader et al., 6] CSF: Compressed Sparse Fibers, extension of CSR. [Smith et al. 5] F-COO: Flagged COO format [Liu et al., 7] 4 i =,,I j =,,J 4 k =,,K i j k val i j k val sf[]= sf[]= bf j k val (a) COO (b) CSF (c) F-COO neutral mode orientation mode oriented, prefer different representations for different modes.
19 HiCOO Format 9 Store a sparse tensor in units of small sparse blocks, saving storage Shorten the bit-length of element indices Compress the number of block indices an improvement of the Compressed Sparse Blocks (CSB) format -bit -bit 8-bit i j k val bptr bi bj bk ei ej ek val 4 B B 4 For the tensor: Reduce its storage and memory footprints 5 6 B For matrices: Better data locality 7 8 B (a) COO (b) HiCOO i = bi * B + ei
20 Construct HiCOO Representation Morton-order sorting for the input COO tensor. Partition to blocks and determine each block location. Construct HiCOO representation by compressing index length. bits int i j k val COO 7 8 Z-Order Sorting int long int long int byte i j k val Partition bptr 4 6 i j k val Compress i = bi*b+ei; j = bj*b+ej; k= bk*b+ek 6 8 B B B B bptr 4 6 bi bj bk ei ej HiCOO ek val
21 Outline Background HiCOO Format Multicore CP-ALS Distributed CPDs
22 Two-level Blocking for Efficient Thread Parallelism Use two-level blocking strategy Large bulks (logical) + small blocks (physical) To avoid using locks, we add a bit extra storage for scheduling information. y No Write Conflicts iter iter iter Bulk Bulk 4 Bulk 8 Bulk Bulk 5 Bulk 9 Bulk Bulk 6 Bulk Bulk 7 Write Conflicts x bulk bptr bi bj bk ei ej ek val B i =,,I B B j =,,J k =,,K B 6 HiCOO 6 8
23 Platform and Dataset Platform: Intel Xeon CPU E7-485 platform consisting 56 physical cores with gcc 5.4. and parallelized by OpenMP. Dataset: FROSTT [Smith et al. 7] and HaTen [Jeon et al. 5] J. Li, J. Sun, R. Vuduc. HiCOO: Hierarchical Storage of Sparse Tensors. 8
24 Multicore MTTKRP 4 ParTI! library: Speedups of HiCOO over COO Speedup mode nell choa darpa fb-m fb-s deli nell crime nips enron flickr deli4d -D Tensors 4-D Tensors.8x SPLATT library: Speedups of HiCOO over CSF Storage size: Speedup mode nell choa darpa fb-m fb-s deli nell crime nips enron flickr deli4d -D Tensors 4-D Tensors.x comparable
25 Multicore CP-ALS 5 HiCOO outperforms COO by up to 9.4 and CSF up to.4. Normalized Time HiCOO CSF COO. nell choa darpa fb-m fb-s deli nell crime nips enron flickr -D Tensors 4-D Tensors deli4d
26 Outline 6 Background HiCOO Format Multicore CP-ALS Distributed CPDs
27 Data Distribution 7 Tensor: Two-level blocking Bulk level: node partitioning to ensure nonzero balance Block level: ensure fast local computation Factor matrices Each processor owns the portion of factor matrices its local tensor needs. Some duplication exists. i =,,I j =,,J bulk k =,,K B B B B bptr 4 6 bi bj bk ei HiCOO ej ek val
28 Data Distribution Cont. 8 X: 4*4*4 sparse tensor Processors are split to ** grids. Use sub-communicator to update factor matrices. A(:) B(:) C(:) A(:) B(:) C(:) P(,,) P(,,) A(:) B(:) C(:) A(:) B(:) C(:) P(,,) P(,,)
29 Platform and Dataset 9 Platform: Cori supercomputer in NERSC, which is Cray XC4. Each node is Intel Xeon CPU E5-698 v ( Haswell ) consisting physical cores. We use Intel MPI 8 version. We test to 4 processors with processors per node. Dataset: Synthetic tensors from Poisson distribution.
30 Distributed MTTKRP Performance COO HiCOO Time (s). syn syn.5 syn #Procs 5 4 Time (s).6 syn.5 syn syn #Procs MTTKRP time using 4 procs Time (s) HiCOO COO.. syn syn Tensors syn
31 Distributed CP-ALS Performance Time (s).5..5 COO.5 syn syn. syn #Procs 5 4 Time (s) HiCOO. syn syn. syn #Procs 5 4 CP-ALS time using 4 procs Time (s).8 HiCOO.7 COO syn syn syn Tensors
32 Distributed CompPhi Performance COO HiCOO Time (s) syn syn 8 syn #Procs 5 4 Time (s) #Procs 5 syn syn syn 4 CompPhi time using 4 procs Time (s) HiCOO COO. syn syn Tensors syn
33 Distributed CP-APR Performance COO HiCOO Time (s) 5 syn syn syn #Procs 5 4 Time (s) #Procs 5 syn syn syn 4 CP-APR time using 4 procs Time (s).8 HiCOO.7 COO syn syn syn Tensors
34 Conclusion and Future Work 4 Conclusion HiCOO WORKS! Future work Combine multicore and distributed parallelism together to build hybrid MPI+OpenMP HiCOO CPDs. Also include our GPU implementations. Consider CP-APR algorithms to Newton-based optimization methods.
35 ParTI! Project 5 Hardware Multicore8 CPUs GPUs Distributed8 Memory Emu ParTI!'Library Sparse Baseline8Tensor8 Routines Tensor8 Decomposition Algorithms Code: jiajiali@gatech.edu
36 Contributors 6 Jiajia Li PhD, GaTech & PNNL* * From Aug 8 Yuchen Ma Undergrad, HDU Nick Liu Undergrad, GaTech Srinivas Eswar PhD, GaTech Junghyun Kim Undergrad, GaTech Richard Vuduc Professor, GaTech
37 Acknowledgement 7
38 Related Work 8 S. Smith et al. A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization, IPDPS 6 O. Kaya et al. Scalable Sparse Tensor Decompositions in Distributed Memory Systems, SC 5 E. Solomonik et al. Sparse Tensor Algebra as a Parallel Programming Model, TR, 5 S. Rajbhandari et al. A communication-optimal framework for contracting distributed tensors, SC'4 W. Austin et al. Parallel Tensor Compression for Large-Scale Scientific Data, IPDPS 6 J. Choi et al. DFacTo: Distributed Factorization of Tensors, NIPS 4
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
A Unified Optimization Approach for Sparse Tensor Operations on GPUs Bangtian Liu, Chengyao Wen, Anand D. Sarwate, Maryam Mehri Dehnavi Rutgers, The State University of New Jersey {bangtian.liu, chengyao.wen,
More informationMemory-efficient Parallel Tensor Decompositions
Patent Pending Memory-efficient Parallel Tensor Decompositions Muthu Baskaran, Tom Henretty, Benoit Pradelle, M. Harper Langston, David Bruns-Smith, James Ezick, Richard Lethin Reservoir Labs Inc. New
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationModel-Driven Sparse CP Decomposition for Higher-Order Tensors
7 IEEE International Parallel and Distributed Processing Symposium Model-Driven Sparse CP Decomposition for Higher-Order Tensors Jiajia Li, Jee Choi, Ioakeim Perros, Jimeng Sun, Richard Vuduc Computational
More informationTensor-Matrix Products with a Compressed Sparse Tensor
Tensor-Matrix Products with a Compressed Sparse Tensor Shaden Smith shaden@cs.umn.edu Department of Computer Science and Engineering University of Minnesota, USA George Karypis karypis@cs.umn.edu ABSTRACT
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More informationParallel Transposition of Sparse Data Structures
Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing
More informationCyclops Tensor Framework
Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r
More informationSparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory
2017 IEEE International Parallel and Distributed Processing Symposium Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory Shaden Smith, Jongsoo Park, George Karypis Department
More informationAn Investigation of Sparse Tensor Formats for Tensor Libraries. Parker Allen Tew
An Investigation of Sparse Tensor Formats for Tensor Libraries by Parker Allen Tew S.B., Massachusetts Institute of Technology (2015) Submitted to the Department of Electrical Engineering and Computer
More informationIntroduction to numerical computations on the GPU
Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming
More informationPerformance Analysis of Lattice QCD Application with APGAS Programming Model
Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models
More informationSparse solver 64 bit and out-of-core addition
Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract
More informationHigh Performance Parallel Tucker Decomposition of Sparse Tensors
High Performance Parallel Tucker Decomposition of Sparse Tensors Oguz Kaya INRIA and LIP, ENS Lyon, France SIAM PP 16, April 14, 2016, Paris, France Joint work with: Bora Uçar, CNRS and LIP, ENS Lyon,
More informationCSTF: Large-Scale Sparse Tensor Factorizations on Distributed Platforms
CSTF: Large-Scale Sparse Tensor Factorizations on Distributed Platforms Zachary Blanco Rutgers University zac.blanco@rutgers.edu Bangtian Liu Rutgers University bangtian.liu@rutgers.edu Maryam Mehri Dehnavi
More informationMultilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota
Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationA Medium-Grained Algorithm for Distributed Sparse Tensor Factorization
A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization Shaden Smith, George Karypis Department of Computer Science and Engineering, University of Minnesota {shaden, karypis}@cs.umn.edu
More informationSolution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI
Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Sagar Bhatt Person Number: 50170651 Department of Mechanical and Aerospace Engineering,
More informationEfficient algorithms for symmetric tensor contractions
Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to
More informationA Computation- and Communication-Optimal Parallel Direct 3-body Algorithm
A Computation- and Communication-Optimal Parallel Direct 3-body Algorithm Penporn Koanantakool and Katherine Yelick {penpornk, yelick}@cs.berkeley.edu Computer Science Division, University of California,
More informationLarge-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors
Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr) Principal Researcher / Korea Institute of Science and Technology
More informationCyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions
Cyclops Tensor Framework: reducing communication and eliminating load imbalance in massively parallel contractions Edgar Solomonik 1, Devin Matthews 3, Jeff Hammond 4, James Demmel 1,2 1 Department of
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 27, 2015 Outline Linear regression Ridge regression and Lasso Time complexity (closed form solution) Iterative Solvers Regression Input: training
More informationScalable and Power-Efficient Data Mining Kernels
Scalable and Power-Efficient Data Mining Kernels Alok Choudhary, John G. Searle Professor Dept. of Electrical Engineering and Computer Science and Professor, Kellogg School of Management Director of the
More informationStrassen s Algorithm for Tensor Contraction
Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,
More informationSparse LU Factorization on GPUs for Accelerating SPICE Simulation
Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,
More informationAccelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationTechnical Report TR SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union Street SE Minneapolis, MN 55455-0159 USA TR 15-008 SPLATT: Efficient and Parallel Sparse
More informationComputers and Mathematics with Applications
Computers and Mathematics with Applications 68 (2014) 1151 1160 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa A GPU
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationSPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics
SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS
More informationStatic-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems
Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse
More informationClaude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique
Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)
More informationA Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters
A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!
More informationCA-SVM: Communication-Avoiding Support Vector Machines on Distributed System
CA-SVM: Communication-Avoiding Support Vector Machines on Distributed System Yang You 1, James Demmel 1, Kent Czechowski 2, Le Song 2, Richard Vuduc 2 UC Berkeley 1, Georgia Tech 2 Yang You (Speaker) James
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationConstrained Tensor Factorization with Accelerated AO-ADMM
Constrained Tensor actorization with Accelerated AO-ADMM Shaden Smith, Alec Beri, George Karypis Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA Department of
More informationOptimization of Symmetric Tensor Computations
Optimization of Symmetric Tensor Computations Jonathon Cai Department of Computer Science, Yale University New Haven, CT 0650 Email: jonathon.cai@yale.edu Muthu Baskaran, Benoît Meister, Richard Lethin
More informationParallel Polynomial Evaluation
Parallel Polynomial Evaluation Jan Verschelde joint work with Genady Yoffe University of Illinois at Chicago Department of Mathematics, Statistics, and Computer Science http://www.math.uic.edu/ jan jan@math.uic.edu
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationParallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence
Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)
More informationMassively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem
Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing
More informationDFacTo: Distributed Factorization of Tensors
DFacTo: Distributed Factorization of Tensors Joon Hee Choi Electrical and Computer Engineering Purdue University West Lafayette IN 47907 choi240@purdue.edu S. V. N. Vishwanathan Statistics and Computer
More informationEfficient CP-ALS and Reconstruction From CP
Efficient CP-ALS and Reconstruction From CP Jed A. Duersch & Tamara G. Kolda Sandia National Laboratories Livermore, CA Sandia National Laboratories is a multimission laboratory managed and operated by
More informationSolving PDEs with CUDA Jonathan Cohen
Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear
More informationImplementing QR Factorization Updating Algorithms on GPUs. Andrew, Robert and Dingle, Nicholas J. MIMS EPrint:
Implementing QR Factorization Updating Algorithms on GPUs Andrew, Robert and Dingle, Nicholas J. 214 MIMS EPrint: 212.114 Manchester Institute for Mathematical Sciences School of Mathematics The University
More informationReview for the Midterm Exam
Review for the Midterm Exam 1 Three Questions of the Computational Science Prelim scaled speedup network topologies work stealing 2 The in-class Spring 2012 Midterm Exam pleasingly parallel computations
More informationOpen-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --
Parallel Processing for Energy Efficiency October 3, 2013 NTNU, Trondheim, Norway Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer
More informationMassively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling
2019 Intel extreme Performance Users Group (IXPUG) meeting Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationTENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS. Cris Cecka Senior Research Scientist, NVIDIA GTC 2018
TENSOR LAYERS FOR COMPRESSION OF DEEP LEARNING NETWORKS Cris Cecka Senior Research Scientist, NVIDIA GTC 2018 Tensors Computations and the GPU AGENDA Tensor Networks and Decompositions Tensor Layers in
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationGPU Computing Activities in KISTI
International Advanced Research Workshop on High Performance Computing, Grids and Clouds 2010 June 21~June 25 2010, Cetraro, Italy HPC Infrastructure and GPU Computing Activities in KISTI Hongsuk Yi hsyi@kisti.re.kr
More informationParallel Matrix Factorization for Recommender Systems
Under consideration for publication in Knowledge and Information Systems Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon Department of
More information1 Overview. 2 Adapting to computing system evolution. 11 th European LS-DYNA Conference 2017, Salzburg, Austria
1 Overview Improving LSTC s Multifrontal Linear Solver Roger Grimes 3, Robert Lucas 3, Nick Meng 2, Francois-Henry Rouet 3, Clement Weisbecker 3, and Ting-Ting Zhu 1 1 Cray Incorporated 2 Intel Corporation
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationConvergence Models and Surprising Results for the Asynchronous Jacobi Method
Convergence Models and Surprising Results for the Asynchronous Jacobi Method Jordi Wolfson-Pou School of Computational Science and Engineering Georgia Institute of Technology Atlanta, Georgia, United States
More informationQR Decomposition in a Multicore Environment
QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance
More informationComputational Numerical Integration for Spherical Quadratures. Verified by the Boltzmann Equation
Computational Numerical Integration for Spherical Quadratures Verified by the Boltzmann Equation Huston Rogers. 1 Glenn Brook, Mentor 2 Greg Peterson, Mentor 2 1 The University of Alabama 2 Joint Institute
More informationEfficient implementation of the overlap operator on multi-gpus
Efficient implementation of the overlap operator on multi-gpus Andrei Alexandru Mike Lujan, Craig Pelissier, Ben Gamari, Frank Lee SAAHPC 2011 - University of Tennessee Outline Motivation Overlap operator
More informationMARCH 24-27, 2014 SAN JOSE, CA
MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =
More informationPerformance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures
Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures José I. Aliaga Performance and Energy Analysis of the Iterative Solution of Sparse
More informationExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. X, M YYYY 1 ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems Sameh Abdulah, Hatem Ltaief, Ying Sun,
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationConsider the following example of a linear system:
LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x + 2x 2 + 3x 3 = 5 x + x 3 = 3 3x + x 2 + 3x 3 = 3 x =, x 2 = 0, x 3 = 2 In general we want to solve n equations
More informationA Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats
A Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats Peter Ahrens, Helen Xu, and Nicholas Schiefer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute
More informationFast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and
More informationAntti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum
More informationHYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017
HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 6 Structured and Low Rank Matrices Section 6.3 Numerical Optimization Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at
More informationA hybrid Hermitian general eigenvalue solver
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe A hybrid Hermitian general eigenvalue solver Raffaele Solcà *, Thomas C. Schulthess Institute fortheoretical Physics ETHZ,
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationJacobi-Davidson Eigensolver in Cusolver Library. Lung-Sheng Chien, NVIDIA
Jacobi-Davidson Eigensolver in Cusolver Library Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline CuSolver library - cusolverdn: dense LAPACK - cusolversp: sparse LAPACK - cusolverrf: refactorization
More informationPerformance Evaluation of MPI on Weather and Hydrological Models
NCAR/RAL Performance Evaluation of MPI on Weather and Hydrological Models Alessandro Fanfarillo elfanfa@ucar.edu August 8th 2018 Cheyenne - NCAR Supercomputer Cheyenne is a 5.34-petaflops, high-performance
More informationRWTH Aachen University
IPCC @ RWTH Aachen University Optimization of multibody and long-range solvers in LAMMPS Rodrigo Canales William McDoniel Markus Höhnerbach Ahmed E. Ismail Paolo Bientinesi IPCC Showcase November 2016
More informationTransposition Mechanism for Sparse Matrices on Vector Processors
Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands
More informationEveryday Multithreading
Everyday Multithreading Parallel computing for genomic evaluations in R C. Heuer, D. Hinrichs, G. Thaller Institute of Animal Breeding and Husbandry, Kiel University August 27, 2014 C. Heuer, D. Hinrichs,
More informationFINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING
FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel Thuerck 1,2 (advisors Michael Goesele 1,2 and Marc Pfetsch 1 ) Maxim Naumov 3 1 Graduate School of Computational Engineering, TU Darmstadt
More informationAcceleration of WRF on the GPU
Acceleration of WRF on the GPU Daniel Abdi, Sam Elliott, Iman Gohari Don Berchoff, Gene Pache, John Manobianco TempoQuest 1434 Spruce Street Boulder, CO 80302 720 726 9032 TempoQuest.com THE WORLD S FASTEST
More informationUtilisation de la compression low-rank pour réduire la complexité du solveur PaStiX
Utilisation de la compression low-rank pour réduire la complexité du solveur PaStiX 26 Septembre 2018 - JCAD 2018 - Lyon Grégoire Pichon, Mathieu Faverge, Pierre Ramet, Jean Roman Outline 1. Context 2.
More informationA CUDA Solver for Helmholtz Equation
Journal of Computational Information Systems 11: 24 (2015) 7805 7812 Available at http://www.jofcis.com A CUDA Solver for Helmholtz Equation Mingming REN 1,2,, Xiaoguang LIU 1,2, Gang WANG 1,2 1 College
More informationDistributed Machine Learning: A Brief Overview. Dan Alistarh IST Austria
Distributed Machine Learning: A Brief Overview Dan Alistarh IST Austria Background The Machine Learning Cambrian Explosion Key Factors: 1. Large s: Millions of labelled images, thousands of hours of speech
More informationScalable Non-blocking Preconditioned Conjugate Gradient Methods
Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationParallel Tensor Compression for Large-Scale Scientific Data
Parallel Tensor Compression for Large-Scale Scientific Data Woody Austin, Grey Ballard, Tamara G. Kolda April 14, 2016 SIAM Conference on Parallel Processing for Scientific Computing MS 44/52: Parallel
More informationParallel Sparse Matrix - Vector Product
IMM Technical Report 2012-10 Parallel Sparse Matrix - Vector Product - Pure MPI and hybrid MPI-OpenMP implementation - Authors: Joe Alexandersen Boyan Lazarov Bernd Dammann Abstract: This technical report
More informationParallel Multivariate SpatioTemporal Clustering of. Large Ecological Datasets on Hybrid Supercomputers
Parallel Multivariate SpatioTemporal Clustering of Large Ecological Datasets on Hybrid Supercomputers Sarat Sreepathi1, Jitendra Kumar1, Richard T. Mills2, Forrest M. Hoffman1, Vamsi Sripathi3, William
More information- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline
- Part 4 - Multicore and Manycore Technology: Chances and Challenges Vincent Heuveline 1 Numerical Simulation of Tropical Cyclones Goal oriented adaptivity for tropical cyclones ~10⁴km ~1500km ~100km 2
More informationPiz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess
Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting Thomas C. Schulthess 1 Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Piz Daint Compute node:
More informationAccelerating Model Reduction of Large Linear Systems with Graphics Processors
Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex
More informationarxiv: v1 [hep-lat] 7 Oct 2010
arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA
More informationMinisymposia 9 and 34: Avoiding Communication in Linear Algebra. Jim Demmel UC Berkeley bebop.cs.berkeley.edu
Minisymposia 9 and 34: Avoiding Communication in Linear Algebra Jim Demmel UC Berkeley bebop.cs.berkeley.edu Motivation (1) Increasing parallelism to exploit From Top500 to multicores in your laptop Exponentially
More informationA randomized block sampling approach to the canonical polyadic decomposition of large-scale tensors
A randomized block sampling approach to the canonical polyadic decomposition of large-scale tensors Nico Vervliet Joint work with Lieven De Lathauwer SIAM AN17, July 13, 2017 2 Classification of hazardous
More informationSP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay
SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain
More informationIntroduction to Benchmark Test for Multi-scale Computational Materials Software
Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)
More informationAccelerating incompressible fluid flow simulations on hybrid CPU/GPU systems
Accelerating incompressible fluid flow simulations on hybrid CPU/GPU systems Yushan Wang 1, Marc Baboulin 1,2, Karl Rupp 3,4, Yann Fraigneau 1,5, Olivier Le Maître 1,5 1 Université Paris-Sud, France 2
More information