MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors
J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov
University of Tennessee, Knoxville
05/03/2013

MAGMA: LAPACK for hybrid systems
MAGMA is a new generation of high-performance linear algebra libraries, whose goal is to provide LAPACK/ScaLAPACK functionality on hybrid architectures (http://icl.cs.utk.edu/magma/).
MAGMA MIC 0.3 targets hybrid, shared-memory systems featuring Intel Xeon Phi coprocessors; the one-sided factorizations are included.
Open source software (http://icl.cs.utk.edu/magma).
MAGMA developers & collaborators: UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia). A community effort, similar to LAPACK/ScaLAPACK.

A New Generation of DLA Software
Software and algorithms follow hardware evolution in time:
- LINPACK (70s): vector operations; relies on Level-1 BLAS operations.
- LAPACK (80s): blocking, cache friendly; relies on Level-3 BLAS operations.
- ScaLAPACK (90s): distributed memory; relies on PBLAS and message passing.
- PLASMA (00s): new algorithms, many-core friendly; relies on a DAG/scheduler, block data layout, and some extra kernels.
- MAGMA: hybrid algorithms, heterogeneity friendly; relies on a hybrid scheduler and hybrid kernels.

MAGMA Software Stack
[Diagram: the stack spans CPU, hybrid, and coprocessor columns, from single node to multi-device to distributed.]
- distributed: hybrid LAPACK/ScaLAPACK & tile algorithms over StarPU / DAGuE
- multi: MAGMA 1.3 hybrid tile (PLASMA) algorithms, with PLASMA / Quark and the StarPU run-time system
- single: MAGMA 1.3 hybrid LAPACK and tile kernels, MAGMA MIC 0.3, MAGMA SPARSE, and MAGMA BLAS, on top of LAPACK and BLAS for CUDA / OpenCL / MIC
Support: Linux, Windows, Mac OS X; C/C++, Fortran; Matlab, Python

MAGMA Functionality
- 80+ hybrid algorithms have been developed (320+ routines in total).
- Every algorithm is provided in 4 precisions (s/c/d/z).
- There are 3 mixed-precision algorithms (zc & ds).
- These are hybrid algorithms, expressed in terms of BLAS.
- MAGMA MIC provides support for Intel Xeon Phi coprocessors.

Methodology overview
A methodology to use all available resources: MAGMA MIC uses a hybridization methodology based on
- representing linear algebra algorithms as collections of tasks and data dependencies among them, and
- properly scheduling the tasks' execution over multicore CPUs and manycore coprocessors.
Successfully applied to fundamental linear algebra algorithms: one- and two-sided factorizations and solvers, and iterative linear solvers and eigensolvers.
Productivity: 1) high level; 2) leveraging prior developments; 3) exceeding the performance of homogeneous solutions.
Hybrid CPU+MIC algorithms (small tasks for the multicore CPUs and large tasks for the MICs).

Hybrid Algorithms: One-Sided Factorizations (LU, QR, and Cholesky)
Hybridization:
- Panels (Level 2 BLAS) are factored on the CPU using LAPACK.
- Trailing matrix updates (Level 3 BLAS) are done on the MIC, using look-ahead.
[Figure: matrix partitioned into panel i, the next panel i+1, and trailing submatrix i.]
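
To make the ordering concrete, here is a minimal, host-only sketch of the look-ahead pattern described above. The three task functions are illustrative stubs that only print what they stand for (they are not MAGMA MIC routines); on real hardware the two MIC updates are asynchronous, so the large trailing update overlaps with the CPU factoring the next panel.

/* Hedged sketch of the look-ahead ordering in a hybrid one-sided factorization.
 * The stubs below only print their role; no factorization is performed. */
#include <stdio.h>

static void factor_panel_cpu(int k)      { printf("CPU : factor panel %d\n", k); }
static void update_next_panel_mic(int k) { printf("MIC : update panel %d (look-ahead)\n", k + 1); }
static void update_trailing_mic(int k)   { printf("MIC : update trailing matrix after panel %d\n", k); }

int main(void)
{
    const int npanels = 4;                 /* illustrative number of panels */
    for (int k = 0; k < npanels; k++) {
        factor_panel_cpu(k);               /* Level 2 BLAS work, on the CPU */
        if (k + 1 < npanels) {
            update_next_panel_mic(k);      /* small, urgent update: unblocks the next panel      */
            update_trailing_mic(k);        /* large Level 3 BLAS update; asynchronous in MAGMA,  */
                                           /* so it overlaps with factoring panel k+1 on the CPU */
        }
    }
    return 0;
}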

Programming LA on Hybrid Systems
- Algorithms expressed in terms of BLAS: use vendor-optimized BLAS.
- Algorithms expressed in terms of tasks: use a scheduling/run-time system.

Intel Xeon Phi specific considerations
Intel Xeon Phi coprocessors (vs. GPUs) are less dependent on the host [one can log in to the coprocessor, develop, and run programs in native mode].
There is no high-level API similar to CUDA/OpenCL facilitating the Intel Xeon Phi's use from the host.* There is a pragma API, but it may be too high-level for high-performance numerical libraries.
We used the Intel Xeon Phi's Low Level API (LLAPI) to develop the MAGMA API [it allows us to uniformly handle hybrid systems].
* OpenCL 1.2 support is available for the Intel Xeon Phi as of Intel SDK XE 2013 Beta.

MAGMA MIC programming model
[Diagram: the host runs main(), the Intel Xeon Phi runs server(); they communicate via LLAPI over PCIe.]
- The Intel Xeon Phi acts as a coprocessor.
- On the Intel Xeon Phi, MAGMA runs a server.
- Communications are implemented using LLAPI.

A Hybrid Algorithm Example
Left-looking hybrid Cholesky factorization in MAGMA. The difference with LAPACK is the 4 additional lines (shown in red in the slide's listing, which is not reproduced in this transcription). Line 8 of that listing (done on the CPU) is overlapped with work on the MIC (from line 6).
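
Since the slide's listing is not reproduced, the following is a hedged, host-only illustration of a left-looking blocked Cholesky written with standard CBLAS/LAPACKE calls; the comments indicate which step MAGMA MIC would run on the CPU vs. the coprocessor. It is a sketch of the pattern only, not the MAGMA code, and it omits the data copies and queue synchronization that make up the 4 additional lines.

/* Hedged sketch: left-looking blocked Cholesky (lower triangular, column-major).
 * Comments mark the CPU/MIC split used by the hybrid algorithm; this version
 * itself runs entirely on the host. */
#include <cblas.h>
#include <lapacke.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Factor the lower triangle of the n-by-n matrix A (column-major, lda >= n). */
int chol_left_looking(double *A, int n, int lda, int nb)
{
    for (int j = 0; j < n; j += nb) {
        int jb = min_int(nb, n - j);

        /* Update the diagonal block with already-factored columns (MIC: dsyrk). */
        cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, jb, j,
                    -1.0, &A[j], lda, 1.0, &A[j + (size_t)j * lda], lda);

        /* Update the block column below the diagonal (MIC: dgemm, asynchronous). */
        if (j + jb < n)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        n - j - jb, jb, j, -1.0, &A[j + jb], lda, &A[j], lda,
                        1.0, &A[(j + jb) + (size_t)j * lda], lda);

        /* Factor the diagonal block (CPU: dpotrf). In the hybrid code the block
         * is first copied to the host, and this CPU factorization overlaps with
         * the dgemm still running on the coprocessor. */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb,
                                  &A[j + (size_t)j * lda], lda);
        if (info != 0) return info;

        /* Triangular solve below the diagonal block (MIC: dtrsm, after the
         * factored block has been copied back to the coprocessor). */
        if (j + jb < n)
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, n - j - jb, jb, 1.0,
                        &A[j + (size_t)j * lda], lda,
                        &A[(j + jb) + (size_t)j * lda], lda);
    }
    return 0;
}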

MAGMA MIC programming model
Intel Xeon Phi interface: the host program sends asynchronous requests to the MIC; the requests are queued and executed on the MIC.
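
Below is a minimal, self-contained sketch of this request/queue idea (the host enqueues requests, the server executes them in order). All names here (request_t, enqueue, server_drain, the REQ_* kinds) are hypothetical illustrations rather than the LLAPI or the MAGMA MIC interface, and no actual PCIe transport is involved.

/* Hedged sketch of the queued-request model: the "host side" pushes request
 * descriptors, the "server side" (which on a real system runs on the Xeon Phi)
 * pops and executes them in order. */
#include <stdio.h>

typedef enum { REQ_SEND, REQ_DGEMM, REQ_RECV } req_kind_t;
typedef struct { req_kind_t kind; int tag; } request_t;

#define QUEUE_CAP 16
static request_t queue[QUEUE_CAP];
static int head = 0, tail = 0;

static void enqueue(req_kind_t kind, int tag)   /* host side: asynchronous submit */
{
    queue[tail % QUEUE_CAP] = (request_t){ kind, tag };
    tail++;
}

static void server_drain(void)                  /* server side: execute queued requests */
{
    while (head < tail) {
        request_t r = queue[head++ % QUEUE_CAP];
        printf("server: executing request %d (kind %d)\n", r.tag, (int)r.kind);
        /* a real server would dispatch to data transfers or BLAS kernels here */
    }
}

int main(void)
{
    enqueue(REQ_SEND,  1);   /* stage an input matrix on the coprocessor   */
    enqueue(REQ_DGEMM, 2);   /* run a compute kernel                       */
    enqueue(REQ_RECV,  3);   /* bring the result back to the host          */
    server_drain();          /* on real hardware this runs on the Xeon Phi */
    return 0;
}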

MAGMA MIC Performance (QR)
[Plot: MAGMA_DGEQRF vs. CPU_DGEQRF, double precision; performance in GFlop/s (axis 0-900) vs. matrix size N x N (up to 25000).]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Performance (Cholesky)
[Plot: MAGMA_DPOTRF vs. CPU_DPOTRF, double precision; performance in GFlop/s (axis 0-900) vs. matrix size N x N (up to 25000).]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Performance (LU)
[Plot: MAGMA_DGETRF vs. CPU_DGETRF, double precision; performance in GFlop/s (axis 0-900) vs. matrix size N x N (up to 25000).]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Performance
[Plot: performance in GFlop/s (axis 0-900) vs. matrix size (up to 30000) for several MAGMA MIC routines; the legend labels did not survive the transcription.]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Two-sided Factorizations
Panels are also hybrid, using both the CPU and the MIC (vs. just the CPU as in the one-sided factorizations). They need a fast Level 2 BLAS, so they use the MIC's high bandwidth.
[Diagram: a CPU work array and device arrays (dwork, Y, dY, dV); per-column data movement: 1. copy dP_i to the CPU, 2. copy v_j to the device, 3. copy y_j to the CPU, 4. copy A_j to the CPU. The device is labeled GPU in the original figure.]
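
As a small illustration of the Level 2 BLAS kernel that motivates using the coprocessor's bandwidth, here is a host-only CBLAS call computing y = A^T v, the matrix-vector product at the core of the two-sided panel; the dimensions are arbitrary and the call merely stands in for the operation MAGMA MIC performs on the Xeon Phi.

/* Hedged illustration: the memory-bound dgemv (y = A^T * v) that the hybrid
 * two-sided panel offloads to the coprocessor. Host-only, sizes are arbitrary. */
#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int m = 1000, n = 800;                       /* illustrative trailing-matrix size */
    double *A = malloc((size_t)m * n * sizeof *A);     /* column-major m x n matrix         */
    double *v = malloc((size_t)m * sizeof *v);         /* Householder vector                */
    double *y = malloc((size_t)n * sizeof *y);         /* result y = A^T v                  */
    if (!A || !v || !y) return 1;

    for (size_t i = 0; i < (size_t)m * n; i++) A[i] = 1.0 / (double)(i + 1);
    for (int i = 0; i < m; i++) v[i] = 1.0;

    /* Memory-bound matrix-vector product: this is the work that benefits
     * from the Xeon Phi's high bandwidth in the hybrid panel factorization. */
    cblas_dgemv(CblasColMajor, CblasTrans, m, n, 1.0, A, m, v, 1, 0.0, y, 1);

    printf("y[0] = %f\n", y[0]);
    free(A); free(v); free(y);
    return 0;
}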

MAGMA MIC Performance (Hessenberg)
[Plot: MAGMA_DGEHRD vs. CPU_DGEHRD, double precision; performance in GFlop/s (axis 0-160) vs. matrix size N x N (up to 25000).]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

From Single to Multi-MIC Support
Data distribution: 1-D block-cyclic distribution with block size nb [diagram: block columns assigned to MIC 0, MIC 1, MIC 2, MIC 0, ...].
Algorithm:
- The MIC holding the current panel sends it to the CPU.
- All updates are done in parallel on the MICs.
- Look-ahead is done with the MIC holding the next panel.
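
A minimal sketch of the 1-D block-cyclic assignment just described; panel_owner, nb, and num_mics are illustrative names and values, not part of the MAGMA MIC API.

/* Hedged sketch: 1-D block-cyclic ownership of block columns over the MICs. */
#include <stdio.h>

static int panel_owner(int col, int nb, int num_mics)
{
    return (col / nb) % num_mics;   /* block column index, distributed cyclically */
}

int main(void)
{
    const int nb = 256, num_mics = 3, n = 2048;   /* illustrative values */
    for (int col = 0; col < n; col += nb)
        printf("panel starting at column %4d -> MIC %d\n",
               col, panel_owner(col, nb, num_mics));
    return 0;
}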

MAGMA MIC Scalability: LU Factorization Performance in DP
[Four build-up slides of the same plot: MAGMA DGETRF performance with multiple cards, in GFlop/s vs. matrix size N x N (up to 40000), adding curves for 1, 2, 3, and 4 MICs one slide at a time.]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Scalability: QR Factorization Performance in DP
[Plot: MAGMA DGEQRF performance with multiple cards, in GFlop/s (axis 0-2200) vs. matrix size N x N (up to 40000), with curves for 1, 2, 3, and 4 MICs.]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

MAGMA MIC Scalability: Cholesky Factorization Performance in DP
[Plot: MAGMA DPOTRF performance with multiple cards, in GFlop/s (axis 0-2200) vs. matrix size N x N (up to 40000), with curves for 1, 2, 3, and 4 MICs.]
Host: Sandy Bridge (2 x 8 cores @ 2.6 GHz), DP peak 332 GFlop/s. Coprocessor: Intel Xeon Phi (60 cores @ 1.09 GHz), DP peak 1046 GFlop/s. System DP peak 1378 GFlop/s. MPSS 2.1.4346-16, compiler_xe_2013.1.117.

Current and Future MIC-specific directions
- Explore the use of tasks of lower granularity on the MIC [GPU tasks in general are large and data parallel].
- Reduce synchronizations [fewer fork-join synchronizations].
- Schedule less parallel tasks on the MIC [on GPUs, these tasks are typically offloaded to the CPUs] [to reduce CPU-MIC communications].
- Develop more algorithms, porting the newest developments in LA, while discovering further MIC-specific optimizations.

Current and Future MIC-specific directions
Synchronization-avoiding algorithms using dynamic runtime systems.
[Execution trace: 48 cores running POTRF, TRTRI, and LAUUM; the matrix is 4000 x 4000, the tile size is 200 x 200.]

Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution.

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

Current and Future MIC-specific directions
High productivity with dynamic runtime systems: from sequential nested-loop code to parallel execution. The same loop nest, with each call replaced by a task insertion into the runtime system:

for (k = 0; k < min(MT, NT); k++) {
    Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}

Contact
Jack Dongarra, dongarra@eecs.utk.edu
Innovative Computing Laboratory, University of Tennessee, Knoxville
MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma
Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver; INRIA, France (StarPU team); KAUST, Saudi Arabia