Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications


Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications. 2016 Aug 23. P. F. Baumeister, T. Hater, D. Pleiter, H. Boettiger, T. Maurer, J. R. Brunheroto.

Contributors
- IBM R&D Lab Böblingen: Thilo Maurer, Hans Boettiger
- IBM T.J. Watson Research Center, NY: José R. Brunheroto
- Jülich Supercomputing Centre (JSC): Thorsten Hater, Dirk Pleiter

Outline
- Introduction: processing in memory
- Active Memory Cube (AMC) design: compute lane architecture; programming the AMC
- Density Functional Theory: finite differences on the AMC; small matrix-matrix multiplications on the AMC; application improvement
- Conclusions, outlook

Why processing in memory?
- Data transport is becoming more expensive, in terms of energy, relative to compute.
- Energy per Flop shrinks with smaller feature sizes, but the energy for data transport hardly decreases.
- A gap opens between memory bandwidth and Flop performance.
- Possible solution: move the processing closer to the memory.

The Active Memory Cube (AMC) design
- IBM design based on the Hybrid Memory Cube (HMC)
- 3D-stacked design with several memory layers and one logic layer
- Thermal design power: 10 W per AMC
[Figure: node layout with host CPU, network, and attached AMCs; HMC picture by IBM]

AMC compute lane architecture
- Register files with 17 KiByte in total per lane: 32 scalar registers and 16 vector registers of 32 entries each, per slice (a lane comprises four slices, s0..s3)
- 64-bit registers (2-way SIMD single-precision instructions possible)
- Read access to the vector registers of the other slices is enabled
- No caches; offload model
- 1.25 GHz; 10 GFlop/s (double precision) and 10 GByte/s per lane; 32 lanes per AMC
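As a cross-check of these headline numbers, a small arithmetic sketch in Python; the 8 flops/cycle figure is an inference, consistent with four slices each issuing a 2-flop fused multiply-add per cycle:

```python
# Peak-performance arithmetic for one AMC, using only the numbers quoted on these slides.
CLOCK_GHZ = 1.25          # lane clock
FLOPS_PER_CYCLE = 8       # inferred: 4 slices x one fused multiply-add (2 flops) per cycle
LANES_PER_AMC = 32
TDP_WATT = 10.0           # thermal design power per AMC

gflops_per_lane = CLOCK_GHZ * FLOPS_PER_CYCLE       # 10 GFlop/s (dp), as on the slide
gflops_per_amc = gflops_per_lane * LANES_PER_AMC    # 320 GFlop/s peak per AMC
gflops_per_watt = gflops_per_amc / TDP_WATT         # ~32 GFlop/s per Watt (cf. the conclusions)

print(gflops_per_lane, gflops_per_amc, gflops_per_watt)
```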

Programming the AMC
- Same address space as the host CPU
- Micro-coded architecture: exposed pipeline, Very Long Instruction Words (VLIW), MIMD paradigm
- One VLIW: BU [#R] {ALU0; LSU0} {ALU1; LSU1} {ALU2; LSU2} {ALU3; LSU3}
- Limited VLIW buffer (512 entries); instruction repeat count up to [32]
- Cycle-accurate simulations using Mambo
- Critical arithmetic intensity: 1 Flop/Byte
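The 1 Flop/Byte balance follows directly from the 10 GFlop/s and 10 GByte/s per lane; a minimal roofline-style check in Python, using the kernel intensities quoted on the later slides:

```python
# Roofline-style check of which kernels can reach peak on an AMC lane
# (10 GFlop/s and 10 GByte/s per lane => critical arithmetic intensity of 1 Flop/Byte).
PEAK_GFLOPS = 10.0
PEAK_GBYTES = 10.0
CRITICAL_AI = PEAK_GFLOPS / PEAK_GBYTES      # 1.0 Flop/Byte

kernels = {                                  # arithmetic intensities quoted on later slides
    "naive 3D 8th-order stencil": 0.34,
    "fdd-vx pass (Txx + V)": 1.1,
    "fdd-y / fdd-z pass": 0.7,
    "z16mm block multiplication": 4.0,
}
for name, ai in kernels.items():
    bound = "compute-bound" if ai >= CRITICAL_AI else "bandwidth-bound"
    attainable = min(PEAK_GFLOPS, ai * PEAK_GBYTES)
    print(f"{name}: AI = {ai} Flop/Byte -> {bound}, <= {attainable} GFlop/s per lane")
```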

Density Functional Theory (DFT)
- Workhorse formalism of solid-state physics
- Kernel: diagonalize or invert the Hamiltonian
  $\hat{H} = \hat{T} + \hat{V} = -\partial_{xx} - \partial_{yy} - \partial_{zz} + V(x,y,z)$
- Material properties accessible by DFT: electronic, magnetic, structural, mechanical, chemical, thermodynamical
- In a real-space representation, the kinetic-energy operator T (the Laplacian) can be constructed as short-ranged, e.g. by finite differences as in the juRS code

High-order finite-difference derivative
- Second derivative by finite differences on a uniform grid with spacing h, 2nd order in 1D:
  $\psi''(i) = \frac{1}{h^2}\left[\,\psi(i-1) - 2\,\psi(i) + \psi(i+1)\,\right]$
- Array halos are necessary
- The accuracy is controllable via the stencil order
[Figure: error of the finite-difference kinetic-energy derivative vs. wave number k·h, for a 4th-order stencil in 3D]
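For illustration only (this is not the AMC kernel), a NumPy sketch of a 1D second derivative with halo points; the weights are the standard 8th-order central-difference coefficients, and the 2nd-order formula above is the three-point special case:

```python
import numpy as np

# 1D second derivative by central finite differences with halo points.
# The 8th-order weights below are the standard central-difference coefficients;
# the 2nd-order case reduces to (psi[i-1] - 2*psi[i] + psi[i+1]) / h^2.
C8 = np.array([-1/560, 8/315, -1/5, 8/5, -205/72, 8/5, -1/5, 8/315, -1/560])

def second_derivative(psi_with_halo: np.ndarray, h: float, weights=C8) -> np.ndarray:
    """psi_with_halo carries nh extra points on each side (the halo)."""
    nh = len(weights) // 2                       # halo width, 4 for 8th order
    n = psi_with_halo.size - 2 * nh
    d2 = np.zeros(n)
    for k, w in enumerate(weights):              # accumulate one shifted copy per weight
        d2 += w * psi_with_halo[k:k + n]
    return d2 / h**2

# quick check on a plane wave, where psi'' = -k^2 psi
h, k = 0.1, 1.3
x = np.arange(-4, 104) * h                       # 100 interior points plus a halo of 4
psi = np.sin(k * x)
err = np.max(np.abs(second_derivative(psi, h) + k**2 * np.sin(k * x[4:-4])))
print("max error:", err)
```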

3D FD-Hamiltonian in 8th order
- 3D Laplacian stencil: data reuse only along the direction of loop traversal → low arithmetic intensity of 0.34 Flop/Byte
- Decompose the action of H = T + V into 3 passes, traversing along x, y, z, respectively:
  fdd-vx: wx[x,y,z,:] := (Txx + V[x,y,z]) w[x,y,z,:]   (1.1 Flop/Byte)
  fdd-y:  wy[x,y,z,:] := wx[x,y,z,:] + Tyy w[x,y,z,:]  (0.7 Flop/Byte)
  fdd-z:  Hw[x,y,z,:] := wy[x,y,z,:] + Tzz w[x,y,z,:]  (0.7 Flop/Byte)
- [:] vectorizes over 32 independent wavefunctions w
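A schematic NumPy version of this three-pass decomposition is sketched below; it assumes periodic wrap-around instead of the halo exchange used in the real code and is meant only to fix the data layout and the order of the passes:

```python
import numpy as np

# Schematic version of the 3-pass decomposition H*w = (Txx + V)*w + Tyy*w + Tzz*w.
# The real AMC kernel works on halo-extended arrays in hand-written microcode;
# here np.roll (periodic wrap-around) stands in for the halo handling.
C8 = np.array([-1/560, 8/315, -1/5, 8/5, -205/72, 8/5, -1/5, 8/315, -1/560])

def t_axis(w, h, axis):
    """Apply -d^2/dx^2 (8th-order finite differences) along one axis of w[x,y,z,:]."""
    out = np.zeros_like(w)
    for offset, c in zip(range(-4, 5), C8):
        out += c * np.roll(w, -offset, axis=axis)
    return -out / h**2

def apply_hamiltonian(w, v, h):
    """w has shape (nx, ny, nz, 32): 32 independent wavefunctions; v has shape (nx, ny, nz)."""
    wx = t_axis(w, h, axis=0) + v[..., None] * w   # fdd-vx: (Txx + V) w
    wy = wx + t_axis(w, h, axis=1)                 # fdd-y:  + Tyy w
    return wy + t_axis(w, h, axis=2)               # fdd-z:  + Tzz w

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16, 16, 32))
v = rng.standard_normal((16, 16, 16))
print(apply_hamiltonian(w, v, h=0.25).shape)
```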

1D finite differences on the AMC
- Parallelization over the four slices with a phase shift
- Manual tuning of the delay between load and use
- Halo regions incur a constant load overhead
- Vectorization over 32 independent wavefunctions w
[Figure: schedule over time for a very short row of only 4 grid points. Each slice s0..s3 loads the source array with a phase shift (s0 ld A[-4], s1 ld A[-3], s2 ld A[-2], s3 ld A[-1], s0 ld A[0], s1 ld A[1], ...), performs its FMAs over 128 cycles, and stores the target array (s0 st T[0], s1 st T[1], s2 st T[2], s3 st T[3]).]
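A toy emulation of this phase-shifted schedule (purely illustrative; the slice-to-index assignment is an assumption read off the figure):

```python
# Schematic emulation of the phase-shifted slice schedule for one 1D FD row
# (an illustration, not the microcode): slice s loads the source elements A[j]
# with j % 4 == s, starting in the left halo, and stores the targets T[i] with
# i % 4 == s; the stencil itself reads values loaded by the other slices,
# mimicking the cross-slice vector-register access of the AMC lane.
C8 = [-1/560, 8/315, -1/5, 8/5, -205/72, 8/5, -1/5, 8/315, -1/560]

def fd_row_by_slices(a, h, n, halo=4):
    """a holds A[-halo .. n+halo-1]; returns T[0..n-1] and a per-slice load/store trace."""
    trace = {s: [] for s in range(4)}
    for j in range(-halo, n + halo):               # round-robin loads with a phase shift
        trace[(j + halo) % 4].append(f"ld A[{j}]")
    target = []
    for i in range(n):                             # FMA chain, one output per slice in turn
        acc = sum(c * a[i + k] for k, c in enumerate(C8))
        target.append(acc / h**2)
        trace[i % 4].append(f"st T[{i}]")
    return target, trace

a = [0.5 * j for j in range(-4, 4 + 4)]            # short row of 4 points plus halo, linear data
t, trace = fd_row_by_slices(a, h=1.0, n=4)
print(t)           # ~0: second derivative of a linear function
print(trace[0])    # ['ld A[-4]', 'ld A[0]', 'ld A[4]', 'st T[0]']
```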

Horizontal microcode for the AMC: manual register allocation

```
L: [ 1] { f1mul vr8, vr3.0, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.1, sr28, vr9 ; } { f1madd vr9, vr0.1, sr27, vr9 ; } { f1madd vr9, vr0.1, sr26, vr9 ; }
   [31] { f1mul(c) vr8, vr3.0, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.1, sr28, vr9 ; } { f1madd(c) vr9, vr0.1, sr27, vr9 ; } { f1madd(c) vr9, vr0.1, sr26, vr9 ; }
   [ 1] { f1madd vr8, vr3.1, sr21, vr8 ; ld8u vr1, sr8, sr5 } { f1mul vr8, vr3.1, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.2, sr28, vr9 ; } { f1madd vr9, vr0.2, sr27, vr9 ; }
   [31] { f1madd(c) vr8, vr3.1, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 } { f1mul(c) vr8, vr3.1, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.2, sr28, vr9 ; } { f1madd(c) vr9, vr0.2, sr27, vr9 ; }
   [ 1] { f1madd vr8, vr3.2, sr22, vr8 ; } { f1madd vr8, vr3.2, sr21, vr8 ; ld8u vr1, sr8, sr5 } { f1mul vr8, vr3.2, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.3, sr28, vr9 ; }
   [31] { f1madd(c) vr8, vr3.2, sr22, vr8 ; } { f1madd(c) vr8, vr3.2, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 } { f1mul(c) vr8, vr3.2, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.3, sr28, vr9 ; }
   ... (the hand-scheduled, software-pipelined listing continues in this pattern for the rest of the kernel)
```

Horizontal microcode for the AMC: code generation using the C preprocessor

```
//       ALU0            LSU0          ALU1            LSU1          ALU2            LSU2          ALU3            LSU3
// begin warm-up phase
[ 1] { ; Ldr(3) } { ; } { ; } { mtspr CTR, ITER ; }
[31] { ; Ldc(3) } { ; } { ; } { ; }
[ 1] { mur(8,0,3,0) ; Ld1(0) } { ; Ldr(3) } { ; } { ; }
[31] { muc(8,0,3,0) ; Ldc(0) } { ; Ldc(3) } { ; } { ; }
[ 1] { Mar(8,1,3,1) ; Ld1(1) } { mur(8,0,3,1) ; Ld1(0) } { ; Ldr(3) } { ; }
[31] { Mac(8,1,3,1) ; Ldc(1) } { muc(8,0,3,1) ; Ldc(0) } { ; Ldc(3) } { ; }
[ 1] { Mar(8,2,3,2) ; } { Mar(8,1,3,2) ; Ld1(1) } { mur(8,0,3,2) ; Ld1(0) } { ; Ldr(3) }
[31] { Mac(8,2,3,2) ; } { Mac(8,1,3,2) ; Ldc(1) } { muc(8,0,3,2) ; Ldc(0) } { ; Ldc(3) }
   ... (the macro-generated, software-pipelined listing continues in this pattern)
```
[Figure: the same phase-shifted load/FMA/store schedule as on the previous slide (128 cycles).]

FD-kernel performance on the AMC
[Figure: kcycles (run cycles, stall cycles, ideal) vs. number of lanes (1 to 32), for 32³ and 16³ grids]
- The effect of the halo regions (4 grid points) is stronger for the 16³ grid than for the 32³ grid (25%); longer rows are better
- Memory bandwidth is a shared resource
- Scales up to 32 lanes at 43% floating-point efficiency, i.e. 137.6 GFlop/s per AMC

Alternative: inversion of large matrices
- DFT based on Green functions: inversion of the Hamiltonian, given as a block-sparse matrix
- Allows for the truncation of long-range interactions → an order-N method (KKRnano)
[Figure: example of the operator structure for an FCC lattice]

Block-sparse matrix-vector multiplication
- Performance-critical for the residual-minimization iterations (91% of the runtime on BG/Q)
- Compressed row storage with index lists for blocks in C^{16×16} (double-precision complex)
- Contraction over 13 non-zero blocks per row requires a fast multiplication of 16×16 blocks
- Arithmetic intensity of one block multiplication: AI = 32 kiFlop / 8 KiByte = 4.0 Flop/Byte
[Figure: block-sparse operator with N_Atom block rows times a block vector; 13 non-zero 16×16 blocks per row]
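A minimal NumPy sketch of such a block-compressed-row multiplication, with invented toy dimensions and index lists (not KKRnano's data structures); each "vector" entry is itself a 16×16 block, which matches the 4.0 Flop/Byte estimate reproduced in the final lines:

```python
import numpy as np

# Block-compressed-row sparse matrix times block vector, with 16x16 complex
# double-precision blocks; illustrative layout only, not the KKRnano data structures.
B = 16
n_rows, blocks_per_row = 8, 13                      # toy sizes; KKRnano has 13 blocks per row

rng = np.random.default_rng(1)
col_index = rng.integers(0, n_rows, size=(n_rows, blocks_per_row))      # index list per block row
blocks = (rng.standard_normal((n_rows, blocks_per_row, B, B))
          + 1j * rng.standard_normal((n_rows, blocks_per_row, B, B)))   # stored non-zero blocks
x = rng.standard_normal((n_rows, B, B)) + 1j * rng.standard_normal((n_rows, B, B))

def bcsr_matvec(blocks, col_index, x):
    """y[i] = sum_k blocks[i,k] @ x[col_index[i,k]] -- a chain of 16x16 block multiplications."""
    y = np.zeros_like(x)
    for i in range(n_rows):
        for k in range(blocks_per_row):
            y[i] += blocks[i, k] @ x[col_index[i, k]]   # on the AMC: the z16mm kernel
    return y

y = bcsr_matvec(blocks, col_index, x)

# Arithmetic intensity of one isolated block multiplication, as on the slide:
flops = 8 * B**3                                    # complex FMA = 8 real flops per element triple
bytes_loaded = 2 * B * B * 16                       # two complex-double 16x16 blocks
print(flops / 1024, "kiFlop /", bytes_loaded / 1024, "KiByte =", flops / bytes_loaded, "Flop/Byte")
```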

z16mm: implementation on the AMC
- Kernel completely unrolled: 384 VLIWs plus overheads, no branching
- Exploits only half of the vector register length (16 of 32 entries)
- All slices perform the same operations on one quarter of the result matrix
- Code generation using simple Python scripts

```
[ 1] {f1madd(c) imc, vr7.2, sr27, imc; ld8u sr27, ldimb, n16b}{}{}{}
...
[16] {f1madd rec, vr3.2, sr27, rec; }{}{}{}
[15] {f1madd imc, vr7.2, sr27, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr4.3, sr28, imc; ld8u sr28, ldimb, n16b}{}{}{}
[ 1] {f1madd(c) imc, vr5.3, sr29, imc; ld8u sr29, ldimb, n16b}{}{}{}
[16] {f1madd rec, vr0.3, sr28, rec; }{}{}{}
[15] {f1madd imc, vr4.3, sr28, imc; }{}{}{}
[16] {f1madd rec, vr1.3, sr29, rec; }{}{}{}
[15] {f1madd imc, vr5.3, sr29, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr6.3, sr30, imc; ld8u sr30, ldimb, n16b}{}{}{}
[16] {f1madd rec, vr2.3, sr30, rec; }{}{}{}
[15] {f1madd imc, vr6.3, sr30, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr7.3, sr31, imc; ld8u sr31, ldimb, n16b}{}{}{}
[ 1] {f1madd(c) imc, vr0.0, sr16, imc; ld8u sr16, ldreb, n16b}{}{}{}
[16] {f1madd rec, vr3.3, sr31, rec; }{}{}{}
[15] {f1madd imc, vr7.3, sr31, imc; }{}{}{}
// Re(C) -= Im(A)*Im(B)
// Im(C) += Re(A)*Im(B)
[16] {f1nmsub rec, vr4.0, sr16, rec; }{}{}{}
[15] {f1madd imc, vr0.0, sr16, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr1.0, sr17, imc; ld8u sr17, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr5.0, sr17, rec; }{}{}{}
[15] {f1madd imc, vr1.0, sr17, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr2.0, sr18, imc; ld8u sr18, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr6.0, sr18, rec; }{}{}{}
[15] {f1madd imc, vr2.0, sr18, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr3.0, sr19, imc; ld8u sr19, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr7.0, sr19, rec; }{}{}{}
[15] {f1madd imc, vr3.0, sr19, imc; }{}{}{}
```
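The generator scripts themselves are not shown on the slides; the following is a small, purely illustrative Python sketch of how such VLIW text could be emitted. The helper names and loop structure are assumptions made for illustration, only the mnemonics follow the excerpt above:

```python
# Minimal sketch of a generator in the spirit of "code generation using simple
# Python scripts". Not the authors' actual scripts: bundle()/accumulate() and the
# loop below are invented; the emitted mnemonics (f1madd, ld8u) follow the excerpt.
def bundle(alu="", lsu=""):
    """Format one {ALU ; LSU} slot of a VLIW."""
    return "{" + alu + ("; " + lsu if lsu else " ;") + "}"

def accumulate(acc, vr, col, sr, load=None):
    """Emit one accumulation step acc += A(vr.col) * B(sr), refilling sr if requested."""
    lines = []
    rep = 16
    if load:  # the first repetition also loads the next B element on the LSU
        lines.append(f"[ 1] {bundle(f'f1madd(c) {acc}, vr{vr}.{col}, sr{sr}, {acc}', load)}{{}}{{}}{{}}")
        rep = 15
    lines.append(f"[{rep}] {bundle(f'f1madd {acc}, vr{vr}.{col}, sr{sr}, {acc}')}{{}}{{}}{{}}")
    return lines

code = []
for vr, col, sr in [(4, 3, 28), (5, 3, 29), (6, 3, 30), (7, 3, 31)]:
    code += accumulate("imc", vr, col, sr, load=f"ld8u sr{sr}, ldimb, n16b")
print("\n".join(code))
```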

z16mm: performance on the AMC
- Weak scaling: one small matrix-matrix multiplication per lane
- 82% floating-point efficiency → 263 GFlop/s per AMC
- About 900 cycles of startup overhead, mostly stall cycles (load latencies)
[Figure: kcycles (run cycles, stall cycles; 4.096 kcycles minimum) vs. number of lanes (1 to 32)]

Effect on KKRnano
- Distribute independent matrix rows to lanes; chain the multiplications of all non-zero blocks in a row, so the accumulator matrix never needs to be spilled → AI = 3.5 Flop/Byte
- Reducing the startup overhead (stack loading, indirection round trips, etc.) → 98% efficiency, 312 GFlop/s per AMC
- A single AMC could speed up KKRnano by 5.5x and reduce energy-to-solution by 5x (assuming a BG/Q CPU with 200 GFlop/s for 100 W)
- For multiple AMCs, other kernels also need to be offloaded to exploit the system
[Figure: runtime split kernel vs. rest — 91% / 9% with the kernel on the CPU; 50% / 50% on 1 AMC; 6% / 94% on 16 AMCs; annotated with the 5.5x and 2.1x speedup factors]
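These fractions are consistent with a simple Amdahl-style estimate; in the sketch below only the 91%/9% split and the 5.5x figure come from the slides, while the ~10x effective kernel speedup on one AMC is an assumed value chosen to reproduce them:

```python
# Amdahl-style estimate behind the KKRnano numbers: the block-sparse kernel takes
# 91% of the runtime on BG/Q. The ~10x effective kernel speedup on one AMC is an
# assumption for illustration; only the 91%/9% split and the 5.5x are from the slides.
kernel_frac, rest_frac = 0.91, 0.09
kernel_speedup_1amc = 10.0                      # assumed effective kernel speedup on one AMC

time_cpu = 1.0
time_1amc = rest_frac + kernel_frac / kernel_speedup_1amc
print("overall speedup, 1 AMC: %.1fx" % (time_cpu / time_1amc))          # ~5.5x
print("kernel share of runtime on 1 AMC: %.0f%%"
      % (100 * (kernel_frac / kernel_speedup_1amc) / time_1amc))         # ~50%

time_16amc = rest_frac + kernel_frac / (16 * kernel_speedup_1amc)
print("kernel share on 16 AMCs: %.0f%%"
      % (100 * (kernel_frac / (16 * kernel_speedup_1amc)) / time_16amc)) # ~6%: the rest dominates
```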

Conclusions and outlook
- The Active Memory Cube is an in-memory processing architecture; the CPU and the AMC lanes share one address space
- Favorable Flop/W performance: ~32 GFlop/s per Watt
- High double-precision floating-point efficiencies for matrix-matrix and stencil operations (and also for other kernels [1])
- Good utilization for density functional theory and similar domains
- Potential target architecture for an OpenMP 4 offload model; needs a smart compiler to generate efficient VLIW code

[1] Accelerating LBM and LQCD Application Kernels by In-Memory Processing, Baumeister, Boettiger, Brunheroto, Hater, Maurer, Nobile, Pleiter, in ISC'15 proceedings.