Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications

1 Exploiting In-Memory Processing Capabilities for Density Functional Theory Applications 2016 Aug 23 P. F. Baumeister, T. Hater, D. Pleiter, H. Boettiger, T. Maurer, J. R. Brunheroto

2 Contributors: IBM R&D Lab Böblingen: Thilo Maurer, Hans Boettiger. IBM T.J. Watson Research Center, NY: José R. Brunheroto. Jülich Supercomputing Centre (JSC): Thorsten Hater, Dirk Pleiter. 2016 Aug 23 Paul F Baumeister 2

3 Outline: Introduction: processing in memory. The Active Memory Cube (AMC): design, compute lane architecture, programming the AMC. Density Functional Theory: finite differences on the AMC, small matrix-matrix multiplications on the AMC, application improvement. Conclusions, outlook. 2016 Aug 23 Paul F Baumeister 3

4 Why processing in memory: Data transport becomes more expensive in terms of energy compared to compute. The energy per Flop shrinks with smaller feature sizes, while the energy for data transport hardly decreases, so a gap opens between bandwidth and Flop performance. Possible solution: move the processing closer to the memory. 2016 Aug 23 Paul F Baumeister 4

5 The Active Memory Cube design: IBM design based on the Hybrid Memory Cube (HMC). 3D design with several memory layers and one logic layer. Thermal design power: 10 W per AMC. [Figure: node layout with host CPU, network and AMCs; HMC picture by IBM] 2016 Aug 23 Paul F Baumeister 5

6 AMC compute lane architecture: Register files with 17 KiB total per lane: 32 scalar registers and 16 vector registers of 32 entries each per slice; 64-bit registers (2-way SIMD single-precision instructions possible). Read access to the vector registers of the other slices is enabled. No caches; offload model. 1.25 GHz; 10 GFlop/s (dp) and 10 GByte/s per lane; 32 lanes per AMC. 2016 Aug 23 Paul F Baumeister 6

7 Programming the AMC: Same address space as the CPU. Micro-coded architecture with an exposed pipeline; Very Long Instruction Words (VLIW); MIMD paradigm. One VLIW: BU [#R] {ALU0; LSU0} {ALU1; LSU1} {ALU2; LSU2} {ALU3; LSU3}. Limited VLIW buffer (size 512); instruction repeat count up to 32. Cycle-accurate simulations using Mambo. With 10 GFlop/s against 10 GByte/s per lane, the critical arithmetic intensity is 1 Flop/Byte. 2016 Aug 23 Paul F Baumeister 7
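Taking the per-lane peaks from slide 6 together with the arithmetic intensities quoted on the later slides gives a simple roofline picture. The following Python sketch is illustrative only; the names and the kernel list are not part of the original slides:

# Per-lane peaks from slide 6: 10 GFlop/s (dp) against 10 GByte/s.
PEAK_FLOPS, PEAK_BW = 10.0, 10.0

def attainable(ai):
    """Attainable GFlop/s for a kernel of arithmetic intensity ai (Flop/Byte)."""
    return min(PEAK_FLOPS, PEAK_BW * ai)

for name, ai in [("naive 3D Laplacian", 0.34), ("fdd-vx pass", 1.1),
                 ("fdd-y / fdd-z passes", 0.7), ("16x16 block multiply", 4.0)]:
    print(f"{name:22s} AI = {ai:4.2f} -> {attainable(ai):5.2f} GFlop/s per lane")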

8 Density Functional Theory: Workhorse formalism for solid-state physics. Kernel: diagonalize or invert the Hamiltonian Ĥ = T̂ + V̂_zz + V(x,y,z). Material properties accessible by DFT: electronic, magnetic, structural, mechanical, chemical, thermodynamic. In a real-space representation, the kinetic energy operator T̂ (Laplacian) can be constructed as short-ranged, e.g. by finite differences as in juRS. 2016 Aug 23 Paul F Baumeister 8

9 High-order finite-difference derivative: Second derivative in finite differences with uniform grid spacing h: ψ''(i) = [ψ(i−1) − 2 ψ(i) + ψ(i+1)] / h², 2nd order in 1D. Array halos are necessary. The accuracy is controllable via the stencil order. [Figure: error of the kinetic energy derivative vs. wave number k·h; 4th-order stencil in 3D] 2016 Aug 23 Paul F Baumeister 9
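For reference, the 2nd-order stencil above is easy to state in code. A minimal NumPy illustration (the function name and the test are illustrative; this is not the AMC kernel):

import numpy as np

def second_derivative(psi, h):
    """2nd-order FD second derivative; psi carries one halo point per side."""
    return (psi[:-2] - 2.0 * psi[1:-1] + psi[2:]) / h**2

h = 0.01
x = np.arange(-1.0, 1.0 + h, h)        # grid including one halo point per side
d2 = second_derivative(np.sin(x), h)   # interior points only
print(np.max(np.abs(d2 + np.sin(x[1:-1]))))  # sin'' = -sin, error shrinks as h^2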

10 3D FD-Hamiltonian in 8th order: The 3D Laplacian stencil allows data reuse only along the direction of loop traversal, giving a low arithmetic intensity of 0.34 Flop/Byte. Decompose the action of H = T + V into 3 passes, traversing along x, y, z, respectively:
fdd-vx: wx[x,y,z,:] := (Txx + V[x,y,z]) w[x,y,z,:] (1.1 Flop/Byte)
fdd-y: wy[x,y,z,:] := wx[x,y,z,:] + Tyy w[x,y,z,:] (0.7 Flop/Byte)
fdd-z: Hw[x,y,z,:] := wy[x,y,z,:] + Tzz w[x,y,z,:] (0.7 Flop/Byte)
[:] = vectorize over 32 independent w's. 2016 Aug 23 Paul F Baumeister 10
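A minimal NumPy sketch of this three-pass structure, assuming periodic boundaries via np.roll in place of the halo exchange and omitting the sign and prefactor of the kinetic energy operator:

import numpy as np

# Standard 8th-order central coefficients for the second derivative (h = 1).
c = np.array([-205/72, 8/5, -1/5, 8/315, -1/560])

def apply_H(w, c, V):
    """Apply H = Txx + Tyy + Tzz + V in three 1D passes.
    w: (nx, ny, nz, nvec) with nvec independent wave functions."""
    Hw = V[..., None] * w                  # local potential (fused into the x pass on the AMC)
    for axis in range(3):                  # traverse along x, y, z respectively
        Hw += c[0] * w
        for j in range(1, 5):              # 8th order: 4 neighbours on each side
            Hw += c[j] * (np.roll(w, j, axis) + np.roll(w, -j, axis))
    return Hw

Hw = apply_H(np.random.rand(8, 8, 8, 32), c, np.zeros((8, 8, 8)))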

11 1D finite-differences on the AMC: Parallelization over the slices with a phase shift. Manual tuning of the delay between load and use. Halo regions incur a constant load overhead. Vectorization over 32 independent w's. [Figure: schedule over time of loads from the source array, FMAs, and stores to the target array across slices 0-3, for a very short example row with only 4 grid points] 2016 Aug 23 Paul F Baumeister 11

12 Horizontal microcode for the AMC — manual register allocation:
L:
[ 1] { f1mul vr8, vr3.0, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.1, sr28, vr9 ; } { f1madd vr9, vr0.1, sr27, vr9 ; } { f1madd vr9, vr0.1, sr26, vr9 ; }
[31] { f1mul(c) vr8, vr3.0, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.1, sr28, vr9 ; } { f1madd(c) vr9, vr0.1, sr27, vr9 ; } { f1madd(c) vr9, vr0.1, sr26, vr9 ; }
[ 1] { f1madd vr8, vr3.1, sr21, vr8 ; ld8u vr1, sr8, sr5 } { f1mul vr8, vr3.1, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.2, sr28, vr9 ; } { f1madd vr9, vr0.2, sr27, vr9 ; }
[31] { f1madd(c) vr8, vr3.1, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 } { f1mul(c) vr8, vr3.1, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.2, sr28, vr9 ; } { f1madd(c) vr9, vr0.2, sr27, vr9 ; }
[ 1] { f1madd vr8, vr3.2, sr22, vr8 ; } { f1madd vr8, vr3.2, sr21, vr8 ; ld8u vr1, sr8, sr5 } { f1mul vr8, vr3.2, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr0.3, sr28, vr9 ; }
[31] { f1madd(c) vr8, vr3.2, sr22, vr8 ; } { f1madd(c) vr8, vr3.2, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 } { f1mul(c) vr8, vr3.2, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr0.3, sr28, vr9 ; }
[ 1] { f1madd vr8, vr3.3, sr23, vr8 ; } { f1madd vr8, vr3.3, sr22, vr8 ; } { f1madd vr8, vr3.3, sr21, vr8 ; ld8u vr1, sr8, sr5 } { f1mul vr8, vr3.3, sr20 ; st8u vr9, sr9, sr5 }
[31] { f1madd(c) vr8, vr3.3, sr23, vr8 ; } { f1madd(c) vr8, vr3.3, sr22, vr8 ; } { f1madd(c) vr8, vr3.3, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 } { f1mul(c) vr8, vr3.3, sr20 ; st8u(c) vr9, <sr9, sr3 }
[ 1] { f1madd vr8, vr0.0, sr24, vr8 ; } { f1madd vr8, vr0.0, sr23, vr8 ; } { f1madd vr8, vr0.0, sr22, vr8 ; } { f1madd vr8, vr0.0, sr21, vr8 ; ld8u vr1, sr8, sr5 }
[31] { f1madd(c) vr8, vr0.0, sr24, vr8 ; } { f1madd(c) vr8, vr0.0, sr23, vr8 ; } { f1madd(c) vr8, vr0.0, sr22, vr8 ; } { f1madd(c) vr8, vr0.0, sr21, vr8 ; ld8u(c) vr1, <sr8, sr3 }
[32] { f1madd vr8, vr0.1, sr25, vr8 ; } { f1madd vr8, vr0.1, sr24, vr8 ; } { f1madd vr8, vr0.1, sr23, vr8 ; } { f1madd vr8, vr0.1, sr22, vr8 ; }
[32] { f1madd vr8, vr0.2, sr26, vr8 ; } { f1madd vr8, vr0.2, sr25, vr8 ; } { f1madd vr8, vr0.2, sr24, vr8 ; } { f1madd vr8, vr0.2, sr23, vr8 ; }
[32] { f1madd vr8, vr0.3, sr27, vr8 ; } { f1madd vr8, vr0.3, sr26, vr8 ; } { f1madd vr8, vr0.3, sr25, vr8 ; } { f1madd vr8, vr0.3, sr24, vr8 ; }
[32] { f1madd vr8, vr1.0, sr28, vr8 ; } { f1madd vr8, vr1.0, sr27, vr8 ; } { f1madd vr8, vr1.0, sr26, vr8 ; } { f1madd vr8, vr1.0, sr25, vr8 ; }
[ 1] { f1mul vr9, vr0.0, sr20 ; st8u vr8, sr9, sr5 } { f1madd vr8, vr1.1, sr28, vr8 ; } { f1madd vr8, vr1.1, sr27, vr8 ; } { f1madd vr8, vr1.1, sr26, vr8 ; }
[31] { f1mul(c) vr9, vr0.0, sr20 ; st8u(c) vr8, <sr9, sr3 } { f1madd(c) vr8, vr1.1, sr28, vr8 ; } { f1madd(c) vr8, vr1.1, sr27, vr8 ; } { f1madd(c) vr8, vr1.1, sr26, vr8 ; }
[ 1] { f1madd vr9, vr0.1, sr21, vr9 ; ld8u vr2, sr8, sr5 } { f1mul vr9, vr0.1, sr20 ; st8u vr8, sr9, sr5 } { f1madd vr8, vr1.2, sr28, vr8 ; } { f1madd vr8, vr1.2, sr27, vr8 ; }
[31] { f1madd(c) vr9, vr0.1, sr21, vr9 ; ld8u(c) vr2, <sr8, sr3 } { f1mul(c) vr9, vr0.1, sr20 ; st8u(c) vr8, <sr9, sr3 } { f1madd(c) vr8, vr1.2, sr28, vr8 ; } { f1madd(c) vr8, vr1.2, sr27, vr8 ; }
[ 1] { f1madd vr9, vr0.2, sr22, vr9 ; } { f1madd vr9, vr0.2, sr21, vr9 ; ld8u vr2, sr8, sr5 } { f1mul vr9, vr0.2, sr20 ; st8u vr8, sr9, sr5 } { f1madd vr8, vr1.3, sr28, vr8 ; }
[31] { f1madd(c) vr9, vr0.2, sr22, vr9 ; } { f1madd(c) vr9, vr0.2, sr21, vr9 ; ld8u(c) vr2, <sr8, sr3 } { f1mul(c) vr9, vr0.2, sr20 ; st8u(c) vr8, <sr9, sr3 } { f1madd(c) vr8, vr1.3, sr28, vr8 ; }
[ 1] { f1madd vr9, vr0.3, sr23, vr9 ; } { f1madd vr9, vr0.3, sr22, vr9 ; } { f1madd vr9, vr0.3, sr21, vr9 ; ld8u vr2, sr8, sr5 } { f1mul vr9, vr0.3, sr20 ; st8u vr8, sr9, sr5 }
[31] { f1madd(c) vr9, vr0.3, sr23, vr9 ; } { f1madd(c) vr9, vr0.3, sr22, vr9 ; } { f1madd(c) vr9, vr0.3, sr21, vr9 ; ld8u(c) vr2, <sr8, sr3 } { f1mul(c) vr9, vr0.3, sr20 ; st8u(c) vr8, <sr9, sr3 }
[ 1] { f1madd vr9, vr1.0, sr24, vr9 ; } { f1madd vr9, vr1.0, sr23, vr9 ; } { f1madd vr9, vr1.0, sr22, vr9 ; } { f1madd vr9, vr1.0, sr21, vr9 ; ld8u vr2, sr8, sr5 }
[31] { f1madd(c) vr9, vr1.0, sr24, vr9 ; } { f1madd(c) vr9, vr1.0, sr23, vr9 ; } { f1madd(c) vr9, vr1.0, sr22, vr9 ; } { f1madd(c) vr9, vr1.0, sr21, vr9 ; ld8u(c) vr2, <sr8, sr3 }
[32] { f1madd vr9, vr1.1, sr25, vr9 ; } { f1madd vr9, vr1.1, sr24, vr9 ; } { f1madd vr9, vr1.1, sr23, vr9 ; } { f1madd vr9, vr1.1, sr22, vr9 ; }
[32] { f1madd vr9, vr1.2, sr26, vr9 ; } { f1madd vr9, vr1.2, sr25, vr9 ; } { f1madd vr9, vr1.2, sr24, vr9 ; } { f1madd vr9, vr1.2, sr23, vr9 ; }
[32] { f1madd vr9, vr1.3, sr27, vr9 ; } { f1madd vr9, vr1.3, sr26, vr9 ; } { f1madd vr9, vr1.3, sr25, vr9 ; } { f1madd vr9, vr1.3, sr24, vr9 ; }
[32] { f1madd vr9, vr2.0, sr28, vr9 ; } { f1madd vr9, vr2.0, sr27, vr9 ; } { f1madd vr9, vr2.0, sr26, vr9 ; } { f1madd vr9, vr2.0, sr25, vr9 ; }
[ 1] { f1mul vr8, vr1.0, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr2.1, sr28, vr9 ; } { f1madd vr9, vr2.1, sr27, vr9 ; } { f1madd vr9, vr2.1, sr26, vr9 ; }
[31] { f1mul(c) vr8, vr1.0, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr2.1, sr28, vr9 ; } { f1madd(c) vr9, vr2.1, sr27, vr9 ; } { f1madd(c) vr9, vr2.1, sr26, vr9 ; }
[ 1] { f1madd vr8, vr1.1, sr21, vr8 ; ld8u vr3, sr8, sr5 } { f1mul vr8, vr1.1, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr2.2, sr28, vr9 ; } { f1madd vr9, vr2.2, sr27, vr9 ; }
[31] { f1madd(c) vr8, vr1.1, sr21, vr8 ; ld8u(c) vr3, <sr8, sr3 } { f1mul(c) vr8, vr1.1, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr2.2, sr28, vr9 ; } { f1madd(c) vr9, vr2.2, sr27, vr9 ; }
[ 1] { f1madd vr8, vr1.2, sr22, vr8 ; } { f1madd vr8, vr1.2, sr21, vr8 ; ld8u vr3, sr8, sr5 } { f1mul vr8, vr1.2, sr20 ; st8u vr9, sr9, sr5 } { f1madd vr9, vr2.3, sr28, vr9 ; }
[31] { f1madd(c) vr8, vr1.2, sr22, vr8 ; } { f1madd(c) vr8, vr1.2, sr21, vr8 ; ld8u(c) vr3, <sr8, sr3 } { f1mul(c) vr8, vr1.2, sr20 ; st8u(c) vr9, <sr9, sr3 } { f1madd(c) vr9, vr2.3, sr28, vr9 ; }
[ 1] { f1madd vr8, vr1.3, sr23, vr8 ; } { f1madd vr8, vr1.3, sr22, vr8 ; } { f1madd vr8, vr1.3, sr21, vr8 ; ld8u vr3, sr8, sr5 } { f1mul vr8, vr1.3, sr20 ; st8u vr9, sr9, sr5 }
[31] { f1madd(c) vr8, vr1.3, sr23, vr8 ; } { f1madd(c) vr8, vr1.3, sr22, vr8 ; } { f1madd(c) vr8, vr1.3, sr21, vr8 ; ld8u(c) vr3, <sr8, sr3 } { f1mul(c) vr8, vr1.3, sr20 ; st8u(c) vr9, <sr9, sr3 }
[ 1] { f1madd vr8, vr2.0, sr24, vr8 ; } { f1madd vr8, vr2.0, sr23, vr8 ; } { f1madd vr8, vr2.0, sr22, vr8 ; } { f1madd vr8, vr2.0, sr21, vr8 ; ld8u vr3, sr8, sr5 }
2016 Aug 23 Paul F Baumeister 12

13 Horizontal microcode for the AMC — code generation using the C preprocessor:
ALU0 LSU0 ALU1 LSU1 ALU2 LSU2 ALU3 LSU3
// begin warm-up phase
[ 1] { ; Ldr(3) } { ; } { ; } { mtspr CTR, ITER ; }
[31] { ; Ldc(3) } { ; } { ; } { ; }
[ 1] { mur(8,0,3,0) ; Ld1(0) } { ; Ldr(3) } { ; } { ; }
[31] { muc(8,0,3,0) ; Ldc(0) } { ; Ldc(3) } { ; } { ; }
[ 1] { Mar(8,1,3,1) ; Ld1(1) } { mur(8,0,3,1) ; Ld1(0) } { ; Ldr(3) } { ; }
[31] { Mac(8,1,3,1) ; Ldc(1) } { muc(8,0,3,1) ; Ldc(0) } { ; Ldc(3) } { ; }
[ 1] { Mar(8,2,3,2) ; } { Mar(8,1,3,2) ; Ld1(1) } { mur(8,0,3,2) ; Ld1(0) } { ; Ldr(3) }
[31] { Mac(8,2,3,2) ; } { Mac(8,1,3,2) ; Ldc(1) } { muc(8,0,3,2) ; Ldc(0) } { ; Ldc(3) }
[ 1] { Mar(8,3,3,3) ; } { Mar(8,2,3,3) ; } { Mar(8,1,3,3) ; Ld1(1) } { mur(8,0,3,3) ; Ld1(0) }
[31] { Mac(8,3,3,3) ; } { Mac(8,2,3,3) ; } { Mac(8,1,3,3) ; Ldc(1) } { muc(8,0,3,3) ; Ldc(0) }
[ 1] { Mar(8,4,0,0) ; } { Mar(8,3,0,0) ; } { Mar(8,2,0,0) ; } { Mar(8,1,0,0) ; Ld1(1) }
[31] { Mac(8,4,0,0) ; } { Mac(8,3,0,0) ; } { Mac(8,2,0,0) ; } { Mac(8,1,0,0) ; Ldc(1) }
[32] { Mar(8,5,0,1) ; } { Mar(8,4,0,1) ; } { Mar(8,3,0,1) ; } { Mar(8,2,0,1) ; }
[32] { Mar(8,6,0,2) ; } { Mar(8,5,0,2) ; } { Mar(8,4,0,2) ; } { Mar(8,3,0,2) ; }
[32] { Mar(8,7,0,3) ; } { Mar(8,6,0,3) ; } { Mar(8,5,0,3) ; } { Mar(8,4,0,3) ; }
[32] { Mar(8,8,1,0) ; } { Mar(8,7,1,0) ; } { Mar(8,6,1,0) ; } { Mar(8,5,1,0) ; }
[ 1] { mur(9,0,0,0) ; Str(8) } { Mar(8,8,1,1) ; } { Mar(8,7,1,1) ; } { Mar(8,6,1,1) ; }
[31] { muc(9,0,0,0) ; Stc(8) } { Mac(8,8,1,1) ; } { Mac(8,7,1,1) ; } { Mac(8,6,1,1) ; }
[ 1] { Mar(9,1,0,1) ; Ldr(2) } { mur(9,0,0,1) ; Str(8) } { Mar(8,8,1,2) ; } { Mar(8,7,1,2) ; }
[31] { Mac(9,1,0,1) ; Ldc(2) } { muc(9,0,0,1) ; Stc(8) } { Mac(8,8,1,2) ; } { Mac(8,7,1,2) ; }
[ 1] { Mar(9,2,0,2) ; } { Mar(9,1,0,2) ; Ldr(2) } { mur(9,0,0,2) ; Str(8) } { Mar(8,8,1,3) ; }
[31] { Mac(9,2,0,2) ; } { Mac(9,1,0,2) ; Ldc(2) } { muc(9,0,0,2) ; Stc(8) } { Mac(8,8,1,3) ; }
[ 1] { Mar(9,3,0,3) ; } { Mar(9,2,0,3) ; } { Mar(9,1,0,3) ; Ldr(2) } { mur(9,0,0,3) ; Str(8) }
[31] { Mac(9,3,0,3) ; } { Mac(9,2,0,3) ; } { Mac(9,1,0,3) ; Ldc(2) } { muc(9,0,0,3) ; Stc(8) }
[ 1] { Mar(9,4,1,0) ; } { Mar(9,3,1,0) ; } { Mar(9,2,1,0) ; } { Mar(9,1,1,0) ; Ldr(2) }
[31] { Mac(9,4,1,0) ; } { Mac(9,3,1,0) ; } { Mac(9,2,1,0) ; } { Mac(9,1,1,0) ; Ldc(2) }
[32] { Mar(9,5,1,1) ; } { Mar(9,4,1,1) ; } { Mar(9,3,1,1) ; } { Mar(9,2,1,1) ; }
[32] { Mar(9,6,1,2) ; } { Mar(9,5,1,2) ; } { Mar(9,4,1,2) ; } { Mar(9,3,1,2) ; }
[32] { Mar(9,7,1,3) ; } { Mar(9,6,1,3) ; } { Mar(9,5,1,3) ; } { Mar(9,4,1,3) ; }
[32] { Mar(9,8,2,0) ; } { Mar(9,7,2,0) ; } { Mar(9,6,2,0) ; } { Mar(9,5,2,0) ; }
[Figure: load/FMA/store schedule across slices 0-3 over time, repeated from slide 11]
2016 Aug 23 Paul F Baumeister 13

14 FD-kernel performance on the AMC: [Figure: run cycles vs. stall cycles in kcycles against the ideal, for 32³ and 16³ grids] The effect of the halo regions (4 grid points) is stronger for 16³ than for 32³ (25%): longer rows are better. Bandwidth is a shared resource. Scales up to 32 lanes at 43% floating-point efficiency. [Figure: GFlop/s per AMC over the number of lanes] 2016 Aug 23 Paul F Baumeister 14
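The halo penalty follows from the row length alone. A back-of-the-envelope check, assuming 4 halo points on each side of a row as above:

for n in (16, 32):
    halo = 2 * 4                       # 4 halo points on each side of a row
    print(f"{n}^3 grid: {halo / n:.0%} extra loads per row from halos")
# -> 50% for 16^3 rows, 25% for 32^3 rows: longer rows amortize the halos.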

15 Alternative: inversion of large matrices. DFT based on Green functions: inversion of the Hamiltonian, given as a block-sparse matrix. Allows for the truncation of long-range interactions → order(N) method: KKRnano. [Figure: example of the operator structure for an FCC lattice] 2016 Aug 23 Paul F Baumeister 15

16 Block-sparse matrix-vector multiplication: Performance-critical for the residual minimization iterations (91% of the runtime on BG/Q). Compressed row storage with index lists for the blocks; blocks of 16×16 complex numbers (dp). The contraction over 13 non-zero blocks per row requires a fast multiplication of blocks. AI = 32 kiFlop / 8 KiB = 4.0 Flop/Byte. [Figure: block-sparse matrix of N_Atom block rows times a block vector] 2016 Aug 23 Paul F Baumeister 16
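The quoted intensity follows from the block size. A quick check, counting a complex fused multiply-add as 8 real Flop and assuming the two 16×16 operand blocks dominate the data traffic:

n = 16
flops = 8 * n**3                       # n^3 complex multiply-adds, 8 real Flop each
bytes_moved = 2 * (16 * n * n)         # two 16x16 complex-double blocks, 16 B per entry
print(flops / 1024, "kiFlop;", bytes_moved / 1024, "KiB;",
      flops / bytes_moved, "Flop/Byte")
# -> 32.0 kiFlop; 8.0 KiB; 4.0 Flop/Byte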

17 z16mm — implementation on the AMC: Kernel completely unrolled: 384 VLIWs plus overheads, no branching. Exploit only half of the vector register length: 16 of 32 entries. All slices perform the same operations onto ¼ of the result matrix. Code generation using simple Python scripts.
[ 1] {f1madd(c) imc, vr7.2, sr27, imc; ld8u sr27, ldimb, n16b}{}{}{}
...
[16] {f1madd rec, vr3.2, sr27, rec; }{}{}{}
[15] {f1madd imc, vr7.2, sr27, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr4.3, sr28, imc; ld8u sr28, ldimb, n16b}{}{}{}
[ 1] {f1madd(c) imc, vr5.3, sr29, imc; ld8u sr29, ldimb, n16b}{}{}{}
[16] {f1madd rec, vr0.3, sr28, rec; }{}{}{}
[15] {f1madd imc, vr4.3, sr28, imc; }{}{}{}
[16] {f1madd rec, vr1.3, sr29, rec; }{}{}{}
[15] {f1madd imc, vr5.3, sr29, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr6.3, sr30, imc; ld8u sr30, ldimb, n16b}{}{}{}
[16] {f1madd rec, vr2.3, sr30, rec; }{}{}{}
[15] {f1madd imc, vr6.3, sr30, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr7.3, sr31, imc; ld8u sr31, ldimb, n16b}{}{}{}
[ 1] {f1madd(c) imc, vr0.0, sr16, imc; ld8u sr16, ldreb, n16b}{}{}{}
[16] {f1madd rec, vr3.3, sr31, rec; }{}{}{}
[15] {f1madd imc, vr7.3, sr31, imc; }{}{}{}
// Re(C) -= Im(A)*Im(B)
// Im(C) += Re(A)*Im(B)
[16] {f1nmsub rec, vr4.0, sr16, rec; }{}{}{}
[15] {f1madd imc, vr0.0, sr16, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr1.0, sr17, imc; ld8u sr17, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr5.0, sr17, rec; }{}{}{}
[15] {f1madd imc, vr1.0, sr17, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr2.0, sr18, imc; ld8u sr18, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr6.0, sr18, rec; }{}{}{}
[15] {f1madd imc, vr2.0, sr18, imc; }{}{}{}
[ 1] {f1madd(c) imc, vr3.0, sr19, imc; ld8u sr19, ldreb, n16b}{}{}{}
[16] {f1nmsub rec, vr7.0, sr19, rec; }{}{}{}
[15] {f1madd imc, vr3.0, sr19, imc; }{}{}{}
2016 Aug 23 Paul F Baumeister 17
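The slide states that the unrolled kernel is emitted by simple Python scripts. The following is only a sketch of what such a generator could look like; the register-numbering scheme and the helper emit_column are hypothetical, not the actual script:

# Emit f1madd VLIW lines for a few unrolled steps of the 16x16 complex block
# multiply; only slot 0 of each VLIW is used, as in the listing above.
def emit_column(k, coeff_reg, load=False):
    lsu = f"ld8u {coeff_reg}, ldimb, n16b" if load else ""
    print(f"[ 1] {{f1madd(c) imc, vr{k % 8}.{k // 8}, {coeff_reg}, imc; {lsu}}}{{}}{{}}{{}}")
    print(f"[16] {{f1madd rec, vr{(k + 4) % 8}.{k // 8}, {coeff_reg}, rec; }}{{}}{{}}{{}}")
    print(f"[15] {{f1madd imc, vr{k % 8}.{k // 8}, {coeff_reg}, imc; }}{{}}{{}}{{}}")

for k in range(4):                     # a few unrolled steps for illustration
    emit_column(k, f"sr{16 + k}", load=True)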

18 z16mm — performance on the AMC: Weak scaling with one small matrix-matrix multiplication per lane. 82% floating-point efficiency, i.e. 263 GFlop/s per AMC out of the 320 GFlop/s peak of 32 lanes. About 900 cycles of startup overhead, mostly stall cycles (load latencies). [Figure: run cycles vs. minimal stall cycles in kcycles over the number of lanes] 2016 Aug 23 Paul F Baumeister 18

19 Effect onto KKRnano: Distribute independent matrix rows to the lanes. Chain the multiplications of all non-zero blocks in a row, so there is no need to spill the accumulator matrix → AI = 3.5 Flop/Byte. Reducing the startup overhead (stack loading, indirection round trips, etc.) → 98%, 312 GFlop/s per AMC. A single AMC could speed up KKRnano by 5.5x and reduce energy-to-solution by 5x (assuming a BG/Q CPU with 200 GFlop/s for 100 W). For multiple AMCs we need to offload other kernels to exploit the system. [Figure: runtime breakdown pies — kernel on the CPU: 91% vs. 9%; on 1 AMC: 50% vs. 50% after the 5.5x speedup; on 16 AMCs: 6% vs. 94%, a further 2.1x] 2016 Aug 23 Paul F Baumeister 19
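These figures are consistent with a simple Amdahl estimate. A sketch, assuming the block-multiply kernel alone is accelerated by roughly 10x on one AMC (the value implied by the 5.5x overall speedup):

# Amdahl estimate for the KKRnano offload (kernel = 91% of CPU runtime, slide 16).
f = 0.91                               # kernel fraction of the runtime on the CPU
k = 10.0                               # assumed kernel speedup on one AMC
for n_amc, label in [(1, "1 AMC"), (16, "16 AMCs")]:
    t = (1 - f) + f / (k * n_amc)      # runtime relative to the CPU
    print(f"{label}: {1/t:.1f}x overall, kernel share {f/(k*n_amc)/t:.0%}")
# -> ~5.5x with the kernel at ~50% on 1 AMC; on 16 AMCs the kernel share drops
#    to a few percent and Amdahl limits the further gain to about 2x.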

20 Conclusions and outlook: The Active Memory Cube is an in-memory processing architecture; CPU and lanes share one address space. Favorable Flop/W performance: ~32 GFlop/s per Watt (320 GFlop/s at 10 W TDP). High double-precision floating-point efficiencies for matrix-matrix and stencil operations (and also other kernels¹). Good utilization for density functional theory and similar domains. Potential target architecture for an OpenMP 4 offload model; needs a smart compiler to generate efficient VLIW code. ¹ Accelerating LBM and LQCD Application Kernels by In-Memory Processing; Baumeister, Boettiger, Brunheroto, Hater, Maurer, Nobile, Pleiter; ISC'15 proceedings. 2016 Aug 23 Paul F Baumeister 20
