3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC


3D Cartesian Transport Sweep for Massively Parallel Architectures on top of PaRSEC
9th Scheduling for Large Scale Systems Workshop, Lyon
S. Moustafa, M. Faverge, L. Plagne, and P. Ramet
LaBRI, Inria Bordeaux Sud-Ouest, EDF R&D SINETICS
August 19, 2014

1 Context and goals

Guideline
- Context and goals
- Parallelization Strategies
- Sweep Theoretical Model
- DOMINO on top of PaRSEC
- Results
- Conclusion and future works

Context
EDF R&D is looking for a fast reference solver (PhD student: Salli Moustafa).
Industrial solvers: diffusion approximation (SP1); COCAGNE (SPN).
Solutions involve more than 10^11 degrees of freedom (DoFs) and can be obtained with probabilistic solvers (very long computation time) or with deterministic solvers.
DOMINO (SN) is designed for this validation purpose.

DOMINO: Discrete Ordinates Method In NeutrOnics
Deterministic, Cartesian, 3D solver.
3 levels of discretization:
- energy (G): multigroup formalism;
- angle (Ω): Level Symmetric quadrature, N(N+2) directions;
- space (x, y, z): Diamond Differencing scheme (order 0).
3 nested levels of iterations:
- power iterations + Chebyshev acceleration;
- multigroup iterations: Gauss-Seidel algorithm;
- scattering iterations + DSA acceleration (using the SPN solver): the spatial sweep, which consumes most of the computation time.

The Sweep Algorithm

forall octants o do
    forall cells c = (i, j, k) do
        forall directions d = (ν, η, ξ) in Directions[o] do
            εx = 2ν/Δx;   εy = 2η/Δy;   εz = 2ξ/Δz;
            ψ[o][c][d]   = (εx·ψL + εy·ψB + εz·ψF + S) / (εx + εy + εz + σt);
            ψR[o][c][d]  = 2·ψ[o][c][d] − ψL[o][c][d];
            ψT[o][c][d]  = 2·ψ[o][c][d] − ψB[o][c][d];
            ψBF[o][c][d] = 2·ψ[o][c][d] − ψF[o][c][d];
            φ[k][j][i]   = φ[k][j][i] + ψ[o][c][d]·ω[d];
        end
    end
end

9 add or sub, 11 mul, 1 div (counted as 5 flops): 25 flops per cell, per direction, per energy group.
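For readers more comfortable with code, the hedged C sketch below implements this per-cell, per-direction update. The names, the struct, and the scalar flux layout are illustrative assumptions, not the actual DOMINO kernel (which is vectorized over directions and energy groups).

/* Hedged sketch of the diamond-differencing cell update for one octant,
 * one cell, and one direction d = (nu, eta, xi) with quadrature weight w.
 * All names and the pointer-based layout are illustrative assumptions. */
typedef struct {
    double nu, eta, xi;  /* direction cosines */
    double w;            /* quadrature weight */
} direction_t;

static void cell_update(const direction_t *d,
                        double psi_L, double psi_B, double psi_F,       /* incoming fluxes */
                        double *psi_R, double *psi_T, double *psi_BF,   /* outgoing fluxes */
                        double *phi,               /* scalar flux moment of the cell */
                        double S, double sigma_t,  /* source and total cross section */
                        double dx, double dy, double dz)
{
    const double ex = 2.0 * d->nu  / dx;   /* epsilon_x */
    const double ey = 2.0 * d->eta / dy;   /* epsilon_y */
    const double ez = 2.0 * d->xi  / dz;   /* epsilon_z */

    /* cell-centered angular flux */
    const double psi = (ex * psi_L + ey * psi_B + ez * psi_F + S)
                     / (ex + ey + ez + sigma_t);

    /* outgoing fluxes by diamond differencing */
    *psi_R  = 2.0 * psi - psi_L;
    *psi_T  = 2.0 * psi - psi_B;
    *psi_BF = 2.0 * psi - psi_F;

    /* accumulate the scalar flux moment */
    *phi += psi * d->w;
}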

The Spatial Sweep (Diamond Differencing scheme) (1/2)
3D regular mesh with, per cell, per angle, per energy group:
- 1 moment to update;
- 3 incoming fluxes;
- 3 outgoing fluxes.

The Spatial Sweep (Diamond Differencing scheme) (2/2)
2D example of the spatial mesh for one octant: at the beginning, data are known only on the incoming faces; cells whose incoming fluxes are available (ready cells) are processed, and after a few steps the sweep advances as a front of ready cells behind the processed ones.
[Figure legend: processed cell, ready cell]

2 Parallelization Strategies

Many opportunities for parallelism
Each level of discretization is a potentially independent computation:
- energy groups: all energy groups are computed together;
- angles: all angles are considered independent (this is not true for all boundary conditions);
- space: all cell updates on a front are independent.

Angular Parallelization Level (Very Low Level)
Several directions belong to the same octant:
- vectorization of the computation;
- the use of SIMD units at processor/core level improves kernel performance.

Spatial Parallelization, first level: granularity
Grouping cells in MacroCells:
- reduces thread scheduling overhead;
- similar to exploiting BLAS 3;
- reduces overall parallelism.

Octant Parallelization: Case of Vacuum Boundary Conditions
When using vacuum boundary conditions, all octants are independent from each other.
Concurrent accesses to a cell (or MacroCell) are protected by mutexes, as sketched below.
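As a minimal sketch of that protection (assuming a pthread-based runtime; the struct and function names are hypothetical), each MacroCell can carry its own mutex guarding the scalar-flux accumulation performed by concurrently sweeping octants:

#include <pthread.h>

/* Hypothetical MacroCell carrying its own lock: with vacuum boundary
 * conditions the 8 octants sweep concurrently, but they all accumulate
 * into the same scalar flux moments of a MacroCell. */
typedef struct {
    double          phi;   /* scalar flux moment (one shown for simplicity) */
    pthread_mutex_t lock;  /* serializes concurrent octant contributions    */
} macrocell_t;

static void accumulate_flux(macrocell_t *mc, double psi, double weight)
{
    pthread_mutex_lock(&mc->lock);
    mc->phi += psi * weight;      /* protected read-modify-write */
    pthread_mutex_unlock(&mc->lock);
}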

3 Sweep Theoretical Model

Basic formulas
We define the efficiency of the sweep algorithm as follows:

ε = (N_tasks · T_task) / ((N_tasks + N_idle) · (T_task + T_comm))
  = 1 / ((1 + N_idle/N_tasks) · (1 + T_comm/T_task))

Objective: minimize N_idle.

For a 3D block distribution
The minimal number of idle steps is the number required to reach the cube center:

N_idle^min = (P_x + δ_x)/2 + (P_y + δ_y)/2 + (P_z + δ_z)/2,   where δ_u = 0 if P_u is even, 1 otherwise.

Objective: minimize the sum P + Q + R, where P × Q × R is the process grid. The hybrid MPI-thread implementation allows this.
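To make the model concrete, the standalone C sketch below (illustrative only, not part of DOMINO) evaluates N_idle^min for a grid and, for a hypothetical count of 64 processes, enumerates the 3D grids to pick the one minimizing P + Q + R.

#include <stdio.h>

/* Idle steps needed to reach the center of a P x Q x R process grid,
 * following the formula above: sum over dimensions of (P_u + delta_u)/2. */
static int min_idle_steps(int p, int q, int r)
{
    return (p + (p & 1)) / 2 + (q + (q & 1)) / 2 + (r + (r & 1)) / 2;
}

int main(void)
{
    /* Hypothetical example: enumerate the 3D grids for 64 processes and
     * keep the one minimizing P + Q + R (and hence the idle steps). */
    const int nprocs = 64;
    int best_p = nprocs, best_q = 1, best_r = 1;
    for (int p = 1; p <= nprocs; ++p) {
        if (nprocs % p) continue;
        for (int q = 1; q <= nprocs / p; ++q) {
            if ((nprocs / p) % q) continue;
            int r = nprocs / (p * q);
            if (p + q + r < best_p + best_q + best_r) {
                best_p = p; best_q = q; best_r = r;
            }
        }
    }
    printf("grid %dx%dx%d: %d idle steps\n",
           best_p, best_q, best_r, min_idle_steps(best_p, best_q, best_r));
    return 0;
}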

Hybrid Model
[Figure: hybrid sweep model]

4 DOMINO on top of PaRSEC

Implementation
Only one kind of task, associated to one MacroCell:
- all energy groups;
- all directions included in one octant;
- 8 tasks per MacroCell;
- no dependencies from one octant to another: concurrent updates are protected by mutexes;
- simple algorithm to write in JDF.
Requires a data distribution:
- independent from the algorithm: 2D, 3D, cyclic or not, ...
- for now: Block-3D (non cyclic) with a P × Q × R grid.
Fluxes on faces are dynamically allocated/freed by the runtime.

DOMINO JDF Representation (2D)

CellUpdate(a, b)

/* Execution Space */
a = 0 .. ncx-1
b = 0 .. ncy-1

/* Task Locality (Owner Compute) */
: mcg(a, b)

/* Data dependencies */
RW X <- (a != abeg) ? X CellUpdate(a-ainc, b) : X READ_X(b)
     -> (a != aend) ? X CellUpdate(a+ainc, b)
RW Y <- (b != bbeg) ? Y CellUpdate(a, b-binc) : Y READ_Y(a)
     -> (b != bend) ? Y CellUpdate(a, b+binc)
RW MCG <- mcg(a, b)
       -> mcg(a, b)

BODY
{
    solve( MCG, X, Y, ... );
}
END

abeg, aend, ainc, bbeg, bend, and binc are octant-dependent variables.
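To make the dependency pattern explicit, the hedged C sketch below gives one sequential schedule admitted by this 2D task graph for a single octant with ainc = binc = +1; the solve stub and array layout are illustrative assumptions. Task (a, b) consumes X from task (a-1, b) and Y from task (a, b-1), so X[b] flows along a and Y[a] flows along b.

/* Stub standing in for the MacroCell kernel called in the JDF BODY. */
static void solve(double *mcg, double *x, double *y)
{
    (void)mcg; (void)x; (void)y;  /* would perform the MacroCell sweep */
}

/* Sequential equivalent of the CellUpdate task graph for one octant with
 * ainc = binc = +1: X[b] carries the flux crossing the face between
 * task (a-1, b) and task (a, b); Y[a] plays the same role along b. */
static void sweep_one_octant(int ncx, int ncy,
                             double *mcg, /* ncx * ncy MacroCell moments */
                             double *X,   /* ncy fluxes entering along a */
                             double *Y)   /* ncx fluxes entering along b */
{
    for (int a = 0; a < ncx; ++a)       /* any loop order compatible with  */
        for (int b = 0; b < ncy; ++b)   /* the dependencies gives the same */
            solve(&mcg[a * ncy + b], &X[b], &Y[a]);  /* result */
}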

5 Results

Scalability of the existing implementation with Intel TBB
32-core Nehalem node with two 4-way SIMD units, running at 2.26 GHz;
2 energy groups; S8 Level Symmetric quadrature (80 angular directions);
spatial meshes: 120 × 120 × 120 and 480 × 480 × 480.
[Figure: time (s) versus core number (1 to 32) for cases 120, 120+OctPar, 480, and 480+OctPar]

Shared Memory Results: Comparison with Intel TBB
1 energy group; mesh size: 480 × 480 × 480; 1 to 8 cores.
[Figures: Gflop/s versus core number for PaRSEC, TBB, and the peak, for three quadratures]
- Level Symmetric S2: 7.9 Gflop/s (4.6%);
- Level Symmetric S8: 57.2 Gflop/s (33.5%);
- Level Symmetric S16: 92.6 Gflop/s (54.2%).

Shared Memory Results: Comparison with Intel TBB
Test on the manumanu NUMA node: 160 cores; Level Symmetric S16.
[Figure: Gflop/s versus core number (16 to 160) for PaRSEC, TBB, and the peak]

Distributed Memory Results (Ivanoe)
1 energy group; mesh size: 480 × 480 × 480; Level Symmetric S16.
[Figure: Tflop/s versus core number (12 to 768) for the theoretical peak, Adams et al. (no comm), Adams et al. ML, the hybrid model (no comm), the hybrid model, PaRSEC hybrid, and PaRSEC full MPI]

Distributed Memory Results (Athos)
1 energy group; mesh size: 480 × 480 × 480; Level Symmetric S16.
[Figure: Tflop/s versus core number (24 to 6144) for PaRSEC hybrid and the hybrid model without communications]

Distributed Memory Results (Athos)
1 energy group; mesh size: 120 × 120 × 120; Level Symmetric S16.
[Figure: Tflop/s versus core number (12 to 768) for the hybrid model without communications, the hybrid model, and PaRSEC]

Distributed Memory Results
Execution traces for a run on 8 nodes (2 × 2 × 2 grid): one with a bad scheduling, one with a good scheduling.

Scheduling by front
[Figures: traces for the Disco, NoPrio, and Prio scheduling strategies]

6 Conclusion and future works

Conclusion and Future Work

Conclusion:
- Efficient implementation on top of PaRSEC, written in less than 2 weeks;
- Comparable to Intel TBB in shared memory;
- Multi-level implementation: code vectorization (angular directions), block algorithm (MacroCells), hybrid MPI-thread implementation.

Future work:
- Finish the hybrid model to get a better evaluation of the performance;
- Experiments on Intel Xeon Phi.

Thanks!