Computational Numerical Integration for Spherical Quadratures Verified by the Boltzmann Equation

Computational Numerical Integration for Spherical Quadratures Verified by the Boltzmann Equation
Huston Rogers (1); Glenn Brook, Mentor (2); Greg Peterson, Mentor (2)
(1) The University of Alabama
(2) Joint Institute for Computational Sciences, University of Tennessee

Outline
1. Motivation: The Basic Problem; Coding
2. Grid Refinement: Utilizing the Ideal Grid

The Basic Problem: Spherical Quadrature
Spherical quadratures are more natural to use with the Boltzmann integrals.
Create a grid of pieces of a sphere.
Built around the relationship between Cartesian and spherical integration: dV_cart = ρ² sin(θ) dV_sphere, where dV_cart = dx dy dz and dV_sphere = dρ dθ dφ.
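
For reference, the ρ² sin(θ) factor is the Jacobian of the Cartesian-to-spherical change of variables; a short LaTeX derivation, assuming the usual convention that θ is the polar angle (as the sin(θ) weight implies), is:

```latex
% Change of variables (x, y, z) -> (rho, theta, phi), with theta the polar angle:
%   x = rho sin(theta) cos(phi),  y = rho sin(theta) sin(phi),  z = rho cos(theta)
\[
  dV_{\mathrm{cart}} = dx\,dy\,dz
  = \left|\det\frac{\partial(x,y,z)}{\partial(\rho,\theta,\phi)}\right| d\rho\,d\theta\,d\phi
  = \rho^{2}\sin(\theta)\, d\rho\,d\theta\,d\phi
  = \rho^{2}\sin(\theta)\, dV_{\mathrm{sphere}}
\]
```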

Numerical quadrature implementation
The Boltzmann integrals are computed using the trapezoidal rule; known values from a verified experiment can be used to check the accuracy of the program. For the chosen verified experimental values, the Maxwellian distribution function in the Boltzmann integrals is known. Because of the interdependency of the integrals, the integration was separated into two stages; values from the first stage were used in the second. We can then integrate, using a parallel implementation of the chosen quadrature, to recover the original bulk values. Grid refinement can then be used to improve the accuracy of the program and to measure the convergence of the numerical quadrature method.
∫_S f ρ² sin(θ) dv = Density
(1 / Density) ∫_S f v_i ρ² sin(θ) dv = U_i
(1 / (3 · Density)) ∫_S f c² ρ² sin(θ) dv = Temperature
where c² = (V_x − U_x)² + (V_y − U_y)² + (V_z − U_z)² and dv = dρ dθ dφ.
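
A minimal serial sketch of this two-stage trapezoidal quadrature is given below. It is not the project's actual code: the grid sizes, the maxwellian() helper, and the bulk parameters n0, T0, U0 are illustrative assumptions.

```c
/* Minimal serial sketch of the two-stage trapezoidal quadrature (not the project's
 * actual code).  Grid sizes, maxwellian(), and n0, T0, U0 are illustrative. */
#include <math.h>
#include <stdio.h>

#define NR  100    /* points in rho   */
#define NTH 180    /* points in theta */
#define NPH 360    /* points in phi   */

static const double n0 = 1.0, T0 = 1.0;        /* assumed bulk density and temperature */
static const double U0[3] = {0.5, 0.0, 0.0};   /* assumed bulk velocity                */
static const double RMAX = 10.0;               /* sphere radius, as in the test case   */

/* Maxwellian f(v) = n (2 pi T)^(-3/2) exp(-|v - U|^2 / (2T)), with m/k_B folded into T */
static double maxwellian(double vx, double vy, double vz)
{
    double cx = vx - U0[0], cy = vy - U0[1], cz = vz - U0[2];
    return n0 * pow(2.0 * M_PI * T0, -1.5) * exp(-(cx*cx + cy*cy + cz*cz) / (2.0 * T0));
}

/* One trapezoidal sweep over the spherical grid.  With have_U == 0 it accumulates the
 * density and momentum integrals; otherwise it accumulates the c^2 (temperature) moment
 * about the bulk velocity (Ux, Uy, Uz) recovered in the first stage. */
static void sweep(int have_U, double Ux, double Uy, double Uz,
                  double *density, double mom[3], double *c2mom)
{
    double dr = RMAX / (NR - 1), dth = M_PI / (NTH - 1), dph = 2.0 * M_PI / (NPH - 1);
    for (int i = 0; i < NR; ++i)
        for (int j = 0; j < NTH; ++j)
            for (int k = 0; k < NPH; ++k) {
                double rho = i * dr, th = j * dth, ph = k * dph;
                /* trapezoidal end-point weights times the Jacobian rho^2 sin(theta) */
                double w = ((i == 0 || i == NR  - 1) ? 0.5 : 1.0)
                         * ((j == 0 || j == NTH - 1) ? 0.5 : 1.0)
                         * ((k == 0 || k == NPH - 1) ? 0.5 : 1.0)
                         * rho * rho * sin(th) * dr * dth * dph;
                double vx = rho * sin(th) * cos(ph);
                double vy = rho * sin(th) * sin(ph);
                double vz = rho * cos(th);
                double f = maxwellian(vx, vy, vz);
                if (!have_U) {
                    *density += f * w;
                    mom[0] += f * vx * w;  mom[1] += f * vy * w;  mom[2] += f * vz * w;
                } else {
                    double cx = vx - Ux, cy = vy - Uy, cz = vz - Uz;
                    *c2mom += f * (cx*cx + cy*cy + cz*cz) * w;
                }
            }
}

int main(void)
{
    double density = 0.0, mom[3] = {0.0, 0.0, 0.0}, c2mom = 0.0;

    sweep(0, 0.0, 0.0, 0.0, &density, mom, &c2mom);            /* stage 1 */
    double Ux = mom[0] / density, Uy = mom[1] / density, Uz = mom[2] / density;

    sweep(1, Ux, Uy, Uz, &density, mom, &c2mom);               /* stage 2 */
    double T = c2mom / (3.0 * density);

    /* the recovered bulk values should match n0, U0, and T0 up to quadrature error */
    printf("density %.6f  U (%.4f %.4f %.4f)  T %.6f\n", density, Ux, Uy, Uz, T);
    return 0;
}
```

Compiled with something like cc -O2 sketch.c -lm, this should recover the assumed bulk values to within the quadrature error; the parallel version distributes the sweep loops across MPI ranks and OpenMP threads as described on the following slides.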

Code Characteristics
The code is designed to run on Beacon, a next-generation Green500 supercomputer based on the Intel® Xeon Phi™ coprocessor architecture.
Step 1: Parallelize on the Xeon host processors. 8 OpenMP threads are mapped to the 8 cores of the Xeon processor. MPI is used to distribute processes across multiple processors; micmpiexec is used with the -np flag to specify the number of MPI processes. MPI_Bcast, MPI_Reduce, and MPI_Allreduce are used as necessary to sum the needed values and share them among ranks.
Step 2: Migrate the code for use with the Intel® Many Integrated Core™ architecture.
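
As a rough illustration of the Step 1 threading, one rank's share of the quadrature could be threaded with OpenMP as below; the loop body and bound are placeholders, not the project's actual kernel.

```c
/* Hedged sketch of the Step 1 OpenMP threading on the Xeon host: 8 threads, one per
 * core, combined with a reduction.  The loop body and bound are placeholders. */
#include <omp.h>
#include <math.h>
#include <stdio.h>

#define NR 100

int main(void)
{
    omp_set_num_threads(8);                 /* 8 OpenMP threads on the 8-core Xeon host */

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NR; ++i) {
        /* placeholder for one rho-shell of this rank's trapezoidal quadrature */
        local_sum += sin((double)i / NR);
    }

    printf("rank-local partial sum = %f\n", local_sum);
    return 0;
}
```

Per the slide, the MPI processes that carry these rank-local sums are launched through micmpiexec with the -np flag; the specific binary name and any other launch options would be site- and job-specific.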

Pseudocode
1. Initialize the distribution function, the parameters, and the underlying mesh.
2. Assign density and moment integration tasks to MPI ranks via the mesh decomposition and begin computations.
3. Sum and reduce the density and moment integral computations; share the results among the MPI ranks.
4. Use the first-stage results to compute the temperature integral on each MPI rank; reduce and sum the result from all ranks (a sketch of this two-stage pattern follows below).
5. Finalize and end.
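
Steps 3 and 4 describe a two-stage reduction; a hedged, self-contained sketch of that communication pattern is below, where the partial_* helpers are hypothetical placeholders for each rank's local trapezoidal sums.

```c
/* Hedged sketch of the two-stage reduction in the pseudocode above.  The partial_*
 * helpers are hypothetical placeholders, not the project's actual routines. */
#include <mpi.h>
#include <stdio.h>

static double partial_density(void)         { return 1.0; }                      /* placeholder */
static void   partial_momentum(double m[3]) { m[0] = 0.5; m[1] = 0.0; m[2] = 0.0; }
static double partial_c2(const double U[3]) { (void)U; return 3.0; }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Stage 1: density and momentum integrals; every rank needs the totals,
     * so MPI_Allreduce is used. */
    double loc_n = partial_density(), loc_m[3], density, mom[3], U[3];
    partial_momentum(loc_m);
    MPI_Allreduce(&loc_n, &density, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(loc_m, mom, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < 3; ++i)
        U[i] = mom[i] / density;

    /* Stage 2: the temperature integral needs the bulk velocity U from stage 1;
     * only rank 0 needs the final total, so MPI_Reduce suffices. */
    double loc_c2 = partial_c2(U), c2 = 0.0;
    MPI_Reduce(&loc_c2, &c2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("density %f  U (%f %f %f)  T %f\n",
               density, U[0], U[1], U[2], c2 / (3.0 * density));

    MPI_Finalize();
    return 0;
}
```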

Defining Error
The next step of the project was to determine an appropriate method for evaluating error. The error in each of the three recorded values (ρ, θ, φ) was tracked. The evaluated error was taken to be the maximum of the absolute errors of the three values. The evaluated error was then analyzed against computation time and the relative fineness of the grid.
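
A minimal sketch of this error metric, assuming the individual errors of the three recorded values have already been computed (the function and argument names are illustrative, not taken from the project's code):

```c
/* The evaluated error is the maximum of the absolute errors of the three recorded
 * values; the names here are illustrative. */
#include <math.h>

double evaluated_error(double err_rho, double err_theta, double err_phi)
{
    return fmax(fabs(err_rho), fmax(fabs(err_theta), fabs(err_phi)));
}
```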

Experimental Approach
We began with an initial grid, then refined in all three directions simultaneously by a factor of 2. We identified a number of gridpoints beyond which the decrease in error was not significant. Using this as an upper bound, we then refined in each direction separately, attempting to find a "sweet spot" of overall error at the lowest grid complexity (a sketch of this refinement loop follows below).
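
The uniform-refinement stage of this approach might be driven by a loop of the following shape; run_quadrature(), the starting grid, and the tolerance are hypothetical stand-ins, and only the factor-of-2 refinement is taken from the slide.

```c
/* Hedged sketch of the uniform refinement study: double the grid in all three
 * directions until the decrease in the evaluated error is no longer significant.
 * run_quadrature(), the starting grid, and TOL are hypothetical stand-ins. */
#include <stdio.h>

/* hypothetical stand-in: pretend the evaluated error decays with total grid size */
static double run_quadrature(int nr, int nth, int nph)
{
    return 1.0 / ((double)nr * nth * nph);
}

int main(void)
{
    int nr = 25, nth = 90, nph = 360;              /* illustrative starting grid */
    const double TOL = 1e-9;                       /* "non-significant decrease" threshold */
    double prev = run_quadrature(nr, nth, nph);

    for (;;) {
        nr *= 2; nth *= 2; nph *= 2;               /* refine all three directions by 2 */
        double err = run_quadrature(nr, nth, nph);
        printf("(%d, %d, %d): evaluated error %.3e\n", nr, nth, nph, err);
        if (prev - err < TOL)                      /* improvement too small: use this size */
            break;                                 /* as the upper bound for per-direction */
        prev = err;                                /* refinement sweeps                    */
    }
    return 0;
}
```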

[Figure: Accuracy vs. fineness of grid, from the determination of the appropriate grid.]

[Figure: Accuracy vs. time of computation, from the determination of the appropriate grid.]

Relative MPI Speedup
The test-case sphere had a radius of 10. In the best case, the number of gridpoints in (ρ, θ, φ) was (100, 720, 2880). Ideal speedup means that each time the number of MPI processes doubles, the time taken for the computation halves. The data show the point at which the number of communications between processes begins to affect the computation time.
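
For reference, relative speedup and parallel efficiency under this doubling scheme can be tabulated as below; the timing values are placeholders, not measurements from Beacon.

```c
/* Sketch of the relative-speedup bookkeeping: ideal speedup doubles each time the
 * number of MPI processes doubles.  The times[] values are placeholders, not
 * measured results from Beacon. */
#include <stdio.h>

int main(void)
{
    const int    procs[] = {1, 2, 4, 8, 16};
    const double times[] = {100.0, 51.0, 26.0, 14.0, 9.0};   /* placeholder seconds */
    const int n = (int)(sizeof procs / sizeof procs[0]);

    for (int i = 0; i < n; ++i) {
        double speedup    = times[0] / times[i];              /* S(p) = T(1) / T(p)  */
        double efficiency = speedup / procs[i];               /* relative to ideal p */
        printf("p = %2d  speedup %5.2f  efficiency %4.2f\n", procs[i], speedup, efficiency);
    }
    return 0;
}
```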

Further Goals and Applications
Utilize offload statements to make effective use of Beacon's Many Integrated Core architecture in the computation.
Further optimize to reduce the resources or the number of communications required.
Apply the code to other problems that require a spherical quadrature.

Appendix: References and Acknowledgements
R. G. Brook, "A parallel, matrix-free Newton method for solving approximate Boltzmann equations on unstructured topologies," Ph.D. dissertation, University of Tennessee at Chattanooga, 2008.
Work was conducted using National Institute for Computational Sciences compute resources.
Supported in part by The University of Tennessee, Knoxville, the Joint Institute for Computational Sciences, Oak Ridge National Laboratory, and the National Science Foundation.