Stochastic Modelling of Electron Transport on different HPC architectures

Stochastic Modelling of Electron Transport on different HPC architectures www.hp-see.eu E. Atanassov, T. Gurov, A. Karaivanova, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, (emanouil, gurov, anet)@parallel.bas.bg. Supported by SuperCA++, Grant #ДЦВП02/1 with NSF of Bulgaria.

OUTLINE
- Bulgarian and regional HPC resources
- Monte Carlo modelling of semiconductor devices
- Improvements to Monte Carlo
- Numerical results
- Conclusions and future work

Bulgarian HPC Infrastructure
- The biggest HPC resource for research in Bulgaria is the supercomputer IBM BlueGene/P with 8192 cores.
- Two HPC clusters with Intel CPUs and Infiniband interconnection at IICT-BAS and IOCCP-BAS (vendors: HP and Fujitsu): 576 CPU cores and 4x 480 GPU cores, plus an 800-core HPC Linux cluster.
- In addition, GPU-enabled servers equipped with state-of-the-art GPUs are available for applications that can take advantage of them.
- The centers are connected by 1 Gb/s Ethernet fiber-optic links (1 Gbps / 100 Mbps).

Bulgarian HPC Resources: HPC Cluster at IICT-BAS
- 3 chassis HP Cluster Platform Express 7000, 36 blades BL280c, dual Intel Xeon X5560 @ 2.8 GHz (total 576 cores), 24 GB RAM
- 8 servers HP DL380 G6, dual Intel Xeon X5560 @ 2.8 GHz, 32 GB RAM
- Fully non-blocking DDR Infiniband interconnection: Voltaire Grid Director 2004 non-blocking DDR Infiniband switch; 2 disk arrays with 96 TB; 2 Lustre file systems
- Peak performance 3.2 TFlop/s; achieved performance more than 3 TFlop/s (92% efficiency)
- HP ProLiant SL390s G7 server with 4 M2090 graphics cards

Regional HPC Infrastructure
The HP-SEE project provides access to regional HPC centers:
- BlueGene/P in Romania, 4096 cores
- several HPC clusters with Infiniband
- one SMP machine with 1152 cores, 6 TB RAM, 10 TFlop/s, Intel Xeon X7542 (Nehalem-EX) @ 2.67 GHz
- GPU capabilities are being added in several installations.

Simulation of electron transport in semiconductors
Application area: SET is developed for solving various computationally intensive problems which describe ultrafast carrier transport in semiconductors.
Expected results and their consequences: the application studies memory and quantum effects during the relaxation process due to electron-phonon interaction in semiconductors; the present version explores electron kinetics in GaAs nanowires. Studying the quantum effects that occur at the nanometer and femtosecond scale has important scientific results: novel advanced methods and the investigation of novel physical phenomena.

Quantum-kinetic equation (inhomogeneous case) The integral form of the equation: Kernels:

Quantum-kinetic equation (cont.) Bose function: The phonon energy (ħω) depends on : Electron energy: The electron-phonon coupling constant according to Fröhlich polar optical interaction: The Fourier transform of the square of the ground state wave function:

Monte Carlo method Backward time evolution of the numerical trajectories Wigner function: Energy (or momentum) distribution: Density distribution:

Monte Carlo Method (cont.) Biased MC estimator: Weights: The Markov chain: Initial density function Transition density function:

Monte Carlo
The biased MC estimator:

\xi_s[J_g(f)] = \frac{g(z,k_z,t)}{p_{in}(z,k_z,t)} W_0 f_{w,0}(\cdot,k_z,0) + \frac{g(z,k_z,t)}{p_{in}(z,k_z,t)} \sum_{j=1}^{s} W_j^{\alpha} f_{w,0}(\cdot, k_{z,j}^{\alpha}, t_j),

where

f_{w,0}(\cdot,k_{z,j}^{\alpha},t_j) =
  f_{w,0}(z + h(k_{z,j-1}, q_{z,j}, t_{j-1}, t'_j, t_j), k_{z,j}, t_j),   if \alpha = 1,
  f_{w,0}(z + h(k_{z,j-1}, q_{z,j}, t_{j-1}, t'_j, t_j), k_{z,j-1}, t_j), if \alpha = 2,

W_j^{\alpha} = W_{j-1}^{\alpha} \frac{K_{\alpha}(k_{z,j-1}, k_j, t'_j, t_j)}{p_{\alpha}\, p_{tr}(k_{j-1}, k_j, t'_j, t_j)}, \quad W_0^{\alpha} = W_0 = 1, \quad \alpha = 1, 2, \quad j = 1, \ldots, s.

The functional is estimated by averaging N independent realizations:

\frac{1}{N} \sum_{i=1}^{N} \left( \xi_s[J_g(f)] \right)_i \approx J_g(f).
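The structure above, a Markov chain whose multiplicative weights divide the kernel by the transition density, is the classic von Neumann-Ulam scheme for integral equations. A minimal, self-checking sketch on a toy 1-D integral equation with a constant kernel (not the SET kernels themselves; all names are illustrative):

```python
import random

def neumann_mc(n_samples, s=30, lam=0.5, rng=None):
    """Weighted Markov-chain (von Neumann-Ulam) estimator for the toy
    integral equation f(x) = 1 + lam * ∫₀¹ f(y) dy, whose exact solution
    for lam = 0.5 is the constant f = 2."""
    rng = rng or random.Random(1)
    total = 0.0
    for _ in range(n_samples):
        w, acc = 1.0, 0.0          # W_0 = 1
        for _ in range(s + 1):
            acc += w * 1.0         # add W_j * f_0(y_j), with f_0 = 1
            y = rng.random()       # transition density p_tr = 1 on [0,1]
            w *= lam / 1.0         # W_j = W_{j-1} * K / p_tr (K = lam, so
                                   # y does not enter K in this toy case)
        total += acc
    return total / n_samples
```

Because the kernel is constant, every trajectory returns the same value here; the example only checks the weight bookkeeping, not variance behaviour.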

Monte Carlo modelling of semiconductor devices
- The variance increases exponentially with respect to the relaxation time T.
- The application requires accumulating the results of billions of trajectories.
- Improvements in variance and execution time can be achieved with low-discrepancy sequences (quasirandom numbers).
- The use of quasirandom numbers requires a robust and flexible implementation, since it is not feasible to ignore failures and missing results of some trajectories, unlike in Monte Carlo.
- GPU resources are efficient in computations using the low-discrepancy sequences of Sobol, Halton, etc.
- Variance reduction in the case of pure MC can be achieved using different transition density functions.
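The last point, changing the transition density to reduce variance, is plain importance sampling. A minimal sketch on a toy integral (∫₀¹ eˣ dx = e − 1) with an assumed tilted density p(x) = (1+x)/1.5; the function name and parameters are illustrative, not from the SET code:

```python
import math
import random

def estimate(n, use_importance, rng):
    """Estimate I = ∫₀¹ eˣ dx = e − 1 with plain vs importance sampling."""
    total = 0.0
    for _ in range(n):
        u = rng.random()
        if use_importance:
            x = math.sqrt(1.0 + 3.0 * u) - 1.0  # inverse CDF of p(x) = (1+x)/1.5
            w = 1.5 / (1.0 + x)                 # importance weight 1/p(x)
        else:
            x, w = u, 1.0                       # uniform density p(x) = 1
        total += w * math.exp(x)
    return total / n
```

Both estimators are unbiased; the tilted density lowers the variance because w·eˣ varies less over [0, 1] than eˣ itself, which is the same rationale as optimising the transfer functions above.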

Quasirandom approach
We adopted a hybrid approach, where evolution times are sampled using the modified Halton sequence and space parameters are modeled using pseudorandom sequences.
Scrambled modified Halton sequence [Atanassov 2003]:

x_n^{(i)} = \sum_{j=0}^{m_i} \left( (a_j^{(i)}(n)\, k_i^{\,j+1} + b_j^{(i)}) \bmod p_i \right) p_i^{-j-1},

where a_j^{(i)}(n) are the digits of n in base p_i (the i-th prime), and the scramblers b_j^{(i)} and modifiers k_i lie in [0, p_i − 1].
The use of quasirandom numbers offers a significant advantage because the rate of convergence is almost O(1/N) versus O(1/\sqrt{N}) for regular pseudorandom numbers. The disadvantage is that it is not acceptable to lose part of the computations; the execution mechanism should therefore be more robust and lead to repeatable results.
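A minimal sketch of one coordinate of such a generator, assuming a single modifier k and a per-digit scrambler array b (the published construction is more elaborate, with admissible modifiers chosen per prime):

```python
def modified_halton(n, p, k, b):
    """n-th point, base-p coordinate, of a scrambled modified Halton sequence:
    each digit a_j of n is mapped to (a_j * k^(j+1) + b_j) mod p."""
    x, scale, j = 0.0, 1.0 / p, 0
    while n > 0:
        a_j = n % p                                   # j-th digit of n in base p
        digit = (a_j * pow(k, j + 1, p) + b[j]) % p   # modify and scramble
        x += digit * scale
        n //= p
        scale /= p
        j += 1
    return x
```

With k = 1 and all scramblers zero this reduces to the ordinary radical-inverse (van der Corput) function, e.g. modified_halton(5, 2, 1, [0]*32) = 0.625.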

Monte Carlo modelling of semiconductor devices
Variance reduction approach: because of the high variance, it is justified to study and optimize the transfer functions. Thus a parallel version of the genetic optimisation library galib was developed and successfully run on the BlueGene/P. It was used to optimise the transfer function related to the evolution time (instead of a constant). So far the gains are not more than 20%, but we are considering optimising the other kernels, which are more complex and will probably lead to better results.
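The optimisation loop itself is simple to sketch. The real runs used the parallelised galib C++ library; the toy Python version below, with illustrative names and parameters, only shows the shape of the idea: evolve a parameter vector to minimise a fitness such as the estimator's sample variance.

```python
import random

def genetic_minimise(fitness, dim, pop_size=30, generations=60, rng=None):
    """Tiny genetic optimiser: keep the fitter half, breed the rest by
    blend crossover plus Gaussian mutation. Sketch only, not galib's API."""
    rng = rng or random.Random(0)
    pop = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)               # lower fitness = better
        elite = pop[: pop_size // 2]        # elitism: best half survives
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append([(x + y) / 2 + rng.gauss(0, 0.05)
                             for x, y in zip(a, b)])
        pop = elite + children
    return min(pop, key=fitness)
```

In the application the fitness would be the estimated variance of the MC estimator for a candidate transfer function; here a quadratic stands in for it.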

Monte Carlo modelling of semiconductor devices
Various physically interesting quantities, expressed as linear functionals of the solution for the Wigner function, can be computed. Example results for 175 fs relaxation time.

Numerical results
Results on Blue Gene/P:

Cores   Time      Seconds
2048    3:21:22   12082
1024    6:31:38   23498

Numerical results
Results with electric field, 180 fs, on Intel Xeon X5560 @ 2.8 GHz, Infiniband cluster:

Nodes   Cores   Time      Seconds   Samples
8       64      7:03:43   25423     10^9
8       128     5:16:02   18962     10^9
16      128     3:31:45   12705     10^9
16      256     2:39:12   9552      10^9
1       1       27:07     1627      10^6
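A quick check of the strong scaling in the tables above (the function name is illustrative): going from 64 to 256 cores cuts the time from 25423 s to 9552 s, a 2.66x speedup on 4x the cores, i.e. about 67% parallel efficiency, while the BlueGene/P runs (1024 to 2048 cores) scale at about 97%.

```python
def parallel_efficiency(cores_base, secs_base, cores, secs):
    """Strong-scaling efficiency: measured speedup divided by the
    ideal speedup implied by the increase in core count."""
    return (secs_base / secs) / (cores / cores_base)

# Figures from the tables above:
cluster_eff = parallel_efficiency(64, 25423, 256, 9552)       # ~0.665
bluegene_eff = parallel_efficiency(1024, 23498, 2048, 12082)  # ~0.972
```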

Numerical results Time evolution

Using cloud storage for results
- Users register at a web portal and obtain access to cloud storage at IICT-BAS.
- Access via Windows or Linux app.
- Can use curl or libcurl clients from the BlueGene/P.
- The home directory has 72 GB free and is 97% used.

Status of GPU-based version
- Generators for the scrambled Sobol sequence and the modified Halton sequence have been developed and tested. For pseudorandom Monte Carlo we use CURAND.
- Code tested on our PC cluster of GTX 295 cards, our M2090 cards, and Amazon EC2 nodes equipped with M2050 cards ($2 per hour).
- The code has been refactored to enable the main computations to be put in a GPU kernel function. One kernel, related to initialization of the pseudorandom or quasirandom numbers, is invoked once.
- Recent result: the code compiles. What remains to be done: verification, testing and performance tuning.

Conclusions and future work
- The code has excellent scalability on clusters and supercomputers.
- Considering that the problem at hand is highly CPU-intensive, it is justified to attempt to tune the transition densities before moving to more demanding computations.
- Access to cloud storage provides a simple security model (signed HTTP requests), which also offers easy deployment across all the available architectures.