The Nanoscience End-Station and Petascale Computing


1 The Nanoscience End-Station and Petascale Computing
Thomas C. Schulthess
Computer Science and Mathematics Division & Center for Nanophase Materials Sciences
DANSE kickoff meeting, Aug 2006

2 SNS, CNMS, and NCCS - relevant user facilities
SNS: increase in neutron scattering capability (flux & instrument sensitivity)
- materials science
- soft materials
- magnetism
- macromolecular systems
- molecular biophysics
- structural proteomics
CNMS:
- functional nanomaterials
- macromolecular systems
- nanofabrication
- nanocatalysis
- nanomaterials theory: transport, magnetism/spintronics, carbon nanofibers, catalysis, electronic structure, atomistic simulations
NCCS (National Center for Computational Sciences):
- IBM P4 (5 TF); SGI/Xeon (9 TF)
- Cray X1E (18 TF), 1K vector processors
- Cray XT3 (25 TF), 5K Opterons
- Outlook: 100 TF this fall; 250 TF in 2007/8; 1000 TF in 2008/9

3 National Center for Computational Sciences - Leadership Computing Facility (LCF) systems, Feb. 2006:
- Cray XT3 "Jaguar": 25 TF, 44 TB memory, 240 TB shared disk
- Cray X1E "Phoenix": 18 TF, 2 TB memory, 32 TB shared disk
- SGI Altix "Ram": 1.5 TF, 256 CPUs at 1.5 GHz, 2 TB memory, 36 TB shared disk
- IBM SP4 "Cheetah": 4.5 TF, 864 CPUs at 1.3 GHz, 1.1 TB memory, 32 TB shared disk
- SGI Linux "OIC": 8 TF, 1376 CPUs at 3.4 GHz, 2.6 TB memory, 80 TB shared disk
- IBM Linux "NSTG": 0.3 TF, 56 CPUs at 3 GHz, 76 GB memory, 4.5 TB shared disk
- Visualization cluster: 0.5 TF, 128 CPUs at 2.2 GHz, 128 GB memory, 9 TB shared disk
- IBM HPSS: 5 TB disk, 5 PB backup storage
Supercomputers in total: 24,880 CPUs, 52 TB memory, 58 TFlops; many storage devices supported.
Scientific Visualization Lab: 27-projector Power Wall.
Test systems: 96-processor Cray XT3; 32-processor Cray X1E; 16-processor SGI Altix.
Evaluation platforms: 144-processor Cray XD1 with FPGAs; SRC Mapstation; ClearSpeed.

4 LCF plan for the next 5 years
- Cray X1E (vector architecture: global memory, powerful CPUs): 18 TF
- Cray XT3 (cluster architecture: low latency, high bandwidth, scalability): 25 TF, growing to 100 TF and then 250 TF
- IBM Blue Gene: BG/L at LLNL reaches 360 TF with 128K cores; an IBM BG system at ANL (TBD) would achieve a petaflop with ~500K cores
Estimating the petaflop scale - a possible ORNL scenario: ~25K sockets with 4 cores each, i.e. ~100K CPUs, reaching 1000 TF.
Whatever happens, we will have to deal with ~100K cores on the petaflop-scale systems at the end of the decade!

5 Characteristics of Computational Nanoscience
- Interdisciplinary (like most science in the 21st century): builds on established domains like physics, chemistry, materials science, and biology (legacy codes).
- High-performance computing will be a key component, providing many opportunities.
- Computer architectures are increasingly complex and specialized; it will take large teams to use them.
- Since nanoscience is still an emerging field, computational nanoscience has to be extensible and reconfigurable.

6 Large Scientific User Facilities
Neutron reflectometer (facility + instrumentation: ultra-high vacuum station, sample) -> users: high-impact science.
Computational end-station for nano- & materials science (facility + instrumentation: open-source repository, generic toolkit, unified I/O systems, optimized kernels; built by materials/chemistry/physics : math : computer science teams) -> HPC users: high-impact science.

7 Computational Endstation for Nanoscience
Step 1: end-station allocation on the NLCF {X1E: 300 Kh; XT3: 3.5 Mh; SGI Altix: unlimited} for high-impact projects
- High-temperature superconductivity (production) - Maier, Kent, Jarrell, Schulthess
- Spintronics (production) - Alvarez, Moreo, Dagotto, Schulthess
- Nanomagnetism (production) - Eisenbach, Nicholson, Stocks, Kent, Schulthess
- Physicochemical mechanism of mutating DNA under radiation (pilot) - Kent, Landman (UGA)
- Molecular electronics (pilot) - Bernholc et al. (NCSU)
Step 2: systematically evolve software into high-performance, stable, readily accessible instrumentation
- high-performance kernels
- generic toolkit for nanoscience (extending C++/STL)
- unified I/O system (XML based, incl. tools for accessibility from Fortran legacy codes)
- visualization
Step 3: integrate with the user program of ORNL's Center for Nanophase Materials Sciences (CNMS)

8 What do we really need to study FePt nanoparticles (and other nanosystems)?
Anisotropy-plus-Zeeman energy of a particle with moment $m$ at angle $\Theta$ in a field $H$ applied at angle $\Theta_H$:
$E = KV \sin^2\Theta - mH \cos(\Theta - \Theta_H)$
Take advantage of the (atomic) degrees of freedom $(\vec{s}_1, \vec{s}_2, \ldots, \vec{s}_N)$ in order to manipulate macroscopic properties:
$m = \frac{1}{N} \sum_i \vec{s}_i$
$F(T, m) = E(T, m) - k_B T \ln W(E, m)$
The last line is just $F = E - TS$ for the macrostate $(T, m)$, with the entropy taken from the number of spin configurations, $S = k_B \ln W(E, m)$: computing the density of states $W$ gives the free-energy surface at all temperatures at once.

9 The basic idea of our approach
$F(T, X) = E(T_0, X) - k_B T \ln W(E, X)$
Compute the energy with ab initio codes:
- LSMS: > 80% efficiency, runs on ~1000 units
- VASP: ~50% efficiency, runs on ~500 units
(1 unit = 1 core, 4 cores, ...)
Compute the density of states with the extended Wang-Landau method [Zhou, Schulthess, Torbrügge, and Landau, Phys. Rev. Lett. (2006)], with a driver steering LSMS and VASP.
With LSMS on a petaflop machine, the magnetic free-energy surface of a 500-atom nanoparticle becomes possible in 2009.
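To make the "driver" concrete, here is a minimal single-walker Wang-Landau sketch in Python. It is an illustration only, not the extended algorithm of the Zhou et al. paper: `energy` stands in for an LSMS/VASP evaluation, and `propose`, `e_bins`, and the flatness threshold are assumed, illustrative choices.

```python
import math
import random

def wang_landau(energy, propose, x0, e_bins, flatness=0.8, ln_f_final=1e-8):
    """Estimate ln W(E) over the bins e_bins with a single Wang-Landau walker.

    energy(x)  -> total energy of configuration x (here a stand-in for a
                  full LSMS or VASP run)
    propose(x) -> a trial configuration derived from x
    """
    n = len(e_bins) - 1
    ln_w = [0.0] * n              # running estimate of ln W(E)
    hist = [0] * n                # visit histogram for the flatness check
    ln_f = 1.0                    # modification factor ln f, refined over time

    def bin_of(e):                # energy -> bin index, None if outside window
        for i in range(n):
            if e_bins[i] <= e < e_bins[i + 1]:
                return i
        return None

    x = x0
    i = bin_of(energy(x0))
    assert i is not None, "x0 must lie inside the energy window"
    while ln_f > ln_f_final:
        x_new = propose(x)
        j = bin_of(energy(x_new))
        # accept the move with probability min(1, W(E_old) / W(E_new))
        if j is not None and random.random() < math.exp(min(0.0, ln_w[i] - ln_w[j])):
            x, i = x_new, j
        ln_w[i] += ln_f           # every visit raises ln W in the current bin
        hist[i] += 1
        if min(hist) > flatness * sum(hist) / n:   # histogram flat enough?
            hist = [0] * n
            ln_f *= 0.5                            # f -> sqrt(f)
    return ln_w
```

Once ln W has converged, F(T) follows from the formula above; in the real setup every call to `energy` is itself a large parallel ab initio run, which is why petaflop-scale resources matter.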

10 Kohn-Sham Density Functional Theory (DFT)
Self-consistent eigenproblem for $\{\epsilon_i\}$ and $\{\psi_i\}$:
$[-\tfrac{1}{2}\nabla^2 + V(r)]\,\psi_i = \epsilon_i \psi_i$, with $V = F[\rho] + \ldots$ and $\rho(r) = \sum_i |\psi_i(r)|^2$
Some terms are easy in reciprocal space, others are easy in real space. The Hamiltonian is conveniently evaluated in a plane-wave basis, $\psi = \sum_G C_G e^{iG \cdot r}$, using FFTs for the transformations.
Many codes: VASP, PWSCF/ESPRESSO, CPMD, PARATEC, CASTEP, QBOX, ABINIT, ...
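A compact NumPy sketch of that real-space/reciprocal-space split, glossing over normalization conventions and all PAW/pseudopotential terms; the inputs `g2` (|G|^2 on the reciprocal grid) and `v_r` (the local potential on the real-space grid) are assumed to be built elsewhere.

```python
import numpy as np

def apply_hamiltonian(c_g, g2, v_r):
    """Apply H = -1/2 laplacian + V to plane-wave coefficients C_G.

    c_g : complex coefficients on the reciprocal-space grid
    g2  : |G|^2 on the same grid (the kinetic term is diagonal here)
    v_r : local potential V(r) on the real-space grid
    """
    kinetic = 0.5 * g2 * c_g               # easy in reciprocal space
    psi_r = np.fft.ifftn(c_g)              # psi(r) = sum_G C_G e^{iG.r}
    potential = np.fft.fftn(v_r * psi_r)   # easy in real space
    return kinetic + potential

def density(orbital_coeffs):
    """rho(r) = sum_i |psi_i(r)|^2 from a list of orbital coefficient grids."""
    rho = 0.0
    for c_g in orbital_coeffs:
        psi_r = np.fft.ifftn(c_g)
        rho = rho + np.abs(psi_r) ** 2
    return rho
```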

11 Parallel FFT layouts
[Figure: plane waves chosen within the cutoff radius (sphere of diameter 2·Ecut) - a sparse basis in reciprocal (frequency) space - distributed either as 1 grid over four processors or as 4 grids on four processors.]
Different distribution methods can be combined in a hybrid parallelization. "All bands simultaneously" methods are essential: the FFTs are small but many, e.g. a ( )^3 grid.
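To show what distributing one grid over several processors looks like, here is a minimal transpose-based slab decomposition using mpi4py and NumPy. This is a generic sketch (assuming N is divisible by the number of ranks), not VASP's actual data layout.

```python
import numpy as np
from mpi4py import MPI

def parallel_fft3(local, comm):
    """3D FFT of an N x N x N grid distributed in z-slabs, one slab per rank.

    local : complex array of shape (N // P, N, N), this rank's z-planes,
            axes ordered (z, y, x).
    Returns the transform distributed in y-slabs, shape (N, N // P, N).
    """
    P = comm.Get_size()
    nz, N, _ = local.shape                 # nz = N // P

    # Stage 1: 2D FFTs over the locally complete (y, x) planes.
    a = np.fft.fftn(local, axes=(1, 2))

    # Stage 2: all-to-all transpose so every rank gets all z for its y-slab.
    # Chunk p of the send buffer is the y-block destined for rank p.
    ny = N // P
    send = np.ascontiguousarray(
        a.reshape(nz, P, ny, N).transpose(1, 0, 2, 3))   # (P, nz, ny, N)
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)
    # Blocks arrive ordered by source rank, i.e. in global z order.
    b = recv.reshape(P * nz, ny, N)        # axes (z, y, x), z now complete

    # Stage 3: 1D FFTs along the now-local z axis.
    return np.fft.fft(b, axis=0)
```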

12 VASP
- Very popular plane-wave DFT code
- PAW and ultrasoft-pseudopotential DFT
- Fortran 90, MPI, BLAS, LAPACK, SCALAPACK
- Canonical 3D parallel FFTs
- Here: no major surgery, no heroics, small diffs

13 Fe399Pt408 benchmark
- Exclude initialization; 3 iterations (50+ needed for convergence)
- Spin-polarized ferromagnetic solution; LDA
- Fe 8+ and Pt 10+ cores; orbitals per spin include unoccupied states
- PAW, 19.6 Ry / 268 eV cutoff
- 30.8 x 30.8 x 29.7 Angstrom supercell; 126x126x120 FFT grid (defaults)
- Gamma-point code (halved grid); ~ plane waves/orbital
- Hybrid parallel

14 Timings: Davidson
[Figure: time (s) vs. number of processors for X1E, XT3, and P690 - faster is better.]
At 256 processors: the X1E is 1.7x faster than the XT3 and 3.3x faster than the P690.

15 Timings: RMM-DIIS
[Figure: time (s) vs. number of processors for X1E and P690 - faster is better.]
At 256 processors: the X1E is 2.8x faster than the P690.

16 Profiling VASP on the Cray X1E
~20% of peak at 128 MSPs (whole application); ~25% BLAS+LAPACK, ~25% FFTs.
Two key problems:
- Scaling: limited by global linear algebra (eigenvector solutions in the subspace rotations)
- Single-processor (MSP on the Cray X1E) performance: limited by a short average vector length (33 at 128 MSPs)

17 Scaling
[Figure: time (s) for EDDIAG and ORTHCH vs. number of MSPs.]
Not a platform-specific problem: the turn-over is due to the SCALAPACK solve for all eigenvectors of a dense, diagonally dominant ~5000x5000 matrix - a small matrix compared to the number of processors.
We need improved algorithms and tuned code, e.g. iterative Jacobi methods (Ian Bush/Daresbury). Suggestions welcome!
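For reference, the textbook cyclic Jacobi iteration the slide alludes to, as a NumPy sketch for clarity - not Ian Bush's parallel implementation. Its appeal on many processors is that rotations touching disjoint row pairs can run concurrently; a production code would update only the affected rows and columns rather than forming the full rotation matrix as done here.

```python
import numpy as np

def jacobi_eigh(A, tol=1e-10, max_sweeps=30):
    """Diagonalize a real symmetric matrix A with cyclic Jacobi rotations."""
    n = A.shape[0]
    A = A.copy()
    V = np.eye(n)                          # accumulates the eigenvectors
    for _ in range(max_sweeps):
        if np.sqrt(np.sum(np.tril(A, -1) ** 2)) < tol:
            break                          # off-diagonal norm small: done
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-30:
                    continue
                # 2x2 rotation angle that zeroes A[p, q]
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)              # full matrix only for clarity
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J
                V = V @ J
    return np.diag(A), V                   # eigenvalues, eigenvectors (columns)
```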

18 Single-MSP performance
BLAS performs well (>10 GFLOP/s); performance is limited by short vector lengths in the FFTs (generic) and in the real-space pseudopotential evaluation (code-specific). Here we focus on FFT performance.
The current code is poorly structured for the X1E - there is no easy access to lots of data at once - and this is common to other DFT codes:
- In the PAW method, the FFT dimensions are small
- No blocking or multiple transforms
- No exploitation of data locality, e.g. when data sits on one processor
- Explicitly MPI; awkward to insert CAF

19 Plane-wave FFT module
Today: application code -> 3D FFT -> 1D FFT + MPI.
Future: application code -> multiple-FFT module -> N-D FFT -> MPI or ?? (see the sketch below).
Advantage: other DFT codes benefit as well. Norm Troullier (Cray) has written vectorized multi-streaming FFTs - not connected yet.
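The interface change is easy to sketch: instead of one small 3D FFT per orbital, the application hands the module all bands at once, so the library sees many independent transforms (and, on a vector machine, long vectors). A NumPy illustration of the two call styles; the function names are made up for the example.

```python
import numpy as np

def fft_orbitals_one_by_one(coeffs):
    """Today: one small 3D FFT per orbital - short vectors, little reuse."""
    return [np.fft.ifftn(c) for c in coeffs]

def fft_orbitals_batched(coeffs):
    """A 'multiple FFT' interface: one call transforms every band at once.

    coeffs : array of shape (n_bands, nx, ny, nz); batching over the band
    axis is what exposes the parallelism to the FFT library.
    """
    return np.fft.ifftn(coeffs, axes=(1, 2, 3))
```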

20 Software development: direct user-community involvement - the natural but also modern path
- Today: application codes sit directly on basic libraries (BLAS, FFT, etc.)
- Current research (Ψ-Mag, ALPS; using Cray, BG/L): generic toolkits and optimized kernels underneath the application codes
- Future: a common, XML-based I/O system (prototype) shared with the user community and other software frameworks - a combination of user-developed code and a code repository
Components: high-performance kernels; a generic toolkit for nanoscience (extending C++/STL); a unified I/O system (XML based, incl. tools for accessibility from Fortran legacy codes); visualization.
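A toy illustration of what such a unified XML-based I/O layer could look like, using only the Python standard library; the schema (`nanoscience_run`, `param`, `moments`) is hypothetical, and a real system would add the Fortran-accessible reader the slide mentions.

```python
import xml.etree.ElementTree as ET

def write_run(filename, params, moments):
    """Write one run's parameters and per-atom moments as self-describing XML."""
    root = ET.Element("nanoscience_run")
    p = ET.SubElement(root, "parameters")
    for key, value in params.items():
        ET.SubElement(p, "param", name=key).text = str(value)
    m = ET.SubElement(root, "moments", unit="mu_B")
    for i, mu in enumerate(moments):
        ET.SubElement(m, "atom", index=str(i)).text = "%.4f" % mu
    ET.ElementTree(root).write(filename)

# e.g.: write_run("fept_run.xml", {"code": "VASP", "cutoff_eV": "268"}, [2.31, 2.28, 0.41])
```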

21 Results
[Figure: the 807-atom particle, atoms colored by moment.]
807 atoms, 128 MSPs, 600 CPU hours: ferromagnetic electronic structure + forces, at publishable accuracy.

22 Strong size effects in magnetic moments
[Figure: magnetic moment vs. cluster size for 43-, 55-, and 201-atom clusters.]
Clear non-bulk behavior in small clusters. But: AFM or ferrimagnetic states are lowest in energy, by O(10 meV/atom), for relaxed geometries.

23 [Figure: magnetic moment (μ_B) vs. distance from centre (Å) for the 807-atom particle, fully relaxed vs. bulk geometry, with Fe-bulk and Pt-bulk reference lines.]
Near-surface Fe atoms have an enhanced moment. Relaxations can be significant - AF spins appear!

24 Proton transfer in H2O on TiO2 - interpretation of quasielastic neutron scattering
VASP runs with ~1000 atoms reaching ~10 ns to study proton transfer in water on TiO2 (a CNMS user project by Jorge Sofo); calculations turn around in about one month.

25 Nanoscience end-station
Supported and maintained by the NTI of the CNMS.
Capability computing: LCF Cray supercomputers. Capacity computing: a multi-teraflop Beowulf cluster and an allocation at NERSC.
Supported capabilities:
- MD, MC (flexible models with the Ψ-Mag toolkit)
- QMC, quantum cluster methods, Hubbard and spin-fermion models
- DFT (LDA, SIC-LSD): VASP and other electronic structure codes
- Future plans: D-QMC, AF-QMC
Available to users via CNMS user projects.

26 The team / collaborators
End-station concept:
- ORNL: Peter Cummings (CNMS), Malcolm Stocks (MST)
- Ames Lab: Bruce Harmon
FePt nanoparticles and generalized Wang-Landau:
- ORNL: Paul Kent (CSMD), Chenggang Zhou (CNMS), Mark Fahey (NCCS), Don Nicholson (CSMD), Markus Eisenbach (MST)
- Univ. of Georgia at Athens: David Landau
- Cray: Nathan Wichmann, Norm Troullier, Jeff Larkin, and John Levesque
Software infrastructure:
- ORNL: Mike Summers (CSED), Xiuping Tao (CSM) - and the above
- Florida State: Greg Brown
- Univ. of Tennessee: Tom Swain and Kirk Sayer

27 Acknowledgment
This research was conducted at the Center for Nanophase Materials Sciences, which is sponsored by the Division of Scientific User Facilities of the United States Department of Energy (DOE). It was supported in part by the Laboratory Directed Research and Development fund at ORNL. The research was enabled by computational resources of the National Center for Computational Sciences, which is sponsored by the Office of Advanced Scientific Computing Research.

