S Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems
Transcription
1 S Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems. Elmar Westphal - Forschungszentrum Jülich GmbH
2 Contents: Micromagnetism; TetraMag, a FEM/BEM micromagnetism simulator; Porting TetraMag to CUDA; Porting TetraMag to multi-GPU CUDA; Benchmarks
3 Micromagnetism: In micromagnetism, we investigate the structure of ferromagnetic domains; the structure, dynamics and motion of domain walls; the structure and dynamics of magnetic vortices; spin waves, etc. As a mesoscopic theory, it can provide a link between simulation and experiment.
4 Magnetism on Different Length Scales (magnetic nanostructures). Macroscopic models: hysteresis models, response to external parameters. Domain theory: subdivision into domains; details of the magnetic structure are neglected. Micromagnetism: continuum theory, domain walls, vortices. Heisenberg model: atomistic effects, spin chains. Quantum theory: electronic structure.
5 Magnetism on Different Time Scales [figure only]
6 Most Recent Achievement: Discovery of the Spin Cherenkov Effect [1] (the magnetic equivalent of the sonic boom). Geometry: 2 µm x 1 µm x 1 µm Permalloy prism at 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes). [1] M. Yan, A. Kákay, C. Andreas, & R. Hertel. Spin Cherenkov effect and magnonic Mach cones. Physical Review B, Rapid Communications, 88, (R) (2013)
7 About TetraMag: Code started by Riccardo Hertel [1], extended and ported by Attila Kakay and Elmar Westphal [2][3]. Upcoming: details about calculation steps, matrix properties, older CUDA versions, and new challenges. [1] Hertel, R. (2001). Micromagnetic simulations of magnetostatically coupled Nickel nanowires. Journal of Applied Physics, 90(11), [2] Kakay, A., Westphal, E., & Hertel, R. (2010). Speedup of FEM micromagnetic simulations with graphical processing units. Magnetics, IEEE Transactions on, 46(6), [3] MagnetizationDynamics/Simulations/_node.html
8 Calculation Steps: (1) Calculate the magnetostatic field from a scalar potential, split into two parts, U = U1 + U2: U1 is the solution of the inhomogeneous Neumann problem; U2 satisfies Laplace's equation with Dirichlet boundary conditions. (2) Solve/integrate the Landau-Lifshitz-Gilbert equation of motion.
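For reference, the quantities named on this slide can be written out as below. This is the common textbook form (Gilbert form, SI units), not something taken from the slides; the symbols \gamma (gyromagnetic ratio), \alpha (Gilbert damping), M_s (saturation magnetisation), H_d (magnetostatic field) and H_eff (effective field) are the usual conventions and are assumptions here.

```latex
\mathbf{H}_\mathrm{d} = -\nabla U, \qquad U = U_1 + U_2, \qquad \nabla^2 U_2 = 0

\frac{\partial \mathbf{M}}{\partial t}
  = -\gamma \, \mathbf{M} \times \mathbf{H}_\mathrm{eff}
  + \frac{\alpha}{M_s} \, \mathbf{M} \times \frac{\partial \mathbf{M}}{\partial t}
```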
9 Magnetostatic Field: Calculated in 3 steps: (1) iterative solution of a sparse linear system for U1; (2) dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk); (3) iterative solution of a sparse linear system for U2 within the magnetic region. Sparse systems are solved using multi-GPU linbcg and bicgstab solvers.
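To make the role of the sparse matrix-vector product in these solvers concrete, here is a minimal single-process sketch of an unpreconditioned BiCGSTAB iteration on a CSR matrix. It is plain Python and not TetraMag's implementation; the function names (`csr_matvec`, `bicgstab`) and the small test matrix are illustrative assumptions. Each iteration calls the matrix-vector product twice, which is why the talk concentrates on that operation.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A x for a matrix stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bicgstab(matvec, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCGSTAB starting from x = 0 (illustrative sketch)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)               # residual b - A x for x = 0
    r_hat = list(r)           # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = dot(b, b) ** 0.5 or 1.0
    for _ in range(max_iter):
        rho_new = dot(r_hat, r)
        if rho_new == 0.0:
            break             # breakdown; a production solver would restart
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(p)                      # first matrix-vector product
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 / b_norm < tol:
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            break
        t = matvec(s)                      # second matrix-vector product
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if dot(r, r) ** 0.5 / b_norm < tol:
            break
        rho = rho_new
    return x


# Small asymmetric example: A = [[4,1,0],[2,4,1],[0,2,4]], exact solution [1,2,3].
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, 1.0, 2.0, 4.0, 1.0, 2.0, 4.0]
rhs = [6.0, 13.0, 16.0]
solution = bicgstab(lambda vec: csr_matvec(indptr, indices, data, vec), rhs)
```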
10 Time Integration: Includes field calculations and vector operations. Performed using CVODE from the SUNDIALS package. CVODE's NVector structure and its operations have been ported to CUDA on single-host multi-GPU systems. Memory-consuming (~1 KB/node, the limiting factor for GPU usage): field calculations use a sparse matrix and several field vectors, and CVODE internally needs many helper vectors.
11 Properties of TetraMag's Sparse Matrices: Contain ~15 non-zero elements per row/column. Distribution of elements depends on the underlying mesh (regular to seemingly random). Symmetric (for magnetostatic field calculation) or asymmetric (for exchange field calculation). Static over the whole program run.
12 TetraMagCUDA: Around since ~2009 (see GTC 2010 poster [1]). CUDA parts were piggy-backed onto CPU routines. Tries to copy as many sparse matrices to device memory as possible; the remainder stays on the CPU. GPU-only execution is limited to problem sizes of ~1M nodes, which was sufficient for most use cases at the time. [1] /Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf
13 New Challenges: Simulations at experimental scale (µm) require larger simulations (10M+ nodes). A single matrix plus the solver/integrator vectors often exceeds device memory, so copying matrices sequentially is not sufficient. Possible solutions: reduce the memory footprint; distribute the problem over multiple GPUs.
14 Reduce Memory Footprint: Use symmetry: effective, but not possible for all matrices; often needs atomic operations (slow). Reduce precision of off-diagonal elements: moderately effective (8+4 -> 4+4 bytes/value); may lead to unacceptable loss of precision.
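A rough sketch of both ideas, in plain Python rather than CUDA. It assumes that the "8+4" and "4+4" figures mean bytes for the value plus bytes for the column index; the function names and the tiny matrix are illustrative. The symmetric product stores only entries with col >= row, and the mirrored update is the scattered write that would need an atomic add when many GPU threads run concurrently.

```python
def storage_bytes(n_rows, nnz_per_row, value_bytes, index_bytes):
    """Approximate storage for the non-zeros of a sparse matrix."""
    return n_rows * nnz_per_row * (value_bytes + index_bytes)


def symmetric_matvec(n, upper_entries, x):
    """y = A x where only entries with col >= row are stored as (row, col, value)."""
    y = [0.0] * n
    for row, col, value in upper_entries:
        y[row] += value * x[col]
        if col != row:
            # Mirrored entry A[col][row]: a write to a different row of y.
            # On a GPU this update would have to be an atomic add.
            y[col] += value * x[row]
    return y


# 10M rows with ~15 non-zeros each (the figures quoted in the talk).
double_precision = storage_bytes(10_000_000, 15, 8, 4)
single_precision = storage_bytes(10_000_000, 15, 4, 4)

# A = [[4,1,0],[1,4,1],[0,1,4]] stored as its upper triangle.
upper = [(0, 0, 4.0), (0, 1, 1.0), (1, 1, 4.0), (1, 2, 1.0), (2, 2, 4.0)]
y = symmetric_matvec(3, upper, [1.0, 2.0, 3.0])
```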
15 Reduce Memory Footprint (cont.): Extract diagonals: good for performance (coalesced access); very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes); mixed results otherwise. Combinations of some or all of the above are possible.
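Diagonal extraction can be sketched as follows. This is an illustrative plain-Python version, not the talk's code: the `min_fill` threshold that decides when a diagonal is dense enough to extract is an assumption, and a symmetric matrix could additionally store each off-diagonal only once. Extracted diagonals need no column indices, because the column follows from row + offset.

```python
def extract_diagonals(n, entries, min_fill=0.75):
    """Split (row, col, value) entries into dense diagonals and a sparse remainder."""
    by_offset = {}
    for row, col, value in entries:
        by_offset.setdefault(col - row, []).append((row, col, value))
    diagonals, remainder = {}, []
    for offset, items in by_offset.items():
        length = n - abs(offset)            # number of slots on this diagonal
        if len(items) >= min_fill * length:
            dense = [0.0] * n               # indexed by row, zero-padded
            for row, _, value in items:
                dense[row] = value
            diagonals[offset] = dense
        else:
            remainder.extend(items)
    return diagonals, remainder


def diagonal_matvec(n, diagonals, remainder, x):
    y = [0.0] * n
    for offset, dense in diagonals.items():
        # Consecutive rows read consecutive entries of dense and x.
        for row in range(max(0, -offset), min(n, n - offset)):
            y[row] += dense[row] * x[row + offset]
    for row, col, value in remainder:
        y[row] += value * x[col]
    return y


# Tridiagonal 4x4 matrix plus one stray entry at (0, 2).
entries = [(0, 0, 4.0), (0, 1, 1.0), (0, 2, 0.5),
           (1, 0, 1.0), (1, 1, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 4.0), (2, 3, 1.0),
           (3, 2, 1.0), (3, 3, 4.0)]
diagonals, remainder = extract_diagonals(4, entries)
y = diagonal_matvec(4, diagonals, remainder, [1.0, 2.0, 3.0, 4.0])
```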
16 Preprocessing Approaches for Matrix Distribution: Matrices can be preprocessed for the GPU(s) in different ways: traditional single-GPU calculation; naive multi-GPU distribution; checkerboard-style distribution.
17-18 Traditional MVM on a Single GPU [figure: matrix-vector multiplication on a single GPU]
19 Distribution over Multiple GPUs, naive approach: Divide the matrix into N_GPU sub-matrices with N_row/N_GPU rows each. Copy one sub-matrix to each GPU. Copy the vector to all GPUs. Perform the partial multiplications. Copy the partial results to all other GPUs. Repeat (if needed).
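The steps above can be sketched as a sequential simulation in plain Python. The "GPUs" here are just list slices and the helper names are illustrative assumptions, but the data movement mirrors the slide: every device gets a full copy of the vector, and the partial results are gathered into a full result vector afterwards.

```python
def split_rows(n_rows, n_gpus):
    """Return one (start, stop) row range per GPU, as evenly sized as possible."""
    base, extra = divmod(n_rows, n_gpus)
    ranges, start = [], 0
    for gpu in range(n_gpus):
        stop = start + base + (1 if gpu < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def naive_multi_gpu_matvec(rows, x, n_gpus):
    """rows[i] is a list of (col, value) pairs for matrix row i."""
    ranges = split_rows(len(rows), n_gpus)
    # "Copy vector to all GPUs": each device holds the complete vector.
    device_x = [list(x) for _ in range(n_gpus)]
    partial = []
    for gpu, (start, stop) in enumerate(ranges):
        # "Perform partial multiplications" on this device's row block.
        partial.append([sum(value * device_x[gpu][col] for col, value in rows[i])
                        for i in range(start, stop)])
    # "Copy partial results to all other GPUs": assemble the full result.
    return [value for part in partial for value in part]


rows = [[(0, 4.0), (1, 1.0), (2, 0.5)],
        [(0, 1.0), (1, 4.0), (2, 1.0)],
        [(1, 1.0), (2, 4.0), (3, 1.0)],
        [(2, 1.0), (3, 4.0)]]
y = naive_multi_gpu_matvec(rows, [1.0, 2.0, 3.0, 4.0], n_gpus=2)
```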
20-23 Naive Distribution, Preparation and First Multiplication [figure, built up in steps: copy partial matrices to other GPUs; copy vectors to other GPUs; calculate]
24-26 Naive Distribution, Subsequent Multiplications [figure, built up in steps: copy vectors to other GPUs; calculate]
27 Naive Distribution Approach. Pro: easy to implement. Con: all sub-matrices need vector data from all GPUs at the beginning of the calculation; the data transfer overhead is about as expensive as the actual calculation, so performance is often below that of the single-GPU solution.
28 Checkerboard Approach: Split each sub-matrix into N_GPU sub-sub-matrices; each of these needs vector values from only one GPU. Perform the multiplication of the first sub-sub-matrix with its partial vector. At the same time, copy the vector part needed for the next multiplication into a (double) buffer in a different stream. Repeat for the other sub-sub-matrices.
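A sequential sketch of the checkerboard scheme in plain Python. Two points are assumptions rather than statements from the talk: each GPU starts with the sub-sub-matrix that needs its own vector part, and the sources are then visited in round-robin order. The copy into the second buffer is written before the multiplication to mark where an asynchronous transfer in another stream would overlap with the compute kernel.

```python
def checkerboard_blocks(rows, ranges):
    """blocks[g][s]: entries of GPU g's rows whose columns are owned by GPU s,
    stored as (local_row, local_col, value)."""
    n_gpus = len(ranges)
    owner = {}
    for gpu, (start, stop) in enumerate(ranges):
        for col in range(start, stop):
            owner[col] = gpu
    blocks = [[[] for _ in range(n_gpus)] for _ in range(n_gpus)]
    for gpu, (start, _) in enumerate(ranges):
        for row in range(start, ranges[gpu][1]):
            for col, value in rows[row]:
                src = owner[col]
                blocks[gpu][src].append((row - start, col - ranges[src][0], value))
    return blocks


def checkerboard_matvec(blocks, ranges, x_parts):
    """x_parts[g] is the part of the vector owned by GPU g."""
    n_gpus = len(ranges)
    y_parts = []
    for gpu in range(n_gpus):
        y = [0.0] * (ranges[gpu][1] - ranges[gpu][0])
        buffers = [x_parts[gpu], None]      # double buffer: current / next
        for step in range(n_gpus):
            source = (gpu + step) % n_gpus
            if step + 1 < n_gpus:
                # Stand-in for the asynchronous copy of the next vector part.
                buffers[(step + 1) % 2] = list(x_parts[(source + 1) % n_gpus])
            current = buffers[step % 2]
            for local_row, local_col, value in blocks[gpu][source]:
                y[local_row] += value * current[local_col]
        y_parts.append(y)
    return y_parts


rows = [[(0, 4.0), (1, 1.0), (2, 0.5)],
        [(0, 1.0), (1, 4.0), (2, 1.0)],
        [(1, 1.0), (2, 4.0), (3, 1.0)],
        [(2, 1.0), (3, 4.0)]]
ranges = [(0, 2), (2, 4)]
blocks = checkerboard_blocks(rows, ranges)
y_parts = checkerboard_matvec(blocks, ranges, [[1.0, 2.0], [3.0, 4.0]])
```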
29-33 Checkerboard-Style MVM [figure: partial matrix on each GPU, the GPU whose vector data each sub-sub-matrix needs, partial vectors, double buffer, partial results]
34 Matrix Preprocessing: The original matrix is in a CSR-like format. Blocks of N_warpsize rows are transposed to enable coalesced memory access. Distribution of the data destroys the uniformity of row lengths, so zero padding may be necessary (wasted memory). Rows are therefore sorted and re-indexed by number of non-zero elements, which keeps padding minimal.
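The effect of sorting rows before padding can be illustrated with a small calculation. This sketch assumes each block of rows is padded to the length of its longest row; a block size of 4 stands in for the warp size, and the row lengths are made-up values.

```python
def padded_slots(row_lengths, block_size):
    """Storage slots used when every block of rows is padded to its longest row."""
    total = 0
    for start in range(0, len(row_lengths), block_size):
        block = row_lengths[start:start + block_size]
        total += max(block) * len(block)
    return total


def sort_rows_by_length(row_lengths):
    """Permutation (new position -> original row index), longest rows first."""
    return sorted(range(len(row_lengths)),
                  key=lambda row: row_lengths[row], reverse=True)


# Hypothetical row lengths after distributing a matrix over several GPUs.
row_lengths = [3, 15, 4, 14, 2, 16, 3, 15]
permutation = sort_rows_by_length(row_lengths)
sorted_lengths = [row_lengths[row] for row in permutation]

non_zeros = sum(row_lengths)
unsorted_cost = padded_slots(row_lengths, block_size=4)
sorted_cost = padded_slots(sorted_lengths, block_size=4)
```

The permutation has to be kept, because the result vector entries must be written back to their original row positions.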
35 Optimizing Data Transfers: Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements. The multiplication time is then short, while the transfer of potentially unneeded elements takes much longer. Solution: transfer only those elements that are really needed.
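Which elements are "really needed" can be read off the column indices of each sub-sub-matrix. The following plain-Python sketch (function name and data layout are illustrative assumptions) collects, for every pair of GPUs, the vector elements one GPU has to receive from the other.

```python
def needed_columns(rows, ranges):
    """needed[g][s]: sorted global columns owned by GPU s that GPU g's rows read.
    rows[i] is a list of (col, value) pairs; ranges[g] is GPU g's (start, stop)."""
    n_gpus = len(ranges)

    def owner(col):
        for gpu, (start, stop) in enumerate(ranges):
            if start <= col < stop:
                return gpu
        raise ValueError(f"column {col} is not owned by any GPU")

    needed = [[set() for _ in range(n_gpus)] for _ in range(n_gpus)]
    for gpu, (start, stop) in enumerate(ranges):
        for row in range(start, stop):
            for col, _ in rows[row]:
                needed[gpu][owner(col)].add(col)
    return [[sorted(cols) for cols in per_gpu] for per_gpu in needed]


rows = [[(0, 4.0), (1, 1.0), (2, 0.5)],
        [(0, 1.0), (1, 4.0), (2, 1.0)],
        [(1, 1.0), (2, 4.0), (3, 1.0)],
        [(2, 1.0), (3, 4.0)]]
needed = needed_columns(rows, [(0, 2), (2, 4)])
```

In this example each GPU needs only one of the two elements held by the other, so half of a full vector-part transfer would be wasted.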
36 Creating Export Vectors: Each GPU needs an individual set of vector elements from every other GPU. Approach one: create N_GPU x (N_GPU-1) export vectors; this consumes much memory. Approach two: rewrite the export vector N_GPU x (N_GPU-1) times; this may need to read/write all elements N_GPU x (N_GPU-1) times. Approach three: do some (lots of) work in preprocessing.
37 Building the Perfect(?) Set of Export Vectors. During preprocessing: Build a key value for each vector element from the set of other GPUs that need it. Using one bit for each target GPU results in at most 2^(N_GPU-1) key values (usually N_GPU <= 8 in a node). Use the key values as sort key and re-index in sorted order. Store block offsets for the keys.
38 Building the Perfect(?) Set of Export Vectors (cont.). During multiplication: Build the export vector according to the index generated in preprocessing. During the loop over all sub-sub-matrices: loop over all export blocks and asynchronously copy the blocks with elements needed from the next GPU into the buffer.
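The key construction can be sketched in plain Python as follows. The bit assignment (target - origin - 1) mod N_GPU is inferred from the example values on the key-value slides and should be read as an assumption; all function and variable names are illustrative. Elements with key 0 are needed by no other GPU and are left out of the export vector.

```python
def relative_bit(origin, target, n_gpus):
    """Bit that marks `target` in the keys of elements owned by `origin`."""
    return 1 << ((target - origin - 1) % n_gpus)


def build_export_index(origin, n_gpus, local_size, needed_by):
    """needed_by[target] is the set of local element indices that `target` reads."""
    keys = [0] * local_size
    for target, elements in needed_by.items():
        for element in elements:
            keys[element] |= relative_bit(origin, target, n_gpus)
    # Sort exported elements by key so equal keys form contiguous blocks.
    export_order = sorted((e for e in range(local_size) if keys[e]),
                          key=lambda e: keys[e])
    block_offsets = {}
    for position, element in enumerate(export_order):
        block_offsets.setdefault(keys[element], position)
    return keys, export_order, block_offsets


def blocks_for_target(origin, target, n_gpus, export_length, block_offsets):
    """(start, stop) ranges of the export vector that must be copied to `target`."""
    bit = relative_bit(origin, target, n_gpus)
    starts = sorted(block_offsets.items(), key=lambda item: item[1])
    ranges = []
    for i, (key, start) in enumerate(starts):
        stop = starts[i + 1][1] if i + 1 < len(starts) else export_length
        if key & bit:
            ranges.append((start, stop))
    return ranges


# GPU 0 of 4 owns three elements; GPU 1 reads 0 and 1, GPU 2 reads 1, GPU 3 reads 2.
keys, export_order, block_offsets = build_export_index(
    origin=0, n_gpus=4, local_size=3, needed_by={1: {0, 1}, 2: {1}, 3: {2}})
to_gpu1 = blocks_for_target(0, 1, 4, len(export_order), block_offsets)
to_gpu2 = blocks_for_target(0, 2, 4, len(export_order), block_offsets)
to_gpu3 = blocks_for_target(0, 3, 4, len(export_order), block_offsets)
```

Each element is stored once in the export vector, and a target GPU receives only the blocks whose key contains its bit.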
39-43 Key values are calculated relative to the originating GPU. [figure: example matrix with per-GPU key values] Vector elements accessed in cols 0-2 are originally on GPU 0; relative key binary values: 0, 1 (001), 2 (010), 4 (100); example keys shown: 4, 3. Cols 6-8 are originally on GPU 2; relative key binary values: 2 (010), 4 (100), 0, 1 (001); example key shown: 5.
44-50 Build export buffer during preprocessing: keys from the matrix for GPU 0 are reordered to the resulting export index. [figure: tables with columns index, ref. key, export ref. key, binary; values not recoverable] Blocks sent to export streams during the multiplication loop: exported to GPU 1 (bit 001); exported to GPU 2 (bit 010); exported to GPU 3 (bit 100).
51 Pros and Cons. Pro: the export vector is built only once per multiplication; no element needs to be stored more than once; no element is transferred more often than necessary. Con: limited number of GPUs, because the number of blocks grows exponentially; time-consuming preprocessing.
52 Further Optimisations: Sub-sub-matrix multiplications are ordered by size of the matrix (number of non-zero elements), which makes it more likely that larger transfers coincide with longer calculations. Export vectors are copied to host memory during the initial calculations, which allows parallel import on devices with only one copy engine.
53 Preprocessing Time: Preprocessing involves multiple expensive indexing and sorting steps. It may take up to seconds per matrix (with n ~20M) and depends on the number of GPUs used. It happens only once, because the matrices are static, while a typical run includes millions of solver iterations/matrix multiplications.
54 Benchmarks: Solver iteration times for: a cube with regular mesh, very diagonal-dominant matrices, little data transfer; and two problems of similar size and nature: a round disk with irregular mesh structure, few extractable diagonals, lots of data transfer, and a round disk with partially regular mesh structure, some extractable diagonals, moderate data transfer.
55 Time per solver integration for cube 1 µm (8.1M nodes, regular mesh). Comparison: 2 x E (10 cores, 2.8 GHz): 134 ms. [plot: time per solver iteration [ms] vs. number of GPUs; series: Titan, GTX]
56 Time per solver integration for disk (6.x M nodes, different mesh structures). Comparison: 2 x E (10 cores, 2.8 GHz): 100 ms reg., 118 ms irreg. [plot: time per solver iteration [ms] vs. number of GPUs; series: Titan irreg. mesh; GTX 690 irreg. mesh; GTX 690, 1 GPU/card, irreg. mesh; Titan reg. mesh; GTX 690 reg. mesh]
57 Conclusions: Distribution of matrices and vectors over multiple GPUs allows us to simulate significantly larger samples. Performance scaling depends largely on the amount of data exchanged between GPUs. Optimising the mesh structure is very important in multi-GPU setups.
58 Questions?
More informationarxiv: v1 [physics.comp-ph] 22 Nov 2012
A Customized 3D GPU Poisson Solver for Free BCs Nazim Dugan a, Luigi Genovese b, Stefan Goedecker a, a Department of Physics, University of Basel, Klingelbergstr. 82, 4056 Basel, Switzerland b Laboratoire
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationTowards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters
Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM
More informationPerformance Analysis of Lattice QCD Application with APGAS Programming Model
Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models
More informationIntroduction to Quantum Information Technologies. B.M. Terhal, JARA-IQI, RWTH Aachen University & Forschungszentrum Jülich
Introduction to Quantum Information Technologies B.M. Terhal, JARA-IQI, RWTH Aachen University & Forschungszentrum Jülich How long can we store a bit Hieroglyphs carved in sandstone at the Luxor Temple
More informationMicromagnetic simulations of magnetization reversal. in Co/Ni multilayers
16 May 2001 Micromagnetic simulations of magnetization reversal in Co/Ni multilayers V. D. Tsiantos a, T. Schrefl a, D. Suess a, W. Scholz a, J. Fidler a, and J. M. Gonzales b a Vienna University of Technology,
More informationOptimizing Time Integration of Chemical-Kinetic Networks for Speed and Accuracy
Paper # 070RK-0363 Topic: Reaction Kinetics 8 th U. S. National Combustion Meeting Organized by the Western States Section of the Combustion Institute and hosted by the University of Utah May 19-22, 2013
More informationMassively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling
2019 Intel extreme Performance Users Group (IXPUG) meeting Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationParallelization of the QC-lib Quantum Computer Simulator Library
Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/
More informationFinite Difference Methods (FDMs) 1
Finite Difference Methods (FDMs) 1 1 st - order Approxima9on Recall Taylor series expansion: Forward difference: Backward difference: Central difference: 2 nd - order Approxima9on Forward difference: Backward
More informationCalculating Frobenius Numbers with Boolean Toeplitz Matrix Multiplication
Calculating Frobenius Numbers with Boolean Toeplitz Matrix Multiplication For Dr. Cull, CS 523, March 17, 2009 Christopher Bogart bogart@eecs.oregonstate.edu ABSTRACT I consider a class of algorithms that
More informationThe Solution of a FEM Equation in Frequency Domain Using a Parallel Computing with CUBLAS
The Solution of a FEM Equation in Frequency Domain Using a Parallel Computing with CUBLAS R. Dominguez 1, A. Medina 1, and A. Ramos-Paz 1 1 Facultad de Ingeniería Eléctrica, División de Estudios de Posgrado,
More informationFINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING
FINDING PARALLELISM IN GENERAL-PURPOSE LINEAR PROGRAMMING Daniel Thuerck 1,2 (advisors Michael Goesele 1,2 and Marc Pfetsch 1 ) Maxim Naumov 3 1 Graduate School of Computational Engineering, TU Darmstadt
More informationHydra. A. Augusto Alves Jr and M.D. Sokoloff. University of Cincinnati
Hydra A library for data analysis in massively parallel platforms A. Augusto Alves Jr and M.D. Sokoloff University of Cincinnati aalvesju@cern.ch Presented at the Workshop Perspectives of GPU computing
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationImage Reconstruction And Poisson s equation
Chapter 1, p. 1/58 Image Reconstruction And Poisson s equation School of Engineering Sciences Parallel s for Large-Scale Problems I Chapter 1, p. 2/58 Outline 1 2 3 4 Chapter 1, p. 3/58 Question What have
More informationIntroduction. HFSS 3D EM Analysis S-parameter. Q3D R/L/C/G Extraction Model. magnitude [db] Frequency [GHz] S11 S21 -30
ANSOFT Q3D TRANING Introduction HFSS 3D EM Analysis S-parameter Q3D R/L/C/G Extraction Model 0-5 -10 magnitude [db] -15-20 -25-30 S11 S21-35 0 1 2 3 4 5 6 7 8 9 10 Frequency [GHz] Quasi-static or full-wave
More informationParallelization of the QC-lib Quantum Computer Simulator Library
Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer September 9, 23 PPAM 23 1 Ian Glendinning / September 9, 23 Outline Introduction Quantum Bits, Registers
More informationUnidirectional spin-wave heat conveyer
Unidirectional spin-wave heat conveyer Figure S1: Calculation of spin-wave modes and their dispersion relations excited in a 0.4 mm-thick and 4 mm-diameter Y 3 Fe 5 O 12 disk. a, Experimentally obtained
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationAPPARC PaA3a Deliverable. ESPRIT BRA III Contract # Reordering of Sparse Matrices for Parallel Processing. Achim Basermannn.
APPARC PaA3a Deliverable ESPRIT BRA III Contract # 6634 Reordering of Sparse Matrices for Parallel Processing Achim Basermannn Peter Weidner Zentralinstitut fur Angewandte Mathematik KFA Julich GmbH D-52425
More informationProblem. Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26
Binary Search Introduction Problem Problem Given a dictionary and a word. Which page (if any) contains the given word? 3 / 26 Strategy 1: Random Search Randomly select a page until the page containing
More informationFINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION
FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros
More informationJanus: FPGA Based System for Scientific Computing Filippo Mantovani
Janus: FPGA Based System for Scientific Computing Filippo Mantovani Physics Department Università degli Studi di Ferrara Ferrara, 28/09/2009 Overview: 1. The physical problem: - Ising model and Spin Glass
More informationOverview: Synchronous Computations
Overview: Synchronous Computations barriers: linear, tree-based and butterfly degrees of synchronization synchronous example 1: Jacobi Iterations serial and parallel code, performance analysis synchronous
More informationGPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic
GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago
More information6. Iterative Methods for Linear Systems. The stepwise approach to the solution...
6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse
More information