Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems


1 Subdivide, Preprocess and Conquer: Micromagnetism FEM/BEM-Simulations on Single-Node/Multi-GPU Systems. Elmar Westphal, Forschungszentrum Jülich GmbH

2 Contents
- Micromagnetism
- TetraMag, a FEM/BEM Micromagnetism Simulator
- Porting TetraMag to CUDA
- Porting TetraMag to multi-GPU CUDA
- Benchmarks

3 Micromagnetism
In micromagnetism, we investigate:
- the structure of ferromagnetic domains
- the structure, dynamics and motion of domain walls
- the structure and dynamics of magnetic vortices
- spin waves, etc.
As a mesoscopic theory, it can provide a link between simulation and experiment.

4 Magnetism on Different Length Scales (from coarse to fine; micromagnetism is the scale of magnetic nanostructures)
- Macroscopic models: hysteresis models, response to external parameters
- Domain theory: subdivision into domains, details of the magnetic structure are neglected
- Micromagnetism: continuum theory, domain walls, vortices
- Heisenberg model: atomistic effects, spin chains
- Quantum theory: electronic structure

5 Magnetism on Different Time Scales (figure)

6 Most Recent Achievement
Discovery of the spin Cherenkov effect [1] (the magnetic equivalent of a sonic boom).
Geometry: 2 µm x 1 µm x 1 µm Permalloy prism at 5 nm resolution (100 million tetrahedrons, 16 million discretisation nodes).
[1] M. Yan, A. Kákay, C. Andreas, & R. Hertel. Spin Cherenkov effect and magnonic Mach cones. Physical Review B (Rapid Communications), 88, (R) (2013)

7 About TetraMag
Code started by Riccardo Hertel [1], extended and ported by Attila Kákay and Elmar Westphal [2][3].
Upcoming: details about
- calculation steps
- matrix properties
- older CUDA versions
- new challenges
[1] Hertel, R. (2001). Micromagnetic simulations of magnetostatically coupled Nickel nanowires. Journal of Applied Physics, 90(11).
[2] Kákay, A., Westphal, E., & Hertel, R. (2010). Speedup of FEM micromagnetic simulations with graphical processing units. IEEE Transactions on Magnetics, 46(6).
[3] MagnetizationDynamics/Simulations/_node.html

8 Calculation Steps
- Calculate the magnetostatic field via a scalar potential, split into two parts: U = U1 + U2.
  - U1 is the solution of the inhomogeneous Neumann problem.
  - U2 satisfies Laplace's equation with Dirichlet boundary conditions.
- Solve/integrate the Landau-Lifshitz-Gilbert equation of motion.
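For reference, the equations behind this split in their standard form (the Fredkin-Koehler FEM/BEM decomposition; the slide itself does not spell them out), followed by the Gilbert form of the equation of motion:

```latex
% Scalar-potential split of the magnetostatic field, U = U_1 + U_2:
% U_1 solves the inhomogeneous Neumann problem inside the magnetic region
\nabla^2 U_1 = \nabla \cdot \mathbf{M}, \qquad
\left.\frac{\partial U_1}{\partial n}\right|_{\partial\Omega} = \mathbf{M} \cdot \mathbf{n}
% U_2 satisfies Laplace's equation, with Dirichlet boundary values
% supplied by the BEM step
\nabla^2 U_2 = 0
% Landau-Lifshitz-Gilbert equation of motion
\frac{\partial \mathbf{m}}{\partial t}
  = -\gamma \, \mathbf{m} \times \mathbf{H}_{\mathrm{eff}}
  + \alpha \, \mathbf{m} \times \frac{\partial \mathbf{m}}{\partial t}
```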

9 Magnetostatic Field
Calculated in 3 steps:
1. Iterative solution of a sparse linear system for U1.
2. Dense/hierarchical matrix-vector multiplication to obtain U2 in the boundary regions (not covered in this talk).
3. Iterative solution of a sparse linear system for U2 within the magnetic region.
The sparse systems are solved using multi-GPU linbcg and BiCGstab solvers.

10 Time Integration
- Includes field calculations and vector operations.
- Performed using CVODE from the Sundials package.
- CVODE's NVector structure and its operations have been ported to CUDA for single-host/multi-GPU systems.
- Memory consuming (~1 KB/node, the limiting factor for GPU usage):
  - field calculations use a sparse matrix and several field vectors
  - CVODE internally needs many helper vectors
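To make the porting work concrete, here is a minimal CUDA sketch of one such NVector operation, the linear sum z = a*x + b*y that CVODE applies constantly; the names are illustrative, not TetraMag's actual code:

```cuda
#include <cuda_runtime.h>

// One NVector operation as a CUDA kernel: z = a*x + b*y (sketch).
__global__ void nvLinearSum(double a, const double *x,
                            double b, const double *y,
                            double *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = a * x[i] + b * y[i];
}

// Host wrapper in the shape CVODE expects: one C function per operation.
// In a multi-GPU NVector this would loop over the devices holding the
// vector's partitions (device selection elided here).
void N_VLinearSum_GPU(double a, const double *d_x,
                      double b, const double *d_y,
                      double *d_z, int n)
{
    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    nvLinearSum<<<blocks, threads>>>(a, d_x, b, d_y, d_z, n);
}
```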

11 Properties of TetraMag's Sparse Matrices
- Contain ~15 non-zero elements per row/column.
- The distribution of elements depends on the underlying mesh (regular to seemingly random).
- Symmetric (for the magnetostatic field calculation) or asymmetric (for the exchange field calculation).
- Static over the whole program run.
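For concreteness, a sketch of what such a matrix costs in a plain CSR-like layout, assuming double-precision values and 32-bit indices (the baseline the memory-footprint reductions below start from):

```cuda
// CSR-like storage (sketch, not TetraMag's actual structure).
// With ~15 non-zeros per row this costs roughly
// 15 * (8 B value + 4 B column index) + 4 B row pointer, i.e. ~184 B/row.
struct CsrMatrix {
    int     nRows;
    int    *rowPtr;  // nRows + 1 offsets into col[]/val[]
    int    *col;     // column index per non-zero (4 B each)
    double *val;     // value per non-zero (8 B each)
};
```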

12 TetraMagCUDA
- Around since ~2009 (see the GTC 2010 poster [1]).
- CUDA parts were piggy-backed onto CPU routines.
- Tries to copy as many sparse matrices to device memory as possible; the remainder stays on the CPU.
- GPU-only execution limited to problem sizes of ~1M nodes, sufficient for most use cases at the time.
[1] /Q02-Massively-Parallel-Micromagnetic-FEM-Calculations-with-Graphical-Processing-Units.pdf

13 New Challenges
- Simulations at experimental scale (µm) require larger-scale simulations (10M+ nodes).
- A single matrix plus the solver/integrator vectors often exceeds device memory.
- Copying matrices sequentially is no longer sufficient.
Possible solutions:
- reduce the memory footprint
- distribute the problem over multiple GPUs

14 Reduce Memory Footprint
- Use symmetry:
  - effective, but not possible for all matrices
  - often needs atomic operations (slow)
- Reduce the precision of off-diagonal elements:
  - moderately effective (8+4 -> 4+4 bytes/value)
  - may lead to unacceptable loss of precision

15 Reduce Memory Footprint
- Extract diagonals:
  - good for performance (coalesced access)
  - very good results for symmetric matrices based on regular meshes (2 x (8+4) -> 8 bytes)
  - mixed results otherwise
- Combinations of some/all of the above (see the sketch below).
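A minimal sketch combining the two reductions: extracted diagonals stored without column indices (coalesced, one value per row and diagonal) and the remaining off-diagonal elements kept in single precision. This illustrates the technique; it is not TetraMag's actual kernel:

```cuda
// y = A*x with A split into extracted diagonals (no column indices needed)
// and a reduced-precision CSR remainder (sketch).
__global__ void spmvDiaPlusCsr(int nRows,
                               int nDiags, const int *diagOffset,
                               const double *diagVal,  // nDiags * nRows, column-major
                               const int *rowPtr, const int *col,
                               const float *offVal,    // remainder in single precision
                               const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;

    double sum = 0.0;
    // Diagonal part: element of diagonal d in this row sits at column row+offset.
    for (int d = 0; d < nDiags; ++d) {
        int c = row + diagOffset[d];
        if (c >= 0 && c < nRows)
            sum += diagVal[d * nRows + row] * x[c];  // coalesced access in 'row'
    }
    // Remainder: ordinary CSR loop; float values halve the memory traffic.
    for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
        sum += (double)offVal[j] * x[col[j]];
    y[row] = sum;
}
```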

16 Preprocessing Approaches for Matrix Distribution
Matrices can be preprocessed for the GPU(s) in different ways:
- traditional single-GPU calculation
- naive multi-GPU distribution
- checkerboard-style distribution

17-18 Traditional MVM on a Single GPU (figure: the full matrix is multiplied by the full vector on one GPU)

19 Distribution over Multiple GPUs
Naive approach:
1. Divide the matrix into N_GPU sub-matrices of N_rows/N_GPU rows each.
2. Copy one sub-matrix to each GPU.
3. Copy the vector to all GPUs.
4. Perform the partial multiplications.
5. Copy the partial results to all other GPUs.
6. Repeat (if needed).
A host-side sketch of this scheme follows below.
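A host-side sketch of the naive scheme, assuming peer-to-peer copies between all GPUs; the sub-matrix setup and the SpMV kernel itself (spmvKernel, commented out) are elided:

```cuda
#include <cuda_runtime.h>

// Naive multi-GPU MVM sketch (illustrative, not TetraMag's code).
// GPU g owns rows [g*chunk, (g+1)*chunk); d_x[g] holds a full-length copy
// of the input vector, d_y[g] the chunk of the result computed on GPU g.
void naiveMultiGpuMvm(int nGpus, int n, double **d_x, double **d_y)
{
    int chunk = n / nGpus;               // assume n divisible by nGpus
    // 1. Every sub-matrix needs the *whole* vector: broadcast all parts.
    for (int src = 0; src < nGpus; ++src)
        for (int dst = 0; dst < nGpus; ++dst)
            if (dst != src)
                cudaMemcpyPeer(d_x[dst] + src * chunk, dst,
                               d_x[src] + src * chunk, src,
                               chunk * sizeof(double));
    // 2. Partial multiplications, one sub-matrix per GPU (kernel elided).
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        // spmvKernel<<<grid, block>>>(subMatrix[g], d_x[g], d_y[g]);
    }
    // 3. Before the next multiplication the updated vector parts must be
    //    broadcast again (step 1): the transfer cost rivals the compute cost.
}
```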

20-23 Naive Distribution, Preparation and First Multiplication (figure: copy the partial matrices to the other GPUs, copy the vector parts to the other GPUs, then calculate)

24-26 Naive Distribution, Subsequent Multiplications (figure: copy the updated vector parts to the other GPUs, then calculate)

27 Naive Distribution Approach
Pro:
- easy to implement
Con:
- all sub-matrices need vector data from all GPUs at the beginning of the calculation
- the data transfer overhead is about as expensive as the actual calculation, so performance is often below the single-GPU solution

28 Checkerboard Approach
1. Split each sub-matrix into N_GPU sub-sub-matrices; each of these needs vector values from only one GPU.
2. Perform the multiplication of the first sub-sub-matrix with its partial vector.
3. At the same time, copy the vector part needed for the next multiplication into a (double) buffer in a different stream.
4. Repeat for the other sub-sub-matrices.
A per-GPU sketch of this loop follows below.
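A per-GPU sketch of this double-buffered loop, assuming one compute and one copy stream synchronised by events; prefetchVectorBlock and spmvSubSubMatrix are hypothetical stand-ins for the peer copy and the partial SpMV:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: asynchronously fetch the vector block needed by
// sub-sub-matrix k from its owning GPU into buf (peer copy elided).
static void prefetchVectorBlock(int k, double *buf, cudaStream_t copyStream)
{
    (void)k; (void)buf; (void)copyStream;
    // cudaMemcpyPeerAsync(buf, myGpu, peerBlockPtr(k), peerGpu(k),
    //                     blockBytes(k), copyStream);
}

// Multiply sub-sub-matrix k on the compute stream while the copy stream
// prefetches the vector block for k+1 into the other half of the buffer.
void checkerboardMvm(int nGpus, cudaStream_t compute, cudaStream_t copy,
                     double *buf[2])
{
    cudaEvent_t ready[2], consumed[2];
    for (int i = 0; i < 2; ++i) {
        cudaEventCreate(&ready[i]);
        cudaEventCreate(&consumed[i]);
        cudaEventRecord(consumed[i], compute);       // both buffers start free
    }
    prefetchVectorBlock(0, buf[0], copy);            // first block up front
    cudaEventRecord(ready[0], copy);

    for (int k = 0; k < nGpus; ++k) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < nGpus) {
            // Don't overwrite buf[nxt] before the kernel that read it is done.
            cudaStreamWaitEvent(copy, consumed[nxt], 0);
            prefetchVectorBlock(k + 1, buf[nxt], copy);
            cudaEventRecord(ready[nxt], copy);
        }
        cudaStreamWaitEvent(compute, ready[cur], 0); // block k's data arrived?
        // spmvSubSubMatrix<<<grid, block, 0, compute>>>(subSub[k], buf[cur], y);
        cudaEventRecord(consumed[cur], compute);
    }
    for (int i = 0; i < 2; ++i) {
        cudaEventDestroy(ready[i]);
        cudaEventDestroy(consumed[i]);
    }
}
```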

29-33 Checkerboard-Style MVM (figure: the partial matrix on each GPU is processed block by block; each block needs vector data from only one GPU, the partial vectors arrive in a double buffer while the previous block is multiplied, and the partial results accumulate)

34 Matrix Preprocessing
- The original matrix is in a CSR-like format.
- Blocks of N x warpsize rows are transposed to enable coalesced memory access.
- Distributing the data destroys the uniformity of the row lengths, so zero padding may be necessary (wasted memory).
- Rows are therefore sorted and re-indexed by their number of non-zero elements, giving minimal padding.
A sketch of the resulting layout and access pattern follows below.
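A sketch of the access pattern this layout enables: within each block of warpSize rows the non-zeros are stored element-major ("transposed"), so the threads of a warp read consecutive addresses; sorting rows by length keeps the per-block zero padding small. Names are illustrative, and scattering the result back to the original row order is elided:

```cuda
// SpMV over transposed row blocks (sliced-ELL-like layout, sketch).
// 'row' is the index after sorting/re-indexing by row length.
__global__ void spmvSlicedEll(int nRows,
                              const int *blockStart,   // data offset per row block
                              const int *blockRowLen,  // padded row length per block
                              const int *colIdx, const double *val,
                              const double *x, double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;
    int blk  = row / warpSize;   // which block of warpSize rows
    int lane = row % warpSize;   // position inside the block

    double sum = 0.0;
    int base = blockStart[blk];
    for (int j = 0; j < blockRowLen[blk]; ++j) {
        // Transposed layout: element j of all warpSize rows is contiguous,
        // so the warp's loads coalesce.
        int idx = base + j * warpSize + lane;
        int c   = colIdx[idx];
        if (c >= 0)              // padding entries carry column index -1
            sum += val[idx] * x[c];
    }
    y[row] = sum;
}
```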

35 Optimizing Data Transfers
- Depending on the problem, sub-sub-matrices can get very small and access only very few vector elements.
- The multiplication time is then short, while transferring potentially unneeded elements takes much longer.
- Solution: transfer only those elements that are really needed.

36 Creating Export Vectors
Each GPU needs an individual set of vector elements from every other GPU. Possible approaches:
1. Create N_GPU x (N_GPU-1) export vectors: consumes much memory.
2. Rewrite the export vector N_GPU x (N_GPU-1) times: may need to read/write all elements N_GPU x (N_GPU-1) times.
3. Do some (lots of) work in preprocessing.

37 Building the Perfect(?) Set of Export Vectors
During preprocessing:
- Build a key value encoding which other GPUs need a given vector element.
- Using one bit per target GPU results in at most 2^(N_GPU-1) key values (usually N_GPU <= 8 in a node).
- Use the values as sort keys and re-index the elements in sorted order.
- Store the block offsets for the keys.
A Thrust-based sketch of this step follows below.
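A sketch of this preprocessing step using Thrust (illustrative; the kernel that scans the distributed matrix and sets the key bits is elided). Sorting the element indices by key groups them into at most 2^(N_GPU-1) contiguous blocks whose offsets are stored for the multiplication loop:

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// key[i]: bitmask of the other GPUs that need local vector element i.
// On return, perm holds the export order and blockOffset[k] the start of
// the block with key k (blockOffset sized to #keys + 1 by the caller,
// so the final entry ends up equal to n and delimits the last block).
void buildExportIndex(thrust::device_vector<unsigned> &key,
                      thrust::device_vector<int> &perm,
                      std::vector<int> &blockOffset)
{
    const size_t n = key.size();
    perm.resize(n);
    thrust::sequence(perm.begin(), perm.end());                 // 0, 1, ..., n-1
    thrust::sort_by_key(key.begin(), key.end(), perm.begin());  // group by key

    // Start offset of each key's block; during multiplication, every block
    // whose key has bit b set is streamed to the (b+1)-th next GPU.
    thrust::host_vector<unsigned> h_key = key;
    size_t pos = 0;
    for (size_t k = 0; k < blockOffset.size(); ++k) {
        while (pos < n && h_key[pos] < k) ++pos;
        blockOffset[k] = (int)pos;
    }
}
```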

38 Building the Perfect(?) Set of Export Vectors (cont.)
During multiplication:
- Build the export vector according to the index generated in preprocessing.
- During the loop over all sub-sub-matrices:
  - loop over all export blocks
  - asynchronously copy the blocks with elements needed by the next GPU into its buffer

39-43 Key values are calculated relative to the originating GPU (worked example with 4 GPUs): the vector elements accessed in columns 0-2 are originally on GPU 0; relative to GPU 0, the key bit values for GPUs 0-3 are 0 (self), 1 (001), 2 (010) and 4 (100). An element needed only by GPU 3 thus gets key 4; one needed by GPUs 1 and 2 gets key 1+2 = 3. Columns 6-8 are originally on GPU 2; relative to GPU 2, the bit values for GPUs 0-3 are 2 (010), 4 (100), 0 (self) and 1 (001), so an element needed by GPUs 1 and 3 gets key 4+1 = 5.

44-50 Build the export buffer during preprocessing (worked example for GPU 0): the keys computed from the matrix are sorted, and the vector elements are reordered to the resulting export index; each export position stores its reference key. During the multiplication loop, contiguous blocks of this export vector are sent to the export streams: every block whose key has bit 001 set is exported to GPU 1, every block with bit 010 to GPU 2, and every block with bit 100 to GPU 3.

51 Pros and Cons
Pro:
- the export vector is built only once per multiplication
- no element needs to be stored more than once
- no element is transferred more often than necessary
Con:
- limited number of GPUs, because the number of blocks grows exponentially
- time-consuming preprocessing

52 Further Optimisations
- Sub-sub-matrix multiplications are ordered by matrix size (number of non-zero elements); larger transfers are then more likely to overlap with longer calculations.
- Export vectors are copied to host memory during the initial calculations, which allows parallel import on devices with only one copy engine.

53 Preprocessing Time
- Preprocessing involves multiple expensive indexing and sorting steps.
- It can take on the order of seconds per matrix (with n ~ 20M), depending on the number of GPUs used.
- It happens only once, because the matrices are static; a typical run includes millions of solver iterations/matrix multiplications.

54 Benchmarks
Solver iteration times for:
- a cube with a regular mesh: very diagonally dominant matrices, little data transfer
- two problems of similar size and nature:
  - a round disk with an irregular mesh structure: few extractable diagonals, lots of data transfer
  - a round disk with a partially regular mesh structure: some extractable diagonals, moderate data transfer

55 Time per solver iteration for the 1 µm cube (8.1M nodes, regular mesh) (chart: time per solver iteration [ms] vs. number of GPUs, for Titan and GTX 690; CPU comparison: 2 x E (10 cores, 2.8 GHz): 134 ms)

56 Time per solver iteration for the disk (6.x M nodes, different mesh structures) (chart: time per solver iteration [ms] vs. number of GPUs, for Titan and GTX 690 on regular and irregular meshes, plus GTX 690 using 1 GPU/card on the irregular mesh; CPU comparison: 2 x E (10 cores, 2.8 GHz): 100 ms regular, 118 ms irregular)

57 Conclusions
- Distributing matrices and vectors over multiple GPUs allows us to simulate significantly larger samples.
- Performance scaling depends largely on the amount of data exchanged between GPUs.
- Optimising the mesh structure is very important in multi-GPU setups.

58 Questions?
