An Integrative Model for Parallelism


1 An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09

2 Introduction
  Formal part
  Examples
  Extension to other memory models
  Conclusion

3 Introduction

4 Practical starting point
Data movement is very important: we need to model it. I'm starting from distributed memory. We understand parallelism through MPI (running at 300,000 cores); the problem is one of programmability.
Goal: to find a high-level way of parallel programming that can be compiled down to what I would have written in MPI:
  high productivity, because we work in global terms;
  portable performance, since the MPI code is derived.
Pie in the sky: apart from distributed memory, we would like to cover task parallelism, multicore, and accelerators.

5 My claim
It is possible to have a global model of parallelism in which MPI-type parallelism can be derived through formal reasoning about global objects. The model is versatile enough to deal with applications and memory models beyond linear algebra on clusters. This offers prospects for a high-productivity, high-performance way of parallel programming. It is no silver bullet: the algorithm designer still needs to design the data and work distributions.

6 Claim in CS terms
Given distributions of input, output, and instructions, a communication pattern becomes a formally derived first-class object; a communication is an instantiation of this object.
Consequences:
  communication is properly aggregated;
  data can be sent (asynchronously) as it becomes available;
  inspector-executor model: the pattern is derived once, then applied many times.

7 Working title: I/MP: Integrative Model for Parallelism

8 Formal part

9 Theoretical model
A computation is a bipartite graph from input to output; the edges are instructions:
$A = \langle \mathrm{In}, \mathrm{Out}, E \rangle$
where an edge $(\alpha, \beta) \in E$ with $\alpha \in \mathrm{In}$, $\beta \in \mathrm{Out}$ is an elementary computation between data elements.
This models a kernel (not a whole code), possibly even less: compare supersteps in BSP.
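
As a minimal illustration (not from the talk), such a kernel can be coded directly as a triple of input set, output set, and edge set; all names here are purely illustrative:

```python
def make_kernel(inputs, outputs, edges):
    """A kernel as a bipartite graph <In, Out, E>: every edge (a, b)
    connects an input element a to an output element b."""
    In, Out, E = set(inputs), set(outputs), set(edges)
    assert all(a in In and b in Out for (a, b) in E)
    return {"In": In, "Out": Out, "E": E}

# Example: the edges of t_ij <- a_ij * x_j for a 2x2 matrix-vector product.
xs = {("x", j) for j in range(2)}
ts = {("t", i, j) for i in range(2) for j in range(2)}
edges = {(("x", j), ("t", i, j)) for i in range(2) for j in range(2)}
kernel = make_kernel(xs, ts, edges)
```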

10 Parallel computation
A parallel computation is $A = \{A_1, \ldots, A_P\}$ where
$A_p = \langle \mathrm{In}_p, \mathrm{Out}_p, E_p \rangle$, $\mathrm{In} = \bigcup_p \mathrm{In}_p$, $\mathrm{Out} = \bigcup_p \mathrm{Out}_p$, $E = \bigcup_p E_p$.
The partitions are possibly non-disjoint (redundant work).
Example: if all $\mathrm{In}_p$, $\mathrm{Out}_p$ are identical, this is shared memory.
Note the explicit distribution of work, independently of the data.

11 Where does communication come from?
Define the total input/output of a processor:
$\mathrm{In}(E_p) = \{\alpha : (\alpha, \beta) \in E_p\}$, $\mathrm{Out}(E_p) = \{\beta : (\alpha, \beta) \in E_p\}$.
These are not the same as $\mathrm{In}_p$, $\mathrm{Out}_p$:
  $\mathrm{In}_p$: local part of the input; $\mathrm{In}(E_p)$: total needed input;
  $\mathrm{Out}_p$: local part of the output; $\mathrm{Out}(E_p)$: total computed output.
Communication: $\mathrm{In}(E_p) \setminus \mathrm{In}_p$ is communicated in before the computation; $\mathrm{Out}(E_p) \setminus \mathrm{Out}_p$ is communicated out after the computation.
Example (owner computes): $\mathrm{Out}(E_p) = \mathrm{Out}_p$.
Example (embarrassingly parallel): $\mathrm{In}(E_p) = \mathrm{In}_p$, $\mathrm{Out}(E_p) = \mathrm{Out}_p$, and everything disjoint.
By defining the distributions, communication becomes formally derivable.
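
A small sketch of how the communication sets fall out of these definitions, using the same set notation as above (illustrative code, not an IMP implementation):

```python
def In_of(E_p):                 # In(E_p): total needed input
    return {a for (a, b) in E_p}

def Out_of(E_p):                # Out(E_p): total computed output
    return {b for (a, b) in E_p}

def comm_in(E_p, In_p):         # In(E_p) \ In_p: communicated in
    return In_of(E_p) - In_p

def comm_out(E_p, Out_p):       # Out(E_p) \ Out_p: communicated out
    return Out_of(E_p) - Out_p

# Owner computes: Out_of(E_p) == Out_p, so comm_out(E_p, Out_p) is empty.
```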

12 Object view of communication
Analyze all computation edges in terms of their source/target processors $p, q$. The sum over all $p, q$ of these is an object describing the total communication in a kernel.
Compare a PETSc VecScatter object: it describes what goes where, not how.
Compare active messages: communication and computation in one.
Communication objects are derivable from the algorithm; they are not defined during the execution of the communication: the inspector/executor paradigm.
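
A hedged sketch of what such a first-class communication object might look like: an inspector derives the message pattern once, an executor instantiates it on actual data many times. All names (CommPattern, owner, send) are hypothetical, not an existing API:

```python
class CommPattern:
    def __init__(self, local_in, edge_sets, owner):
        # Inspector: for each processor p, find the elements it needs
        # but does not own, and who owns them. Done once per kernel.
        self.messages = []                    # list of (q, p, element)
        for p, E_p in edge_sets.items():
            for elt in {a for (a, b) in E_p} - local_in[p]:
                self.messages.append((owner[elt], p, elt))

    def execute(self, data, send):
        # Executor: instantiate the derived pattern on actual data;
        # repeatable every iteration without re-deriving the pattern.
        for q, p, elt in self.messages:
            send(src=q, dst=p, value=data[elt])
```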

13 Examples

14 What am I going to show you?
The algorithm is mathematics.
Data distributions are easy to design.
The communication is derived through formal reasoning.
Some distributions lead to efficient communication, some don't: we still need a smart programmer.
Today I'm only showing you math, but I think it's implementable.

15 Example: matrix-vector product
$\forall i: y_i = \sum_j a_{ij} x_j$.
Split the computation into temporaries and a reduction:
$\forall i: y_i = \sum_j a_{ij} x_j = \sum_j t_{ij}$, where $t_{ij} = a_{ij} x_j$.
Work distribution: $t_{ij}$ is computed where $a_{ij}$ lives.
We will derive the communication for a row-distributed and a column-distributed matrix.

16 MVP by rows
Data: $\mathrm{In}_p = x(I_p)$.
Work: $E_p = \{\tau_{ij} : i \in I_p\}$ where $\tau_{ij} : t_{ij} \leftarrow a_{ij} x_j$.
Derived fact: $\mathrm{In}(E_p) = x(*)$: allgather.
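
For concreteness, a minimal mpi4py sketch of the derived pattern: an allgather of $x$ followed by purely local work. The sizes and block-row layout are illustrative assumptions, not part of the model:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
n_local = 2                            # number of rows I_p owned locally
n = n_local * P

A_local = np.random.rand(n_local, n)   # the rows I_p of A
x_local = np.random.rand(n_local)      # In_p = x(I_p)

x = np.empty(n)
comm.Allgather(x_local, x)             # derived: In(E_p) = x(*)
y_local = A_local @ x                  # tau_ij and the reduction are local
```

The reduction step of the next slide is the local `A_local @ x` line: no further communication is derived.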

17 MVP by rows, reduction
Data: $\mathrm{In}_p = t(I_p, *)$.
Work: $E_p = \{\eta_i : i \in I_p\}$ where $\eta_i : y_i \leftarrow \sum_j t_{ij}$.
Derived fact: $\mathrm{In}(E_p) = t(I_p, *)$: a local operation.

18 MVP by columns
The $t_{ij}$ calculation: $\mathrm{In}(E_p) = x(I_p)$ and $\mathrm{In}_p = x(I_p)$: local.
$\mathrm{Out}(E_p) = t(*, I_p)$.

19 MVP by columns, reduction
$\mathrm{In}(E_p) = t(I_p, *)$, but now $\mathrm{In}_p = t(*, I_p)$:
derived: a global transpose, or reduce-scatter.
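
A matching mpi4py sketch for the column case, where the derived communication is a reduce-scatter of the partial sums; again, sizes and layout are illustrative assumptions:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
n_local = 2                            # number of columns I_p owned locally
n = n_local * P

A_local = np.random.rand(n, n_local)   # the columns I_p of A
x_local = np.random.rand(n_local)      # In_p = x(I_p): already local

t_partial = A_local @ x_local          # t(*, I_p), summed over the local j
y_local = np.empty(n_local)
comm.Reduce_scatter_block(t_partial, y_local, op=MPI.SUM)
```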

20 Sparse MVP
A little more notation, but no essential difference.

21 LU solution
The derivation of the messages is identical, but the communication is no longer data parallel: a VecPipeline instead of a VecScatter; dataflow.
Note: IMP kernels do not imply synchronization: dataflow, supply-driven execution.
Example: with one IMP kernel per level in an FFT, butterfly operations execute based on availability of data (a bit like DAG scheduling, but not entirely).
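
A toy sketch of supply-driven execution: an operation fires as soon as all its inputs are available, with no kernel-wide barrier. This is illustrative pseudo-runtime code, not the IMP design:

```python
def dataflow_execute(ops, initially_available):
    """ops: list of (inputs, output, function) triples.
    Repeatedly fire every operation whose inputs are all available."""
    available = dict(initially_available)   # element -> value
    pending = list(ops)
    while pending:
        still_pending = []
        for inputs, output, fn in pending:
            if all(i in available for i in inputs):
                available[output] = fn(*(available[i] for i in inputs))
            else:
                still_pending.append((inputs, output, fn))
        if len(still_pending) == len(pending):
            raise RuntimeError("no progress: unsatisfiable dependencies")
        pending = still_pending
    return available
```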

22 Complicated example: n-body problems
Tree computation framework according to Katzenelson (1989).
The field due to cell $i$ on level $l$ is given by
$g(l, i) = \sum_{j \in C(i)} g(l+1, j)$
where $C(i)$ denotes the set of children of cell $i$.
The field felt by cell $i$ on level $l$ is given by
$f(l, i) = f(l-1, p(i)) + \sum_{j \in I_l(i)} g(l, j)$
where $p(i)$ is the parent cell of $i$ and $I_l(i)$ the interaction region of $i$.

23 Implementation
$E^{(g)} = E^\tau \cup E^\gamma$ with $E^\tau = \{\tau^l_{ij}\}$, $E^\gamma = \{\gamma^l_i\}$, where
$\forall i, j \in C(i): \tau^l_{ij} : t^l_{ij} \leftarrow g^{l+1}_j$
$\forall i: \gamma^l_i : g^l_i \leftarrow \sum_{j \in C(i)} t^l_{ij}$
$E^{(f)} = E^\rho \cup E^\sigma \cup E^\phi$ with $E^\rho = \{\rho^l_i\}$, $E^\sigma = \{\sigma^l_{ij}\}$, $E^\phi = \{\phi^l_i\}$, where
$\forall i: \rho^l_i : r^l_i \leftarrow f^{l-1}_{p(i)}$
$\forall i, j \in I_l(i): \sigma^l_{ij} : s^l_{ij} \leftarrow g^l_j$
$\forall i: \phi^l_i : f^l_i \leftarrow r^l_i + \sum_{j \in I_l(i)} s^l_{ij}$
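
A sketch of the two tree passes in this notation, assuming the tree structure (cells, children, parent, interaction) is given as plain dictionaries; the data layout is a hypothetical choice for illustration, not Katzenelson's code:

```python
def upward_pass(nlevels, cells, children, g):
    # g[nlevels-1][i] is given on the finest level; fill coarser levels:
    # g(l,i) = sum over j in C(i) of g(l+1, j)
    for l in range(nlevels - 2, -1, -1):
        g[l] = {i: sum(g[l + 1][j] for j in children[(l, i)])
                for i in cells[l]}
    return g

def downward_pass(nlevels, cells, parent, interaction, g, f):
    # f[0][i] is given at the root level; fill finer levels:
    # f(l,i) = f(l-1, p(i)) + sum over j in I_l(i) of g(l, j)
    for l in range(1, nlevels):
        f[l] = {i: f[l - 1][parent[(l, i)]]
                   + sum(g[l][j] for j in interaction[(l, i)])
                for i in cells[l]}
    return f
```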

24 Extension to other memory models

25 Attached accelerators

26 Decompose GPU-GPU operations $e$ as $e = A^{-1}\,\bar{e}\,A$, where $A$ is the memory copy from CPU to GPU, so that $\bar{e} = A\,e\,A^{-1}$ acts between CPUs: a quotient graph.
We have derived the MPI communication of an algorithm with distributed GPUs through formal means.
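
A conceptual sketch of this factoring, with to_device/from_device standing in for real host-device copies (hypothetical placeholders, not a real API):

```python
def lift_to_gpu(e_bar, to_device, from_device):
    """Build the GPU-GPU operation e = A^{-1} e_bar A from the CPU-CPU
    quotient edge e_bar, where A is the CPU-to-GPU copy."""
    def e(x_gpu):
        x_cpu = from_device(x_gpu)   # apply A^{-1}: copy GPU -> CPU
        y_cpu = e_bar(x_cpu)         # quotient edge between CPUs (e.g. MPI)
        return to_device(y_cpu)      # apply A: copy CPU -> GPU
    return e
```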

27 Shared memory multicore
Processors still operate on local stores (L1/L2 cache), so it's really distributed memory, except with the shared memory as the network. Again, factor out physical data movement from logical data movement (this still needs work).

28 Conclusion

29 Conclusion
Integrative Model for Parallelism; also: Interface without Message Passing.
Based on explicit distribution of everything that needs distributing.
Communication objects are derived through formal reasoning.
Trivially covers distributed memory; can be interpreted for other models.
Theoretical questions remain: correctness, deadlock, relation to DAG models, interpretation as a functional model, relation to BSP, suitability for graph algorithms.
Practical question: how do you implement this? (Inspiration from PETSc, ParPre, Charm++, Elemental.)
