An Integrative Model for Parallelism
|
|
- Domenic Banks
- 5 years ago
- Views:
Transcription
1 An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09
2 Introduction Formal part Examples Extension to other memory models Conclusion tw-12-exascale 2012/01/09 2
3 Introduction tw-12-exascale 2012/01/09 3
4 Practical starting point Data movement is very important: need to model this I m starting with distributed memory. We understand parallelism through MPI (running at 300,000 cores); the problem is one of programmability. Goal To find a high level way of parallel programming, that can be compiled down to what I would have written in MPI high productivity because working in global terms portable performance since the MPI code is derived Pie in the sky Apart from distributed memory, we would like to cover task parallelism, multicore, accelerators tw-12-exascale 2012/01/09 4
5 My claim It is possible to have a global model of parallelism where MPI-type parallelism can be derived through formal reasoning about global objects. The model is versatile in dealing with applications and memory models beyond linear algebra on clusters. Prospects for a high productivity high performance way of parallel programming. No silver bullet: algorithm designer still needs to design data and work distributions. tw-12-exascale 2012/01/09 5
6 Claim in CS terms Given distributions on input, output, instructions: a communication pattern becomes a formally derived first-class object; communication is an instantiation of this object Consequences: communication is properly aggregated data can be sent (asynchronously) when available inspector-executor model where the pattern is derived once, then applied many times. tw-12-exascale 2012/01/09 6
7 Working title: I/MP: Integrative Model for Parallelism tw-12-exascale 2012/01/09 7
8 Formal part tw-12-exascale 2012/01/09 8
9 Theoretical model Computation is bipartite graph from input to output, edges are instructions Edge: A = In, Out, E (α, β) E : α In, β Out is elementary computation between data elements. This models a kernel, (not a whole code), possibly even less: compare supersteps in BSP tw-12-exascale 2012/01/09 9
10 Parallel computation: where A = A 1,..., A P, A p = In p, Out p, E p, In = p In p, Out = p Out p, E = p E p ; Possibly non-disjoint partitions (redundant work) Example. If all In p, Out p identical: shared memory Note explicit distribution of work, independently of data tw-12-exascale 2012/01/09 10
11 Where does communication come from? Define total input/output of a processor: In(E p ) = {α: (α, β) E p }, Out(E p ) = {β : (α, β) E p }. Not the same as In p, Out p! In p In(E p) Out p Out(E p) local part of input total needed input local part of output total computed output Communication: In(E p ) In p is communicated in before computation Communication: Out(E p ) Out p is communicated out after computation Example. Owner computes: Out(E p ) = Out p Example. Embarrassingly parallel: In(E p ) = In p, Out(E p ) = Out p, and everything disjoint By defining the distributions, communication becomes formally derivable tw-12-exascale 2012/01/09 11
12 Object view of communication Analyze all computation edges in terms of p, q source/target processors The sum over all p, q of these is an object describing the total communication in a kernel Compare to a PETSc VecScatter object: describe what goes where, not how. Compare to active messages: communication and computation in one. Communication objects are derivable from the algorithm, they are not defined during the execution of the communication; inspector/executor paradigm tw-12-exascale 2012/01/09 12
13 Examples tw-12-exascale 2012/01/09 13
14 What am I going to show you? The algorithm is mathematics Data distributions are easy to design The communication is derived through formal reasoning Some distributions lead to efficient communication, some don t: we still need a smart programmer. Today I m only showing your math, but I think it s implementable tw-12-exascale 2012/01/09 14
15 Example: Matrix-vector product In p local part of input In(E p) total needed input Out p local part of output Out(E p) total computed output i : y i = j a ij x j. Split computation in temporaries and reduction: i : y i = j a ijx j = j t ij, t ij = a ij x j. Work distribution: t ij is computed where a ij lives We will derive communication for row/col-distributed matrix tw-12-exascale 2012/01/09 15
16 MVP by rows Data: In p = x(i p ) In p In(E p) Out p Out(E p) local part of input total needed input local part of output total computed output work: { E p = {τ ij : i I p }, τ ij := t ij a ij x j derived fact: In(E p ) = x( ) allgather tw-12-exascale 2012/01/09 16
17 MVP by rows Data: In p = t(i p, ) work: { E p = {η i : i I p } η i := y i j t ij derived fact: In p In(E p) Out p Out(E p) In(E p ) = t(i p, ) local operation local part of input total needed input local part of output total computed output tw-12-exascale 2012/01/09 17
18 MVP by columns t ij calculation: In p In(E p) Out p Out(E p) local part of input total needed input local part of output total computed output local In(E p ) = x(i p ), In p = x(i p ) Out(E p ) = t(, I p ) tw-12-exascale 2012/01/09 18
19 MVP by columns reduction: In p In(E p) Out p Out(E p) local part of input total needed input local part of output total computed output In(E p ) = t(i p, ) but now derived: In p = t(, I p ) global transpose or reduce-scatter tw-12-exascale 2012/01/09 19
20 Sparse MVP A little more notation, but no essential difference tw-12-exascale 2012/01/09 20
21 LU solution Derivation of messages is identical, now no longer data parallel communication. VecPipeline instead of VecScatter, dataflow Note: IMP kernels do not imply synchronization: dataflow, supply driven execution Example: one IMP kernel per level in an FFT butterfly operations execute based on availability of data (a bit like DAG scheduling but not entirely) tw-12-exascale 2012/01/09 21
22 Complicated example: n-body problems Tree computation framework according to Katzenelson 1989: The field due to cell i on level l is given by g(l, i) = g(l + 1, j) j C(i) where C(i) denotes the set of children of cell i The field felt by cell i on level l is given by f (l, i) = f (l 1, p(i)) + g(l, j) j I l (i) p(i): the parent cell of i; I l (i): interaction region of i tw-12-exascale 2012/01/09 22
23 Implementation E (g) = E τ E γ, { i j C(i) : i : E τ = {τ l ij}, E γ = {γ l i } τij l = tl ij = g l+1 j γi l = gi l = j C(i) tij l E (f ) = E ρ E σ E φ, E ρ = {ρ l i }, E σ = {σij}, l E φ = {φ l i } i : ρ l i = ri l = f l 1 p(i) i j Il (i) : σij l = sl ij = g j l i : φ l i = fi l = ri l + j I l (i) sl ij tw-12-exascale 2012/01/09 23
24 Extension to other memory models tw-12-exascale 2012/01/09 24
25 Attached accelerators tw-12-exascale 2012/01/09 25
26 Decompose GPU-GPU operations E as e = A 1 e A where A is memory copy from CPU to GPU, so e = A e A 1 is between CPUs: quotient graph. We have derived the MPI communication of an algorithm with distributed GPUs through formal means tw-12-exascale 2012/01/09 26
27 Shared memory multicore Processors still operate on local stores (L1/L2 cache) So it s really distributed memory except with the shared memory as network again factor out physical data movement from logical data movement (this still needs work) tw-12-exascale 2012/01/09 27
28 Conclusion tw-12-exascale 2012/01/09 28
29 Conclusion Integrative Model for Parallelism also: Interface without Message Passing Based on explicit distribution of everything that needs distributing Communication objects derived through formal reasoning Trivially covers distributed memory, can be interpreted for other models Theoretical questions remain: correctness, deadlock, relation to DAG models, interpretation as functional model, relation to BSP, suitability for graph algorithms Practical question: how do you implement this? (Inspiration from PETSc, ParPre, Charm++, elemental) tw-12-exascale 2012/01/09 29
Accelerating linear algebra computations with hybrid GPU-multicore systems.
Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)
More informationHigh Performance Computing
Master Degree Program in Computer Science and Networking, 2014-15 High Performance Computing 2 nd appello February 11, 2015 Write your name, surname, student identification number (numero di matricola),
More informationDynamic Scheduling for Work Agglomeration on Heterogeneous Clusters
Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed
More informationLightweight Superscalar Task Execution in Distributed Memory
Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,
More informationSpecial Nodes for Interface
fi fi Special Nodes for Interface SW on processors Chip-level HW Board-level HW fi fi C code VHDL VHDL code retargetable compilation high-level synthesis SW costs HW costs partitioning (solve ILP) cluster
More informationOptimization Techniques for Parallel Code 1. Parallel programming models
Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 Goals of
More informationLecture 12: Adders, Sequential Circuits
Lecture 12: Adders, Sequential Circuits Today s topics: Carry-lookahead adder Clocks, latches, sequential circuits 1 Speed of Ripple Carry The carry propagates thru every 1-bit box: each 1-bit box sequentially
More informationLecture 12: Adders, Sequential Circuits
Lecture 12: Adders, Sequential Circuits Today s topics: Carry-lookahead adder Clocks, latches, sequential circuits 1 Incorporating beq Perform a b and confirm that the result is all zero s Source: H&P
More informationNCU EE -- DSP VLSI Design. Tsung-Han Tsai 1
NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using
More informationLecture 13: Sequential Circuits, FSM
Lecture 13: Sequential Circuits, FSM Today s topics: Sequential circuits Finite state machines 1 Clocks A microprocessor is composed of many different circuits that are operating simultaneously if each
More information11 Parallel programming models
237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages
More informationLecture 7: Dynamic Programming I: Optimal BSTs
5-750: Graduate Algorithms February, 06 Lecture 7: Dynamic Programming I: Optimal BSTs Lecturer: David Witmer Scribes: Ellango Jothimurugesan, Ziqiang Feng Overview The basic idea of dynamic programming
More informationPanorama des modèles et outils de programmation parallèle
Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &
More informationOverview: Synchronous Computations
Overview: Synchronous Computations barriers: linear, tree-based and butterfly degrees of synchronization synchronous example 1: Jacobi Iterations serial and parallel code, performance analysis synchronous
More informationLecture 13: Sequential Circuits
Lecture 13: Sequential Circuits Today s topics: Carry-lookahead adder Clocks and sequential circuits Finite state machines Reminder: Assignment 5 due on Thursday 10/12, mid-term exam Tuesday 10/24 1 Speed
More information1 / 28. Parallel Programming.
1 / 28 Parallel Programming pauldj@aices.rwth-aachen.de Collective Communication 2 / 28 Barrier Broadcast Reduce Scatter Gather Allgather Reduce-scatter Allreduce Alltoall. References Collective Communication:
More informationComputational complexity theory
Computational complexity theory Introduction to computational complexity theory Complexity (computability) theory deals with two aspects: Algorithm s complexity. Problem s complexity. References S. Cook,
More informationMulticore Parallelization of Determinant Quantum Monte Carlo Simulations
Multicore Parallelization of Determinant Quantum Monte Carlo Simulations Andrés Tomás, Che-Rung Lee, Zhaojun Bai, Richard Scalettar UC Davis SIAM Conference on Computation Science & Engineering Reno, March
More informationTR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems
TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a
More informationFaster arithmetic for number-theoretic transforms
University of New South Wales 7th October 2011, Macquarie University Plan for talk 1. Review number-theoretic transform (NTT) 2. Discuss typical butterfly algorithm 3. Improvements to butterfly algorithm
More informationB629 project - StreamIt MPI Backend. Nilesh Mahajan
B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected
More informationMulticore Semantics and Programming
Multicore Semantics and Programming Peter Sewell Tim Harris University of Cambridge Oracle October November, 2015 p. 1 These Lectures Part 1: Multicore Semantics: the concurrency of multiprocessors and
More informationDense Arithmetic over Finite Fields with CUMODP
Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,
More informationFinite-State Model Checking
EECS 219C: Computer-Aided Verification Intro. to Model Checking: Models and Properties Sanjit A. Seshia EECS, UC Berkeley Finite-State Model Checking G(p X q) Temporal logic q p FSM Model Checker Yes,
More informationAlgorithms as multilinear tensor equations
Algorithms as multilinear tensor equations Edgar Solomonik Department of Computer Science ETH Zurich Technische Universität München 18.1.2016 Edgar Solomonik Algorithms as multilinear tensor equations
More informationComputational complexity theory
Computational complexity theory Introduction to computational complexity theory Complexity (computability) theory deals with two aspects: Algorithm s complexity. Problem s complexity. References S. Cook,
More informationEEC 281 VLSI Digital Signal Processing Notes on the RRI-FFT
EEC 281 VLSI Digital Signal Processing Notes on the RRI-FFT Bevan Baas The attached description of the RRI-FFT is taken from a chapter describing the cached-fft algorithm and there are therefore occasional
More informationMatrix Assembly in FEA
Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationCausality and Time. The Happens-Before Relation
Causality and Time The Happens-Before Relation Because executions are sequences of events, they induce a total order on all the events It is possible that two events by different processors do not influence
More informationParallel LU Decomposition (PSC 2.3) Lecture 2.3 Parallel LU
Parallel LU Decomposition (PSC 2.3) 1 / 20 Designing a parallel algorithm Main question: how to distribute the data? What data? The matrix A and the permutation π. Data distribution + sequential algorithm
More informationDynamic Scheduling within MAGMA
Dynamic Scheduling within MAGMA Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Julien Langou, Hatem Ltaief, Samuel Thibault and Stanimire Tomov April 5, 2012 Innovative and Computing
More informationNotes on the Matrix-Tree theorem and Cayley s tree enumerator
Notes on the Matrix-Tree theorem and Cayley s tree enumerator 1 Cayley s tree enumerator Recall that the degree of a vertex in a tree (or in any graph) is the number of edges emanating from it We will
More informationMA554 Assessment 1 Cosets and Lagrange s theorem
MA554 Assessment 1 Cosets and Lagrange s theorem These are notes on cosets and Lagrange s theorem; they go over some material from the lectures again, and they have some new material it is all examinable,
More informationAlgorithms Exam TIN093 /DIT602
Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405
More informationCosets and Lagrange s theorem
Cosets and Lagrange s theorem These are notes on cosets and Lagrange s theorem some of which may already have been lecturer. There are some questions for you included in the text. You should write the
More informationOBEUS. (Object-Based Environment for Urban Simulation) Shareware Version. Itzhak Benenson 1,2, Slava Birfur 1, Vlad Kharbash 1
OBEUS (Object-Based Environment for Urban Simulation) Shareware Version Yaffo model is based on partition of the area into Voronoi polygons, which correspond to real-world houses; neighborhood relationship
More informationFast Multipole Methods: Fundamentals & Applications. Ramani Duraiswami Nail A. Gumerov
Fast Multipole Methods: Fundamentals & Applications Ramani Duraiswami Nail A. Gumerov Week 1. Introduction. What are multipole methods and what is this course about. Problems from physics, mathematics,
More informationAntonio Falabella. 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, September 2015, Hamburg
INFN - CNAF (Bologna) 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, 14-25 September 2015, Hamburg 1 / 44 Overview 1 2 3 4 5 2 / 44 to Computing The
More informationBinary addition by hand. Adding two bits
Chapter 3 Arithmetic is the most basic thing you can do with a computer We focus on addition, subtraction, multiplication and arithmetic-logic units, or ALUs, which are the heart of CPUs. ALU design Bit
More informationBinary addition example worked out
Binary addition example worked out Some terms are given here Exercise: what are these numbers equivalent to in decimal? The initial carry in is implicitly 0 1 1 1 0 (Carries) 1 0 1 1 (Augend) + 1 1 1 0
More informationCommunication-avoiding parallel and sequential QR factorizations
Communication-avoiding parallel and sequential QR factorizations James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou May 30, 2008 Abstract We present parallel and sequential dense QR factorization
More informationECE/CS 250 Computer Architecture
ECE/CS 250 Computer Architecture Basics of Logic Design: Boolean Algebra, Logic Gates (Combinational Logic) Tyler Bletsch Duke University Slides are derived from work by Daniel J. Sorin (Duke), Alvy Lebeck
More informationExploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units
Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern
More informationParallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationTrinity Christian School Curriculum Guide
Course Title: Calculus Grade Taught: Twelfth Grade Credits: 1 credit Trinity Christian School Curriculum Guide A. Course Goals: 1. To provide students with a familiarity with the properties of linear,
More informationUsing Kernel Couplings to Predict Parallel Application Performance
Using Kernel Couplings to Predict Parallel Application Performance Valerie Taylor, Xingfu Wu, Jonathan Geisler Department of Electrical and Computer Engineering, Northwestern University, Evanston IL 60208
More informationSCHUR-WEYL DUALITY FOR U(n)
SCHUR-WEYL DUALITY FOR U(n) EVAN JENKINS Abstract. These are notes from a lecture given in MATH 26700, Introduction to Representation Theory of Finite Groups, at the University of Chicago in December 2009.
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationOblivious Algorithms for Multicores and Network of Processors
Oblivious Algorithms for Multicores and Network of Processors Rezaul Alam Chowdhury, Francesco Silvestri, Brandon Blakeley and Vijaya Ramachandran Department of Computer Sciences University of Texas Austin,
More informationNumerical Linear Algebra
Numerical Linear Algebra By: David McQuilling; Jesus Caban Deng Li Jan.,31,006 CS51 Solving Linear Equations u + v = 8 4u + 9v = 1 A x b 4 9 u v = 8 1 Gaussian Elimination Start with the matrix representation
More informationCommunication avoiding parallel algorithms for dense matrix factorizations
Communication avoiding parallel dense matrix factorizations 1/ 44 Communication avoiding parallel algorithms for dense matrix factorizations Edgar Solomonik Department of EECS, UC Berkeley October 2013
More information1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007
Lecture 1 Introduction 1 Introduction 1.1 The Problem Domain Today, we are going to ask whether a system can recover from perturbation. Consider a children s top: If it is perfectly vertically, you can
More informationCS 542G: Conditioning, BLAS, LU Factorization
CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles
More informationA model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)
A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationLatches. October 13, 2003 Latches 1
Latches The second part of CS231 focuses on sequential circuits, where we add memory to the hardware that we ve already seen. Our schedule will be very similar to before: We first show how primitive memory
More informationParallelization of the QC-lib Quantum Computer Simulator Library
Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer September 9, 23 PPAM 23 1 Ian Glendinning / September 9, 23 Outline Introduction Quantum Bits, Registers
More informationEnergy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators
PARIS-SACLAY, FRANCE Energy Consumption Evaluation for Krylov Methods on a Cluster of GPU Accelerators Serge G. Petiton a and Langshi Chen b April the 6 th, 2016 a Université de Lille, Sciences et Technologies
More informationParallelization of the QC-lib Quantum Computer Simulator Library
Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/
More informationnumber to represent the problem.
Domain: Operations and Algebraic Thinking Cluster 1 MAFS.1.OA.1.1 Use addition and subtraction within 20 to solve word problems involving situations of adding to, taking from, putting together, taking
More informationRecitation 5: Elementary Matrices
Math 1b TA: Padraic Bartlett Recitation 5: Elementary Matrices Week 5 Caltech 2011 1 Random Question Consider the following two-player game, called Angels and Devils: Our game is played on a n n chessboard,
More informationDigital Systems EEE4084F. [30 marks]
Digital Systems EEE4084F [30 marks] Practical 3: Simulation of Planet Vogela with its Moon and Vogel Spiral Star Formation using OpenGL, OpenMP, and MPI Introduction The objective of this assignment is
More informationColumbus State Community College Mathematics Department Public Syllabus
Columbus State Community College Mathematics Department Public Syllabus Course and Number: MATH 2568 Elementary Linear Algebra Credits: 4 Class Hours Per Week: 4 Prerequisites: MATH 2153 with a C or higher
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation
EECS 6A Designing Information Devices and Systems I Fall 08 Lecture Notes Note. Positioning Sytems: Trilateration and Correlation In this note, we ll introduce two concepts that are critical in our positioning
More informationSpeculative Parallelism in Cilk++
Speculative Parallelism in Cilk++ Ruben Perez & Gregory Malecha MIT May 11, 2010 Ruben Perez & Gregory Malecha (MIT) Speculative Parallelism in Cilk++ May 11, 2010 1 / 33 Parallelizing Embarrassingly Parallel
More informationBlock AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark
Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise
More informationif b is a linear combination of u, v, w, i.e., if we can find scalars r, s, t so that ru + sv + tw = 0.
Solutions Review # Math 7 Instructions: Use the following problems to study for Exam # which will be held Wednesday Sept For a set of nonzero vectors u v w} in R n use words and/or math expressions to
More informationCoordinate Update Algorithm Short Course The Package TMAC
Coordinate Update Algorithm Short Course The Package TMAC Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 16 TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods C++11 multi-threading
More informationRevisiting the Cache Miss Analysis of Multithreaded Algorithms
Revisiting the Cache Miss Analysis of Multithreaded Algorithms Richard Cole Vijaya Ramachandran September 19, 2012 Abstract This paper concerns the cache miss analysis of algorithms when scheduled in work-stealing
More informationComputing least squares condition numbers on hybrid multicore/gpu systems
Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationA Brief Introduction to Model Checking
A Brief Introduction to Model Checking Jan. 18, LIX Page 1 Model Checking A technique for verifying finite state concurrent systems; a benefit on this restriction: largely automatic; a problem to fight:
More informationHeterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry
Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)
More informationParallel Sparse Matrix Vector Multiplication (PSC 4.3)
Parallel Sparse Matrix Vector Multiplication (PSC 4.) original slides by Rob Bisseling, Universiteit Utrecht, accompanying the textbook Parallel Scientific Computation adapted for the lecture HPC Algorithms
More informationEXAMPLES OF CLASSICAL ITERATIVE METHODS
EXAMPLES OF CLASSICAL ITERATIVE METHODS In these lecture notes we revisit a few classical fixpoint iterations for the solution of the linear systems of equations. We focus on the algebraic and algorithmic
More informationA Parallel Algorithm for Computing the Extremal Eigenvalues of Very Large Sparse Matrices*
A Parallel Algorithm for Computing the Extremal Eigenvalues of Very Large Sparse Matrices* Fredrik Manne Department of Informatics, University of Bergen, N-5020 Bergen, Norway Fredrik. Manne@ii. uib. no
More informationRound-off error propagation and non-determinism in parallel applications
Round-off error propagation and non-determinism in parallel applications Vincent Baudoui (Argonne/Total SA) vincent.baudoui@gmail.com Franck Cappello (Argonne/INRIA/UIUC-NCSA) Georges Oppenheim (Paris-Sud
More informationECE 250 / CPS 250 Computer Architecture. Basics of Logic Design Boolean Algebra, Logic Gates
ECE 250 / CPS 250 Computer Architecture Basics of Logic Design Boolean Algebra, Logic Gates Benjamin Lee Slides based on those from Andrew Hilton (Duke), Alvy Lebeck (Duke) Benjamin Lee (Duke), and Amir
More informationEXPLOITING RESIDUE NUMBER SYSTEM FOR POWER-EFFICIENT DIGITAL SIGNAL PROCESSING IN EMBEDDED PROCESSORS
EXPLOITING RESIDUE NUMBER SYSTEM FOR POWER-EFFICIENT DIGITAL SIGNAL PROCESSING IN EMBEDDED PROCESSORS Rooju Chokshi 1, Krzysztof S. Berezowski 2,3, Aviral Shrivastava 2, Stanisław J. Piestrak 4 1 Microsoft
More informationA Quantitative Measure of Portability with Application to Bandwidth-Latency Models for Parallel Computing?
A Quantitative Measure of Portability with Application to Bandwidth-Latency Models for Parallel Computing? (Extended Abstract) Gianfranco Bilardi, Andrea Pietracaprina, and Geppino Pucci Dipartimento di
More informationParallel Numerics. Scope: Revise standard numerical methods considering parallel computations!
Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:
More informationA CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method
A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2 1 ECE, Georgia Tech 2 CSE, Georgia
More informationSoftware Engineering 2DA4. Slides 8: Multiplexors and More
Software Engineering 2DA4 Slides 8: Multiplexors and More Dr. Ryan Leduc Department of Computing and Software McMaster University Material based on S. Brown and Z. Vranesic, Fundamentals of Digital Logic
More informationPerformance Analysis of Lattice QCD Application with APGAS Programming Model
Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models
More informationCommunicating Parallel Processes. Stephen Brookes
Communicating Parallel Processes Stephen Brookes Carnegie Mellon University Deconstructing CSP 1 CSP sequential processes input and output as primitives named parallel composition synchronized communication
More informationA Note on S-Packing Colorings of Lattices
A Note on S-Packing Colorings of Lattices Wayne Goddard and Honghai Xu Department of Mathematical Sciences Clemson University, Clemson SC 964 {goddard,honghax}@clemson.edu Abstract Let a, a,..., a k be
More informationReal Time Operating Systems
Real Time Operating ystems Luca Abeni luca.abeni@unitn.it Interacting Tasks Until now, only independent tasks... A job never blocks or suspends A task only blocks on job termination In real world, jobs
More informationDiscrete Tranformation of Output in Cellular Automata
Discrete Tranformation of Output in Cellular Automata Aleksander Lunøe Waage Master of Science in Computer Science Submission date: July 2012 Supervisor: Gunnar Tufte, IDI Norwegian University of Science
More informationA New Task Model and Utilization Bound for Uniform Multiprocessors
A New Task Model and Utilization Bound for Uniform Multiprocessors Shelby Funk Department of Computer Science, The University of Georgia Email: shelby@cs.uga.edu Abstract This paper introduces a new model
More informationPartition is reducible to P2 C max. c. P2 Pj = 1, prec Cmax is solvable in polynomial time. P Pj = 1, prec Cmax is NP-hard
I. Minimizing Cmax (Nonpreemptive) a. P2 C max is NP-hard. Partition is reducible to P2 C max b. P Pj = 1, intree Cmax P Pj = 1, outtree Cmax are both solvable in polynomial time. c. P2 Pj = 1, prec Cmax
More informationBuilding a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI
Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI Charles Lo and Paul Chow {locharl1, pc}@eecg.toronto.edu Department of Electrical and Computer Engineering
More informationExploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning
Exploiting Low-Rank Structure in Computing Matrix Powers with Applications to Preconditioning Erin C. Carson, Nicholas Knight, James Demmel, Ming Gu U.C. Berkeley SIAM PP 12, Savannah, Georgia, USA, February
More informationOn the design of parallel linear solvers for large scale problems
On the design of parallel linear solvers for large scale problems Journée problème de Poisson, IHP, Paris M. Faverge, P. Ramet M. Faverge Assistant Professor Bordeaux INP LaBRI Inria Bordeaux - Sud-Ouest
More informationDesigning Information Devices and Systems I Fall 2018 Lecture Notes Note Positioning Sytems: Trilateration and Correlation
EECS 6A Designing Information Devices and Systems I Fall 08 Lecture Notes Note. Positioning Sytems: Trilateration and Correlation In this note, we ll introduce two concepts that are critical in our positioning
More informationMAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012
MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7
More informationProbability Propagation
Graphical Models, Lectures 9 and 10, Michaelmas Term 2009 November 13, 2009 Characterizing chordal graphs The following are equivalent for any undirected graph G. (i) G is chordal; (ii) G is decomposable;
More information