Parallel Programming (1/28)
pauldj@aices.rwth-aachen.de
Collective Communication (2/28)
Barrier, Broadcast, Reduce, Scatter, Gather, Allgather, Reduce-scatter, Allreduce, Alltoall.
References:
- Collective Communication: Theory, Practice, and Experience. Chan, Heimlich, Purkayastha, van de Geijn (FLAME working note #22).
- Collective Communications in MPI.
Collective Communication (3/28)
Synchronization: Barrier (almost never needed!)
Data movement: Broadcast, Scatter, Gather, Allgather, Alltoall
Reductions: Reduce, Reduce-scatter, Allreduce, Scan, ...
For all collectives: no tags; blocking.
MPI_Bcast (4/28)
int MPI_Bcast(...)
Before: the root holds α. After: all p processes hold α.
MPI_Reduce (5/28)
int MPI_Reduce(...)
Before: process i holds δ_i (i = 0, ..., p-1). After: the root holds Op_{i=0}^{p-1} δ_i.
MPI_Op: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, ...
MPI_Datatype: MPI_CHAR, MPI_INT, MPI_UNSIGNED, MPI_FLOAT, MPI_DOUBLE, ...
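The Reduce semantics above can be modeled in a few lines of plain Python (no MPI involved; `mpi_reduce_model` and its argument names are illustrative, not part of any MPI API). Each "process" i contributes δ_i, and after the call only the root holds the combined value:

```python
from functools import reduce as fold

def mpi_reduce_model(delta, op, root):
    """Per-process buffers after a Reduce: only `root` gets Op over all delta_i."""
    result = fold(op, delta)          # Op_{i=0}^{p-1} delta_i
    after = [None] * len(delta)
    after[root] = result              # non-root buffers are unchanged/undefined
    return after

# MPI_SUM and MPI_MAX correspond to these Python operators:
buffers = mpi_reduce_model([3, 1, 4, 1], lambda a, b: a + b, root=0)
```

With `root=0` and the sum operator, `buffers` is `[9, None, None, None]`: the reduction lands only on the root, unlike Allreduce below.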
MPI_Scatter (6/28)
int MPI_Scatter(...)
Before: the root holds v[0], v[1], v[2], v[3]. After: process i holds v[i].
MPI_Gather (7/28)
int MPI_Gather(...)
Before: process i holds v[i]. After: the root holds v[0], v[1], v[2], v[3].
MPI_Allgather (8/28)
int MPI_Allgather(...)
Before: process i holds v[i]. After: every process holds v[0], v[1], v[2], v[3].
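A minimal sketch of the Allgather semantics, again as a pure-Python model (the name `allgather_model` is illustrative): every rank starts with its own chunk and ends with a copy of the full, rank-ordered vector.

```python
def allgather_model(chunks):
    """chunks[i] is what rank i contributes; returns each rank's buffer after."""
    full = list(chunks)                      # concatenation in rank order
    return [list(full) for _ in chunks]      # every rank receives a full copy

after = allgather_model(["v0", "v1", "v2", "v3"])
```

After the call, `after[i] == ["v0", "v1", "v2", "v3"]` for every rank i, which is exactly the "After" picture on this slide.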
MPI_Reduce_scatter (9/28)
int MPI_Reduce_scatter(...)
Before: process i holds the vector v_i[0], v_i[1], v_i[2], v_i[3]. After: process j holds Op_i v_i[j].
MPI_Allreduce (10/28)
int MPI_Allreduce(...)
Before: process i holds δ_i. After: every process holds Op_{i=0}^{p-1} δ_i.
MPI_Alltoall (11/28)
int MPI_Alltoall(...)
Before: process i holds the vector v_i[0], ..., v_i[3]. After: process j holds v_0[j], v_1[j], v_2[j], v_3[j] (a distributed transpose).
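The "distributed transpose" reading of Alltoall can be checked directly with a small model (pure Python, `alltoall_model` is an illustrative name): rank j ends up with element j of every rank's send buffer, so applying the operation twice returns the original layout.

```python
def alltoall_model(send):
    """send[i][j] = what rank i sends to rank j; returns each rank's recv buffer."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]

recv = alltoall_model([[f"v{i}[{j}]" for j in range(4)] for i in range(4)])
```

Here `recv[1]` is `["v0[1]", "v1[1]", "v2[1]", "v3[1]"]`, matching the slide's "After" row, and `alltoall_model(alltoall_model(m)) == m` for any square `m` (transpose is an involution).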
More Collectives (12/28)
Variable length: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv
Partial reduction: MPI_Scan
Non-blocking collectives: MPI_I*
Multiple communicators:
Example 1) World (everybody) + Workers (everybody - masters) + Masters (everybody - workers)
Example 2) r × c mesh of processes; communication within rows/columns only
MPI_Comm_create(...), MPI_Comm_split(...), ...
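For the r × c mesh example, MPI_Comm_split groups ranks by a `color` and orders each group by a `key`. The grouping logic can be sketched without MPI (the helper `comm_split_model` is illustrative; real MPI returns communicator handles, not lists):

```python
def comm_split_model(ranks, color, key):
    """Group `ranks` by color(rank); order each group by key(rank)."""
    groups = {}
    for rank in ranks:
        groups.setdefault(color(rank), []).append(rank)
    return {c: sorted(g, key=key) for c, g in groups.items()}

r, c = 2, 3
ranks = range(r * c)  # row-major placement: rank = i*c + j
rows = comm_split_model(ranks, color=lambda rk: rk // c, key=lambda rk: rk % c)
cols = comm_split_model(ranks, color=lambda rk: rk % c, key=lambda rk: rk // c)
```

With row-major ranks, `rows` is `{0: [0, 1, 2], 1: [3, 4, 5]}` and `cols` is `{0: [0, 3], 1: [1, 4], 2: [2, 5]}`: one communicator per row and per column, as the slide requires.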
Collective Communication: Lower Bounds (13/28)
Cost of communication: α + nβ. Cost of computation: γ · #ops.
α = latency (startup), β = 1/bandwidth, γ = cost of 1 flop, n = length of the message, p = # of processes.

Primitive        Latency      Bandwidth        Computation
Broadcast        log2(p) α    n β              -
Reduce           log2(p) α    n β              ((p-1)/p) n γ
Scatter          log2(p) α    ((p-1)/p) n β    -
Gather           log2(p) α    ((p-1)/p) n β    -
Allgather        log2(p) α    ((p-1)/p) n β    -
Reduce-scatter   log2(p) α    ((p-1)/p) n β    ((p-1)/p) n γ
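A few rows of the lower-bound table written out as a cost model makes the α/β/γ terms concrete. This is a sketch with made-up machine parameters; the function names are illustrative, and the formulas are exactly the table rows above:

```python
from math import log2

def bcast_cost(p, n, alpha, beta):
    """Broadcast lower bound: log2(p)*alpha + n*beta."""
    return log2(p) * alpha + n * beta

def reduce_cost(p, n, alpha, beta, gamma):
    """Reduce lower bound: log2(p)*alpha + n*beta + ((p-1)/p)*n*gamma."""
    return log2(p) * alpha + n * beta + (p - 1) / p * n * gamma

def scatter_cost(p, n, alpha, beta):
    """Scatter lower bound: log2(p)*alpha + ((p-1)/p)*n*beta."""
    return log2(p) * alpha + (p - 1) / p * n * beta
```

For example, with p = 8, n = 100, α = 1.0, β = 0.1 (arbitrary units), broadcast costs 3α + 10 = 13.0 while scatter costs 3α + (7/8)·10 = 11.75, showing why the scatter bandwidth term is slightly cheaper.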
Implementation of Bcast and Reduce (14/28)
IDEA: recursive doubling / Minimum Spanning Tree (MST). At each step, double the number of active processes.
How to map the idea to the specific topology?
- ring: linear doubling
- (2d) mesh: 1 dimension first, then another, then another, ...
- hypercube: obvious, same as mesh
Cost? # steps: log2(p); cost(step): α + nβ; total time: log2(p)α + log2(p)nβ.
Lower bound: log2(p)α + nβ. Note: cost(p²) = 2·cost(p)!
Reduce = Bcast in reverse; cost(computation)?
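The doubling idea can be simulated directly: starting from the root, every process that already holds the data sends it to one that does not, so the set of holders doubles each step until all p are covered. A pure-Python sketch (`mst_bcast` is an illustrative name, and the partner choice `q + size` is one valid MST mapping, not the only one):

```python
def mst_bcast(p):
    """Simulate MST broadcast; return (#steps, set of processes holding data)."""
    has = {0}                 # only the root holds the data initially
    steps = 0
    while len(has) < p:
        size = len(has)
        # each current holder q sends to partner q + size (if it exists)
        has |= {q + size for q in has if q + size < p}
        steps += 1
    return steps, has
```

For p = 8 this takes exactly 3 = log2(8) steps; for p = 5 it takes ceil(log2(5)) = 3 steps, confirming the log2(p) step count claimed above.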
Implementation of Gather (and Scatter) (15/28)
IDEA: MST again. At step i, only a 1/2^i-th of the message is sent.
# steps: log2(p)
cost(step i): α + (n/2^i)β
total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + ((p-1)/p)nβ
lower bound: log2(p)α + ((p-1)/p)nβ — optimal!
Implementation of Allgather (and Reduce-scatter) (16/28)
IDEA: recursive doubling (bidirectional exchange). Recursive allgather of half the data + exchange of data between disjoint nodes.
# steps: log2(p)
total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + ((p-1)/p)nβ
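The bidirectional exchange can be simulated for a power-of-two p (an assumption of this sketch; `rd_allgather` is an illustrative name): at step k, each rank swaps everything it owns with the partner obtained by flipping bit k of its rank, so the owned data doubles per step.

```python
def rd_allgather(p):
    """Simulate recursive-doubling allgather; p must be a power of two."""
    owned = [{i} for i in range(p)]       # rank i starts with chunk i
    steps = 0
    while any(len(o) < p for o in owned):
        d = 1 << steps                    # partner distance doubles each step
        owned = [owned[i] | owned[i ^ d] for i in range(p)]
        steps += 1
    return steps, owned
```

For p = 8 every rank holds all 8 chunks after exactly 3 = log2(8) exchange steps, matching the step count above (the per-step message size, not simulated here, doubles as in the cost formula).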
Another Implementation of Allgather (17/28)
IDEA: cyclic (ring) algorithm.
# steps: p - 1
total time: Σ_{i=1}^{p-1} (α + (n/p)β) = (p-1)α + ((p-1)/p)nβ
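The cyclic algorithm pipelines chunks around a ring: in each of the p-1 steps, rank i forwards to rank i+1 the chunk it received in the previous step. A simulation sketch (`ring_allgather` is an illustrative name):

```python
def ring_allgather(p):
    """Simulate the cyclic allgather; return (#steps, owned chunks per rank)."""
    owned = [{i} for i in range(p)]
    in_flight = list(range(p))            # chunk each rank will send next
    steps = 0
    for _ in range(p - 1):
        # rank i receives from rank i-1 (mod p) ...
        received = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            owned[i].add(received[i])
        in_flight = received              # ... and forwards it next step
        steps += 1
    return steps, owned
```

For p = 5 this takes 4 = p-1 steps, each moving one n/p-sized chunk per link: more latency steps than recursive doubling, but the same total bandwidth term, which is why it is attractive when α is small relative to nβ.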
Another Implementation of Bcast (18/28)
IDEA: Scatter + cyclic algorithm (allgather).
Dense Matrix-Vector Product: Scalability (19/28)
1D matrix distribution. y := Ax, with x, y ∈ R^n, A ∈ R^{n×n}.
A is partitioned by rows and distributed over p processes: process i owns the block of rows A_i. Similarly for x: process i owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0; A_1; ...; A_{p-1}], x = [x_0; x_1; ...; x_{p-1}], and y = [y_0; y_1; ...; y_{p-1}]
Note: A_i, x_i and y_i indicate a block of rows, as opposed to a single one.
Dense Matrix-Vector Product: Scalability (20/28)
1D matrix distribution. Algorithm:
1. x = Allgather(x_i)  — x becomes available to every process
2. y_i = A_i x         — local computation
Parallel cost (lower bound for T_p(n)):
1. log2(p)α + ((p-1)/p)nβ ≈ log2(p)α + nβ
2. 2(n²/p)γ
Sequential cost: T_1(n) = 2n²γ
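The two-step algorithm can be sketched with plain Python lists (no MPI; `gemv_1d_rows` and its arguments are illustrative names): p ranks each own a block of rows of A and a chunk of x, allgather the full x, then multiply locally.

```python
def gemv_1d_rows(A_blocks, x_chunks):
    """A_blocks[i] = rows owned by rank i; x_chunks[i] = x block of rank i."""
    x = [xj for chunk in x_chunks for xj in chunk]      # step 1: Allgather(x_i)
    y_chunks = []
    for A_i in A_blocks:                                # step 2: y_i = A_i x
        y_chunks.append([sum(a * xj for a, xj in zip(row, x)) for row in A_i])
    return y_chunks

A = [[1, 2, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
y = gemv_1d_rows([A[0:2], A[2:4]], [[1, 1], [1, 1]])    # p = 2 ranks
```

With this A and x = (1, 1, 1, 1), the result is y = (3, 1, 1, 1), distributed as [[3, 1], [1, 1]]: each rank ends up owning the chunk of y that matches its chunk of x, as the goal requires.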
GEMV: Scalability (21/28)
1D matrix distribution.
Speedup: S_p(n) = T_1(n)/T_p(n) = 2n²γ / (log2(p)α + nβ + 2(n²/p)γ)
Efficiency: E_p(n) = S_p(n)/p = 1 / (1 + (p·log2(p)/(2n²))·(α/γ) + (p/(2n))·(β/γ))
Strong scalability: lim_{p→∞} E_p(n) = 0
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes; n_M² = Mp.
lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + (log2(p)/(2M))·(α/γ) + (√p/(2√M))·(β/γ)) = 0
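Evaluating the efficiency formula numerically makes the strong-scaling conclusion visible. This is a sketch with invented machine parameters (α, β, γ values below are illustrative, not measured):

```python
from math import log2

def efficiency_1d(p, n, alpha, beta, gamma):
    """E_p(n) = T_1 / (p * T_p) for the 1D row-distributed GEMV."""
    t1 = 2 * n * n * gamma
    tp = log2(p) * alpha + n * beta + 2 * n * n / p * gamma
    return t1 / (p * tp)

# Strong scaling: fixed n = 1000, growing p drives efficiency toward zero.
e = [efficiency_1d(p, 1000, 1e-6, 1e-9, 1e-10) for p in (2, 16, 128, 1024)]
```

With these (arbitrary) parameters the efficiencies decrease monotonically as p grows, which is the lim_{p→∞} E_p(n) = 0 statement above in numbers: for fixed n, the log2(p)α and nβ terms eventually dominate the shrinking 2n²γ/p work term.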
Exercise: y = Ax, A distributed by columns (22/28)
A is partitioned by columns and distributed over p processes: process i owns the block of columns A_i, and process i owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0 A_1 ... A_{p-1}], x = [x_0; x_1; ...; x_{p-1}], and y = [y_0; y_1; ...; y_{p-1}]
Scalability (23/28)
Algorithm:
1. y^(i) = A_i x_i            — local computation
2. y = Reduce-scatter(y^(i))  — reduction-sum of the y^(i)'s + scatter
Parallel cost:
1. 2(n²/p)γ
2. log2(p)α + ((p-1)/p)nβ + ((p-1)/p)nγ ≈ log2(p)α + n(β + γ)
Lower bound for T_p(n) = log2(p)α + n(β + γ) + 2(n²/p)γ
T_p(n) has one extra term (nγ) with respect to the case of A partitioned by rows; therefore this algorithm is also not scalable.
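The column-distributed algorithm can be sketched the same way as the row-distributed one (pure Python; `gemv_1d_cols` is an illustrative name): each rank forms a full-length partial product y^(i) = A_i x_i, then a reduce-scatter sums the partials elementwise and leaves rank i with its chunk of y.

```python
def gemv_1d_cols(A_col_blocks, x_chunks):
    """A_col_blocks[i] = n x (n/p) column block of rank i; x_chunks[i] = its x."""
    n = len(A_col_blocks[0])
    p = len(x_chunks)
    # step 1: local partial products, each of full length n
    partials = [[sum(row[k] * x_i[k] for k in range(len(x_i))) for row in A_i]
                for A_i, x_i in zip(A_col_blocks, x_chunks)]
    # step 2: reduce-scatter -- elementwise sum, then chunk j goes to rank j
    y = [sum(part[j] for part in partials) for j in range(n)]
    nb = n // p
    return [y[i * nb:(i + 1) * nb] for i in range(p)]

A_0 = [[1, 2], [0, 1], [0, 0], [0, 0]]   # columns 0-1 of the 4x4 example A
A_1 = [[0, 0], [0, 0], [1, 0], [0, 1]]   # columns 2-3
y = gemv_1d_cols([A_0, A_1], [[1, 1], [1, 1]])
```

The result matches the row-distributed version, y = (3, 1, 1, 1) as [[3, 1], [1, 1]], but note that step 2 here models the extra nγ reduction work that makes this variant costlier.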
GEMV: Scalability (24/28)
2D matrix distribution. A is partitioned according to a 2D p = r × c mesh; the (i, j) process (P_ij) owns the block A_ij. x and y are partitioned in p chunks; x is mapped to the mesh by columns, y by rows.
Example: p = 2 × 3.
A = [A_00 A_01 A_02; A_10 A_11 A_12], x = [x_0; x_1; ...; x_5], and y = [y_0; y_1; ...; y_5]
P_ij owns A_ij.
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5.
y: P_00 owns y_0, P_01 owns y_1, P_02 owns y_2, P_10 owns y_3, P_11 owns y_4, P_12 owns y_5.
GEMV: Scalability (25/28)
2D matrix distribution. Algorithm:
1. x_I = Allgather(x_i) within columns  — x_I is a block
2. y_J = A_ij x_I                       — local computation
3. y_j = Reduce-scatter(y_J) within rows
Parallel cost (lower bound for T_p(n)):
1. log2(r)α + ((r-1)/p)nβ ≈ log2(r)α + (n/c)β
2. 2(n²/p)γ
3. log2(c)α + ((c-1)/p)nβ + ((c-1)/p)nγ ≈ log2(c)α + (n/r)β + (n/r)γ
Sequential cost: T_1(n) = 2n²γ
GEMV: Scalability (26/28)
2D matrix distribution. Parallel cost, assuming r = c = √p:
T_p(n) = 2(n²/p)γ + (log2(r) + log2(c))α + (n/c + n/r)β + (n/r)γ = 2(n²/p)γ + log2(p)α + (n/√p)(2β + γ)
Speedup: S_p(n) = 2n²γ / T_p(n)
Efficiency: E_p(n) = S_p(n)/p = 1 / (1 + (p·log2(p)/(2n²))·(α/γ) + (√p/(2n))·((2β+γ)/γ))
Strong scalability: lim_{p→∞} E_p(n) = 0
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes; n_M² = Mp.
lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + (log2(p)/(2M))·(α/γ) + (1/(2√M))·((2β+γ)/γ)) = 0
(the bandwidth term is now bounded; only the log2(p) latency term still grows, and only logarithmically)
Exercise (27/28)
2D matrix distribution. A is partitioned according to a 2D p = r × c mesh; the (i, j) process (P_ij) owns the block A_ij. Both x and y are partitioned in p chunks, and mapped to the mesh by columns.
Example: p = 2 × 3.
A = [A_00 A_01 A_02; A_10 A_11 A_12], x = [x_0; x_1; ...; x_5], and y = [y_0; y_1; ...; y_5]
P_ij owns A_ij.
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5.
y: P_00 owns y_0, P_10 owns y_1, P_01 owns y_2, P_11 owns y_3, P_02 owns y_4, P_12 owns y_5.
Parallel Matrix Distribution (28/28)
How to store a (large) matrix using p = r × c processes?
- 1D, block of rows (or columns): bad idea
- 1D (block) cyclic, either by rows or columns: bad idea
- 2D, r × c quadrants: not so good
- 2D (block) cyclic: good idea!
References:
- Introduction to High-Performance Scientific Computing, Victor Eijkhout (free download).
- A Comprehensive Approach to Parallel Linear Algebra Libraries (Technical Report, University of Texas), Chapter 3.
Tradeoffs between synchronization, communication, and work in parallel linear algebra computations Edgar Solomonik, Erin Carson, Nicholas Knight, and James Demmel Department of EECS, UC Berkeley February,
More informationScientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix
Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008
More informationStrassen s Algorithm for Tensor Contraction
Strassen s Algorithm for Tensor Contraction Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn The University of Texas at Austin September 14-15, 2017 Tensor Computation Workshop Flatiron Institute,
More informationDivisible Load Scheduling
Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute
More informationChapter 3 - From Gaussian Elimination to LU Factorization
Chapter 3 - From Gaussian Elimination to LU Factorization Maggie Myers Robert A. van de Geijn The University of Texas at Austin Practical Linear Algebra Fall 29 http://z.cs.utexas.edu/wiki/pla.wiki/ 1
More informationImage Reconstruction And Poisson s equation
Chapter 1, p. 1/58 Image Reconstruction And Poisson s equation School of Engineering Sciences Parallel s for Large-Scale Problems I Chapter 1, p. 2/58 Outline 1 2 3 4 Chapter 1, p. 3/58 Question What have
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More information2. One-To-All Broadcast and All-To-One Reduction. 1. Chapter 4 : Efficient Collective Communication
1. Chater : Efficient Collective Communication Collective communication: comm amongst collection of nodes (not just sender & recver. One-to-all (bcast, all-to-one (reduc, all-to-all, scatter/gather, etc.
More informationV. Adamchik 1. Recurrences. Victor Adamchik Fall of 2005
V. Adamchi Recurrences Victor Adamchi Fall of 00 Plan Multiple roots. More on multiple roots. Inhomogeneous equations 3. Divide-and-conquer recurrences In the previous lecture we have showed that if the
More informationDetermine the size of an instance of the minimum spanning tree problem.
3.1 Algorithm complexity Consider two alternative algorithms A and B for solving a given problem. Suppose A is O(n 2 ) and B is O(2 n ), where n is the size of the instance. Let n A 0 be the size of the
More informationCALU: A Communication Optimal LU Factorization Algorithm
CALU: A Communication Optimal LU Factorization Algorithm James Demmel Laura Grigori Hua Xiang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-9
More information