Parallel Programming


Parallel Programming (1/28)
pauldj@aices.rwth-aachen.de

Collective Communication (2/28)
Barrier, Broadcast, Reduce, Scatter, Gather, Allgather, Reduce-scatter, Allreduce, Alltoall.
References:
  Collective Communication: Theory, Practice, and Experience. Chan, Heimlich, Purkayastha, van de Geijn (FLAME Working Note #22).
  Collective Communications in MPI.

Collective Communication (3/28)
Synchronization: Barrier (almost never needed!)
Data movement: Broadcast, Scatter, Gather, Allgather, Alltoall
Reductions: Reduce, Reduce-scatter, Allreduce, Scan, ...
For all collectives: no tags; blocking.

int MPI_Bcast(...)   (4/28)
Before: the root holds α.
After: every process holds α.
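A minimal usage sketch of the call above (not from the slides; the root rank 0, the variable name alpha, and the value 3.14 are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double alpha = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) alpha = 3.14;      /* only the root has the value */
        /* root = 0 broadcasts one double; afterwards all ranks hold the same alpha */
        MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        printf("rank %d: alpha = %f\n", rank, alpha);
        MPI_Finalize();
        return 0;
    }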

int MPI_Reduce(...)   (5/28)
Before: δ_0, δ_1, δ_2, δ_3 (one value per process).
After: the root holds Op_{i=0}^{p-1} δ_i.
MPI_Op: MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, ...
MPI_Datatype: MPI_CHAR, MPI_INT, MPI_UNSIGNED, MPI_FLOAT, MPI_DOUBLE, ...
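A corresponding sketch (fragment; assumes MPI has been initialized and each rank holds a local value delta; names and values are illustrative):

    double delta = 1.0;     /* local contribution, e.g. a partial dot product */
    double total = 0.0;
    /* sum the p local values; only rank 0 (the root) receives the result */
    MPI_Reduce(&delta, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);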

int MPI_Scatter(...)   (6/28)
Before: the root holds v[0], v[1], v[2], v[3].
After: process i holds v[i].
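A sketch of a scatter of one element per rank (fragment; assumes MPI is initialized and p <= 64; the array v is illustrative). The same buffers, with the send/receive roles swapped, give the matching MPI_Gather:

    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double v[64];                          /* full array, meaningful only on the root */
    if (rank == 0)
        for (int i = 0; i < p; i++) v[i] = (double)i;

    double mine;                           /* each rank receives one element */
    MPI_Scatter(v, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    /* inverse: MPI_Gather(&mine, 1, MPI_DOUBLE, v, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD); */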

int MPI_Gather(...)   (7/28)
Before: process i holds v[i].
After: the root holds v[0], v[1], v[2], v[3].

int MPI_Allgather(...)   (8/28)
Before: process i holds v[i].
After: every process holds v[0], v[1], v[2], v[3].
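A sketch (fragment; assumes MPI is initialized and p <= 64; buffer names are illustrative):

    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double mine = (double)rank;   /* each rank contributes one element      */
    double all[64];               /* receives p elements, ordered by rank   */
    MPI_Allgather(&mine, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    /* now all[i] == i on every rank */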

int MPI_Reduce_scatter(...)   (9/28)
Before: process i holds a vector v_i[0], v_i[1], v_i[2], v_i[3].
After: process j holds Op_i v_i[j].

int MPI_Allreduce(...)   (10/28)
Before: δ_0, δ_1, δ_2, δ_3 (one value per process).
After: every process holds Op_{i=0}^{p-1} δ_i.
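A sketch (fragment; assumes MPI is initialized; the local value is illustrative):

    double delta = 1.0;           /* local contribution */
    double total;
    /* like MPI_Reduce, but every rank receives the combined result */
    MPI_Allreduce(&delta, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);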

int MPI_Alltoall(...)   (11/28)
Before: process i holds v_i[0], v_i[1], ..., v_i[p-1].
After: process j holds v_0[j], v_1[j], ..., v_{p-1}[j] (the data is transposed across processes).
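A sketch with one integer per process pair, i.e. a p x p transpose (fragment; assumes MPI is initialized and p <= 64):

    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sendbuf[64], recvbuf[64];
    for (int j = 0; j < p; j++) sendbuf[j] = rank * 100 + j;   /* v_rank[j] */
    /* rank i sends sendbuf[j] to rank j; rank j receives it in recvbuf[i]  */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* now recvbuf[i] == i * 100 + rank on every rank */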

More Collectives, More Communicators (12/28)
Variable length: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv, MPI_Alltoallv
Partial reduction: MPI_Scan
Non-blocking collectives: MPI_I*
Multiple communicators, created with MPI_Comm_create(...), MPI_Comm_split(...), ...
  Example 1) World (everybody) + Workers (everybody minus the masters) + Masters (everybody minus the workers)
  Example 2) r x c mesh of processes; communication within rows/columns only (see the sketch below)
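A sketch of Example 2 with MPI_Comm_split; the mesh width c and the row-major rank-to-mesh mapping are assumptions made for the illustration:

    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int c = 3;                 /* assumed mesh width; p must equal r * c      */
    int row = rank / c;        /* row-major placement of ranks on the mesh    */
    int col = rank % c;

    MPI_Comm row_comm, col_comm;
    /* ranks with the same color end up in the same new communicator;
       the key orders the ranks inside it                                     */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* my row          */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* my column       */

    /* collectives on row_comm / col_comm now involve only that row / column  */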

Collective Communication: Lower Bounds (13/28)
Cost of communication: α + nβ. Cost of computation: γ · #ops.
α = latency (startup), β = 1/bandwidth, γ = cost of one flop, n = length of the message, p = # of processes.

Primitive         Latency        Bandwidth         Computation
Broadcast         log2(p) α      n β               -
Reduce            log2(p) α      n β               (p-1)/p n γ
Scatter           log2(p) α      (p-1)/p n β       -
Gather            log2(p) α      (p-1)/p n β       -
Allgather         log2(p) α      (p-1)/p n β       -
Reduce-scatter    log2(p) α      (p-1)/p n β       (p-1)/p n γ

Implementation of Bcast and Reduce (14/28)
IDEA: recursive doubling / minimum spanning tree (MST). At each step, double the number of active processes.
How to map the idea to the specific topology?
  ring: linear doubling
  (2d) mesh: one dimension first, then another, then another, ...
  hypercube: obvious, same as the mesh
Cost?
  # steps: log2(p)
  cost per step: α + nβ
  total time: log2(p)α + log2(p)nβ
  lower bound: log2(p)α + nβ
  note: cost(p²) = 2·cost(p)!
Reduce: a Bcast in reverse; what is the cost of the computation?
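A sketch of the MST / recursive-doubling broadcast written with point-to-point messages, assuming root 0; this is an illustration of the idea, not the slides' or any MPI library's implementation:

    #include <mpi.h>

    /* Broadcast n doubles from rank 0: after step k, the 2^(k+1) lowest ranks
       hold the data, so the number of active processes doubles each step. */
    void mst_bcast(double *buf, int n, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        for (int dist = 1; dist < p; dist *= 2) {
            if (rank < dist) {
                int dest = rank + dist;            /* active ranks send ahead    */
                if (dest < p)
                    MPI_Send(buf, n, MPI_DOUBLE, dest, 0, comm);
            } else if (rank < 2 * dist) {
                int src = rank - dist;             /* newly activated ranks recv */
                MPI_Recv(buf, n, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
            }
        }
    }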

Implementation of Gather (and Scatter) (15/28)
IDEA: MST again. At step i, only 1/2^i of the message is sent.
  # steps: log2(p)
  cost(step i): α + (n/2^i)β
  total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + (p-1)/p nβ
  lower bound: log2(p)α + (p-1)/p nβ   optimal!

Implementation of Allgather (and Reduce-scatter) (16/28)
IDEA: recursive doubling (bidirectional exchange). Recursive allgather of half the data + exchange of data between disjoint pairs of nodes.
(Diagram: starting from one block v[i] per process, after log2(p) pairwise exchanges every process holds v[0], v[1], v[2], v[3].)
  # steps: log2(p)
  total time: Σ_{i=1}^{log2(p)} (α + (n/2^i)β) = log2(p)α + (p-1)/p nβ
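A sketch of the bidirectional-exchange allgather, assuming p is a power of two and one contiguous block of blk doubles per process (illustration only):

    #include <mpi.h>
    #include <string.h>

    /* all[] has room for p blocks of blk doubles; mine[] is my own block. */
    void rd_allgather(const double *mine, double *all, int blk, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        memcpy(all + rank * blk, mine, blk * sizeof(double));

        for (int dist = 1; dist < p; dist *= 2) {
            int partner = rank ^ dist;                  /* bidirectional exchange    */
            int my_first   = (rank    / dist) * dist;   /* blocks I currently hold   */
            int peer_first = (partner / dist) * dist;   /* blocks the partner holds  */
            MPI_Sendrecv(all + my_first   * blk, dist * blk, MPI_DOUBLE, partner, 0,
                         all + peer_first * blk, dist * blk, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }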

Another Implementation of Allgather (17/28)
IDEA: cyclic (ring) algorithm; at each step every process passes one block to its neighbour.
(Diagram: after p-1 ring steps, every process holds v[0], v[1], v[2], v[3].)
  # steps: p-1
  total time: Σ_{i=1}^{p-1} (α + (n/p)β) = (p-1)α + (p-1)/p nβ
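A sketch of the cyclic (ring) allgather under the same assumptions on the buffers (illustration only):

    #include <mpi.h>
    #include <string.h>

    /* all[] has room for p blocks of blk doubles; mine[] is my own block. */
    void ring_allgather(const double *mine, double *all, int blk, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int right = (rank + 1) % p;
        int left  = (rank - 1 + p) % p;

        memcpy(all + rank * blk, mine, blk * sizeof(double));
        for (int i = 0; i < p - 1; i++) {
            /* step i: forward the block that originated at rank - i and
               receive the block that originated at rank - i - 1            */
            int send_idx = (rank - i + p) % p;
            int recv_idx = (rank - i - 1 + 2 * p) % p;
            MPI_Sendrecv(all + send_idx * blk, blk, MPI_DOUBLE, right, 0,
                         all + recv_idx * blk, blk, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }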

Another Implementation of Bcast (18/28)
IDEA: scatter the message, then complete it with the cyclic (allgather) algorithm.

Dense Matrix-Vector Product: Scalability (19/28)
1D matrix distribution.
y := Ax, with x, y ∈ R^n and A ∈ R^(n×n).
A is partitioned by rows and distributed over p processes: process i owns the block of rows A_i. Similarly for x: process i owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0; A_1; ...; A_{p-1}],  x = [x_0; x_1; ...; x_{p-1}],  y = [y_0; y_1; ...; y_{p-1}].
Note: A_i, x_i and y_i indicate blocks of rows, as opposed to single ones.

Dense Matrix-Vector Product: Scalability (20/28)
1D matrix distribution.
Algorithm:
  1. x = Allgather(x_i)    (x becomes available to every process)
  2. y_i = A_i x           (local computation)
Parallel cost (lower bound for T_p(n)):
  1. log2(p)α + (p-1)/p nβ ≈ log2(p)α + nβ
  2. 2(n²/p)γ
Sequential cost: T_1(n) = 2n²γ
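A sketch of the two-step algorithm, assuming n is divisible by p and a row-major local block A_i; buffer names are illustrative:

    #include <mpi.h>

    /* y_i := A_i * x, with A row-distributed: each rank owns n/p rows (A_local),
       its chunk x_local of x, and produces its chunk y_local of y.
       x_full is a workspace of n doubles. */
    void matvec_1d_rows(const double *A_local, const double *x_local,
                        double *y_local, double *x_full, int n, MPI_Comm comm) {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = n / p;                      /* rows (and x entries) per rank */

        /* Step 1: x = Allgather(x_i): every rank obtains the full vector x     */
        MPI_Allgather(x_local, nloc, MPI_DOUBLE,
                      x_full,  nloc, MPI_DOUBLE, comm);

        /* Step 2: local computation y_i = A_i x  (2 n^2 / p flops)             */
        for (int i = 0; i < nloc; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += A_local[i * n + j] * x_full[j];
            y_local[i] = s;
        }
    }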

GEMV: Scalability (21/28)
1D matrix distribution.
Speedup:    S_p(n) = T_1(n) / T_p(n) = 2n²γ / (log2(p)α + nβ + 2(n²/p)γ)
Efficiency: E_p(n) = S_p(n) / p = 1 / (1 + p·log2(p)·α/(2n²γ) + p·β/(2nγ))
Strong scalability: for fixed n, lim_{p→∞} E_p(n) = 0.
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes, i.e. n_M² = Mp.
  lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + log2(p)·α/(2Mγ) + √(p/M)·β/(2γ)) = 0

Exercise: y = Ax, A distributed by columns (22/28)
A is partitioned by columns and distributed over p processes: process i owns the block of columns A_i. Process i also owns x_i.
Goal: compute y and distribute it the same way as x.
A = [A_0 A_1 ... A_{p-1}],  x = [x_0; x_1; ...; x_{p-1}],  y = [y_0; y_1; ...; y_{p-1}].

Scalability (23/28)
Algorithm:
  1. y^(i) = A_i x_i                (local computation)
  2. y = Reduce-scatter(y^(i))      (reduction-sum of the y^(i)'s + scatter)
Parallel cost:
  1. 2(n²/p)γ
  2. log2(p)α + (p-1)/p nβ + (p-1)/p nγ ≈ log2(p)α + n(β + γ)
Lower bound for T_p(n): log2(p)α + n(β + γ) + 2(n²/p)γ
T_p(n) has one extra term (nγ) with respect to the case of A partitioned by rows; therefore this algorithm is also not scalable.
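A sketch of this column-distributed variant, assuming n is divisible by p and a row-major local block A_i of size n x n/p; names and layout are assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    /* A is column-distributed: each rank owns A_local (n x n/p) and x_local (n/p).
       The result y is left distributed like x: y_local (n/p) on each rank. */
    void matvec_1d_cols(const double *A_local, const double *x_local,
                        double *y_local, int n, MPI_Comm comm) {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = n / p;

        /* Step 1: local computation y^(i) = A_i x_i (a full-length partial result) */
        double *y_partial = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j < nloc; j++)
                s += A_local[i * nloc + j] * x_local[j];
            y_partial[i] = s;
        }

        /* Step 2: y = Reduce-scatter(y^(i)): sum the p partial vectors and leave
           chunk i of the sum on rank i                                          */
        int *counts = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) counts[i] = nloc;
        MPI_Reduce_scatter(y_partial, y_local, counts, MPI_DOUBLE, MPI_SUM, comm);

        free(counts);
        free(y_partial);
    }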

GEMV: Scalability (24/28)
2D matrix distribution.
A is partitioned according to a 2D p = r × c mesh; process (i, j), written P_ij, owns the block A_ij.
x and y are partitioned into p chunks; x is mapped to the mesh by columns, y by rows.
Example: p = 2 × 3:
A = [A_00 A_01 A_02; A_10 A_11 A_12],  x = [x_0; x_1; ...; x_5],  y = [y_0; y_1; ...; y_5].
P_ij owns A_ij;
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5;
y: P_00 owns y_0, P_01 owns y_1, P_02 owns y_2, P_10 owns y_3, P_11 owns y_4, P_12 owns y_5.

GEMV: Scalability (25/28)
2D matrix distribution.
Algorithm:
  1. x_I = Allgather(x_i) within columns    (x_I is a block of x)
  2. y_J = A_ij x_I                          (local computation)
  3. y_j = Reduce-scatter(y_J) within rows
Parallel cost (lower bound for T_p(n)):
  1. log2(r)α + (r-1)/p nβ ≈ log2(r)α + (n/c)β
  2. 2(n²/p)γ
  3. log2(c)α + (c-1)/p nβ + (c-1)/p nγ ≈ log2(c)α + (n/r)β + (n/r)γ
Sequential cost: T_1(n) = 2n²γ
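A sketch that puts the three steps together, assuming an r x c mesh with r and c dividing n, row-major placement of ranks on the mesh, and the x/y mappings of the example; all names and the storage layout are assumptions:

    #include <mpi.h>
    #include <stdlib.h>

    /* 2D GEMV: rank (i,j) owns A_ij ((n/r) x (n/c), row-major), the chunk of x
       mapped to it by columns, and produces the chunk of y mapped to it by rows. */
    void matvec_2d(const double *A_ij, const double *x_chunk, double *y_chunk,
                   int n, int r, int c, MPI_Comm comm) {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int i = rank / c, j = rank % c;          /* mesh coordinates (row-major) */
        int mloc = n / r, nloc = n / c, chunk = n / (r * c);

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(comm, i, j, &row_comm);   /* processes in my row          */
        MPI_Comm_split(comm, j, i, &col_comm);   /* processes in my column       */

        /* Step 1: allgather x within the column -> x_I, block j of x (n/c entries) */
        double *x_I = malloc(nloc * sizeof(double));
        MPI_Allgather(x_chunk, chunk, MPI_DOUBLE, x_I, chunk, MPI_DOUBLE, col_comm);

        /* Step 2: local computation y_J = A_ij * x_I (n/r partial entries)         */
        double *y_J = malloc(mloc * sizeof(double));
        for (int k = 0; k < mloc; k++) {
            double s = 0.0;
            for (int l = 0; l < nloc; l++)
                s += A_ij[k * nloc + l] * x_I[l];
            y_J[k] = s;
        }

        /* Step 3: reduce-scatter y_J within the row -> my chunk of y (n/p entries)  */
        int *counts = malloc(c * sizeof(int));
        for (int k = 0; k < c; k++) counts[k] = chunk;
        MPI_Reduce_scatter(y_J, y_chunk, counts, MPI_DOUBLE, MPI_SUM, row_comm);

        free(counts); free(y_J); free(x_I);
        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
    }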

GEMV: Scalability (26/28)
2D matrix distribution.
Parallel cost, assuming r = c = √p:
  T_p(n) = 2(n²/p)γ + (log2(r) + log2(c))α + (n/c + n/r)β + (n/r)γ
         = 2(n²/p)γ + log2(p)α + (n/√p)(2β + γ)
Speedup:    S_p(n) = p / (1 + p·log2(p)·α/(2n²γ) + √p·(2β+γ)/(2nγ))
Efficiency: E_p(n) = S_p(n)/p = ...
Strong scalability: lim_{p→∞} E_p(n) = 0.
Weak scalability: M = local memory; Mp = combined memory; n_M = largest problem solvable with p processes, i.e. n_M² = Mp.
  lim_{p→∞} E_p(n_M) = lim_{p→∞} 1 / (1 + log2(p)·α/(2Mγ) + (2β+γ)/(2√M γ)) = 0 ...
  (only the log2(p) latency term grows; the bandwidth term is now constant in p)

Exercise (27/28)
2D matrix distribution.
A is partitioned according to a 2D p = r × c mesh; process (i, j), written P_ij, owns the block A_ij.
Both x and y are partitioned into p chunks and mapped to the mesh by columns.
Example: p = 2 × 3:
A = [A_00 A_01 A_02; A_10 A_11 A_12],  x = [x_0; x_1; ...; x_5],  y = [y_0; y_1; ...; y_5].
P_ij owns A_ij;
x: P_00 owns x_0, P_10 owns x_1, P_01 owns x_2, P_11 owns x_3, P_02 owns x_4, P_12 owns x_5;
y: P_00 owns y_0, P_10 owns y_1, P_01 owns y_2, P_11 owns y_3, P_02 owns y_4, P_12 owns y_5.

Parallel Matrix Distribution (28/28)
How to store a (large) matrix using p = r × c processes?
  1D, block of rows (or columns): bad idea
  1D (block) cyclic, either by rows or by columns: bad idea
  2D, r × c quadrants: not so good
  2D (block) cyclic: good idea!
References:
  Applet: Introduction to High-Performance Scientific Computing, Victor Eijkhout (free download).
  A Comprehensive Approach to Parallel Linear Algebra Libraries (Technical Report, University of Texas), Chapter 3.
