Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco
2 Outline
- Parallel programming: basic definitions
- Choosing right algorithms: optimal serial and parallel
- Load balancing
- Rank ordering, domain decomposition
- Blocking vs non-blocking
- Overlap computation and communication
- MPI-IO and avoiding I/O bottlenecks
- Hybrid programming model: MPI + OpenMP, MPI + accelerators for GPU clusters
3 Choosing right algorithms: How does your serial algorithm scale?

Notation        Name          Example
O(1)            constant      determining if a number is even or odd; using a hash table
O(log log n)    double log    finding an item using interpolation search in a sorted array
O(log n)        log           searching in a kd-tree
O(n)            linear        finding an item in an unsorted list or in an unsorted array
O(n log n)      loglinear     performing a fast Fourier transform; heapsort, quicksort
O(n^2)          quadratic     naive bubble sort
O(n^c), c>1     polynomial    matrix multiplication, inversion
O(c^n), c>1     exponential   finding the (exact) solution to the travelling salesman problem
O(n!)           factorial     generating all unrestricted permutations of a poset
4 Parallel programming concepts: Basic definitions
5 Parallel programming concepts: Performance metrics
Speedup: ratio of serial to parallel execution time, S = t_s / t_p
Efficiency: ratio of speedup to the number of processors, E = S / p
Work: product of parallel time and processors, W = p t_p
Parallel overhead: time wasted in parallel execution, T_o = p t_p - t_s
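The slides give no code here; the four metrics can be sketched in Python as follows (the timing values in the example are invented for illustration):

```python
def metrics(t_s, t_p, p):
    """Speedup, efficiency, work, and parallel overhead from the definitions above."""
    S = t_s / t_p          # speedup: serial over parallel execution time
    E = S / p              # efficiency: speedup per processor
    W = p * t_p            # work: processors times parallel time
    T_o = p * t_p - t_s    # parallel overhead: work beyond the serial time
    return S, E, W, T_o

S, E, W, T_o = metrics(t_s=100.0, t_p=10.0, p=16)
print(S, E, W, T_o)   # 10.0 0.625 160.0 60.0
```

Note that E < 1 here precisely because the overhead T_o is nonzero.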
6 Parallel programming concepts: Analyzing serial vs parallel algorithms
Ask yourself:
- What fraction of the code can you completely parallelize? f?
- How does the problem size n scale as the processor count p grows? n(p)?
- How does the parallel overhead grow? T_o(p)?
- Does the memory requirement scale? M(p)?
Serial drawbacks: latency, single core. Parallel drawbacks: extra computation, communication/idling, contention (memory, network), serial percentage.
7 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. Let f be the serial fraction of the code (the part you cannot parallelize).
Serial time: t_s
Parallel time: t_p = f t_s + (1 - f) t_s / p
S = t_s / t_p = t_s / ( f t_s + (1 - f) t_s / p ) = 1 / ( f + (1 - f)/p )
8 Parallel programming concepts: Analyzing serial vs parallel algorithms
Quiz: if the serial fraction is 5%, what is the maximum speedup you can achieve?
Serial time = 100 secs, serial percentage = 5%. Maximum speedup?
9 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. Serial time = 100 secs, serial percentage = 5%, so f = 0.05.
S = 1 / ( f + (1 - f)/p ) approaches 1/f = 20 as p grows: the maximum speedup is 20.
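A short Python check of Amdahl's law (a sketch; the function name is ours) confirms the quiz answer:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: f is the serial fraction, p the processor count."""
    return 1.0 / (f + (1.0 - f) / p)

# With a 5% serial fraction the speedup saturates at 1/f = 20,
# no matter how many processors you throw at the problem.
for p in (10, 100, 10**6):
    print(p, amdahl_speedup(0.05, p))
```

Even with a million processors the speedup never exceeds 20.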
10 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law approximately suggests: suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city.
Gustafson's law approximately states: suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled.
11 Parallel programming concepts: Analyzing serial vs parallel algorithms
Communication overhead. Simplest model:
Transfer time = startup time + hop time (node latency) + message length / bandwidth = t_s + t_h l + t_w m
where l is the number of hops, m is the message length, and the per-word time t_w is the inverse of the bandwidth.
Send one big message instead of several small messages. Reduce the total number of bytes. Bandwidth depends on the protocol.
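As an illustration of why aggregating messages pays off, here is a sketch of the cost model above in Python (all parameter values are invented for illustration):

```python
def transfer_time(t_s, t_w, m, t_h=0.0, hops=0):
    """Transfer time = startup + per-hop latency + per-word time * message length."""
    return t_s + t_h * hops + t_w * m

# Sending n words as k separate messages pays the startup latency k times.
t_s, t_w, n, k = 1e-5, 1e-9, 10**6, 1000
one_big = transfer_time(t_s, t_w, n)
many_small = k * transfer_time(t_s, t_w, n // k)
print(one_big, many_small)   # the k small messages cost noticeably more
```

With these (made-up) constants the aggregated message is roughly ten times cheaper, even though the same number of bytes moves in both cases.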
12 Parallel programming concepts: Analyzing serial vs parallel algorithms
Collective overhead (m is the message size per process):
- Point to point (MPI_Send): t_s + t_w m
- All-to-all broadcast (MPI_Allgather): t_s log2 p + (p - 1) t_w m
- All-reduce (MPI_Allreduce): (t_s + t_w m) log2 p
- Scatter and gather (MPI_Scatter): (t_s + t_w m) log2 p
- All-to-all (personalized): (p - 1)(t_s + t_w m)
13 Parallel programming: Basic definitions
Isoefficiency: can we maintain the efficiency/speedup of the algorithm? How should the problem size scale with p to keep efficiency constant?
E = 1 / ( 1 + T_o / W )
Maintain the ratio T_o(W, p) / W of overhead to parallel work constant.
14 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= C T_o(n,p).
Procedure:
1. Get the sequential time T(n,1)
2. Get the parallel time T(n,p)
3. Calculate the overhead T_o = p T(n,p) - T(n,1)
How does the overhead compare to the useful work being done?
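The three-step procedure can be sketched in Python for the adding-numbers example that follows (operation counts rather than wall-clock times; the function names are ours):

```python
import math

def t_serial(n):
    """T(n,1): n additions."""
    return n

def t_parallel(n, p):
    """T(n,p): n/p local additions plus 2*log2(p) communicate-and-add steps."""
    return n / p + 2 * math.log2(p)

def overhead(n, p):
    """Step 3: T_o = p*T(n,p) - T(n,1)."""
    return p * t_parallel(n, p) - t_serial(n)

print(overhead(10**6, 64))   # 2 * 64 * log2(64) = 768.0
```

The useful work (10^6 additions) dwarfs the 768 steps of overhead here, which is why this problem scales well.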
15 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= C T_o(n,p).
Scalability: do you have enough resources (memory) to scale to that size?
Maintain the ratio T_o(W, p) / W of overhead to parallel work constant.
16 Parallel programming: Basic definitions
Scalability: do you have enough resources (memory) to scale to that size?
[Figure: memory needed per processor vs number of processors. If the required memory per processor grows, e.g. as C p log p or C p, efficiency cannot be maintained; if it stays constant at C, efficiency can be maintained.]
17 Adding numbers
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Serial time: n
Parallel time: n/p local additions, then log p steps to communicate plus log p additions of partial sums
18 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Speedup: S = n / ( n/p + 2 log p )
Efficiency: E = n / ( n + 2 p log p )
Isoefficiency: if you increase the problem size n as O(p log p), then the efficiency can remain constant.
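A quick numerical check of the isoefficiency claim (a sketch; c is an arbitrary constant):

```python
import math

def efficiency(n, p):
    """E = n / (n + 2 p log2 p) for the parallel sum."""
    return n / (n + 2 * p * math.log2(p))

# Growing n as c * p * log2(p) pins the efficiency at c / (c + 2),
# independent of the processor count.
c = 32
for p in (4, 64, 1024):
    n = c * p * math.log2(p)
    print(p, efficiency(n, p))
```

Every line prints the same efficiency, c/(c+2), confirming that n = O(p log p) holds E constant.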
19 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Efficiency: E = n / ( n + 2 p log p )
Scalability: M(n)/p = n/p = (p log p)/p = log p, so the memory per processor grows only as log p.
20 Sorting Each processor has n/p numbers. 4, 3, 9 1, 3, 7 7,5,8 5,2,9 Our plan 1. Split list into parts 2. Sort parts individually 3. Merge lists to get sorted list
21 Sorting
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Background:
- Bubble sort: O(n^2), stable, exchanging
- Selection sort: O(n^2), unstable, selection
- Insertion sort: O(n^2), stable, insertion
- Merge sort: O(n log n), stable, merging
- Quick sort: O(n log n), unstable, partitioning
22 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
23 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (swap?): (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y
24 Choosing right algorithms: Optimal serial and parallel
Case study: Bubble sort
Main loop: for i : 1 to length_of(A) - 1
  Secondary loop: for j : i+1 to length_of(A)
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare-and-swap trace (swap?): (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y
N(N-1)/2 = O(N^2) comparisons
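A runnable Python sketch of the pseudocode above, counting comparisons (note this compare-any-pair variant is sometimes called exchange sort rather than classic bubble sort):

```python
def compare_swap_sort(a):
    """Sort as in the slide's pseudocode, counting comparisons."""
    a = list(a)
    comparisons = 0
    for i in range(len(a) - 1):
        for j in range(i + 1, len(a)):
            comparisons += 1
            if a[j] < a[i]:            # keep the smaller element on the left
                a[i], a[j] = a[j], a[i]
    return a, comparisons

result, count = compare_swap_sort([4, 3, 9, 1])
print(result, count)   # [1, 3, 4, 9] 6, i.e. N(N-1)/2 comparisons for N = 4
```

The comparison count is always exactly N(N-1)/2, matching the O(N^2) bound regardless of input order.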
25 Choosing right algorithms: Optimal serial and parallel Case Study: Merge sort Recursively merge lists having one element each Split [ ] [ 1 5 ] [ 4 2 ] [ 1 ] [ 5 ] [ 4 ] [ 2 ] Merge [ 1 5 ] [ 2 4 ] [ ]
26 Choosing right algorithms: Optimal serial and parallel Case Study: Merge sort Recursively merge lists having one element each Split [ ] [ 1 5 ] [ 4 2 ] [ 1 ] [ 5 ] [ 4 ] [ 2 ] Merge Best sorting algorithms need O( N log N) [ 1 5 ] [ 2 4 ] [ ]
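A minimal Python sketch of the split-and-merge scheme pictured above:

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(a):
    """Split down to single-element lists, then merge back up: O(n log n)."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))

print(merge_sort([1, 5, 4, 2]))   # [1, 2, 4, 5]
```

Each of the log n levels merges n elements in total, giving the O(n log n) bound.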
27 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Sub-optimal: pop-push merge on 1 processor, O(np): repeat n - 1 times, popping the smallest head among the p lists and pushing it onto the output list, scanning all p heads each time.
28 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Optimal: recursively, or tree based: merge 2 lists at a time, then 4, then 8, over log p rounds.
Best merge algorithms need O(n log p).
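The tree-based merge can be sketched serially in Python, pairing lists over log2(p) rounds (a sketch of the idea, not an MPI implementation):

```python
def merge_two(left, right):
    """Merge two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def tree_merge(lists):
    """Merge p sorted lists pairwise over log2(p) rounds."""
    while len(lists) > 1:
        lists = [merge_two(lists[k], lists[k + 1]) if k + 1 < len(lists)
                 else lists[k] for k in range(0, len(lists), 2)]
    return lists[0]

parts = [sorted(x) for x in ([4, 3, 9], [1, 3, 7], [7, 5, 8], [5, 2, 9])]
print(tree_merge(parts))   # [1, 2, 3, 3, 4, 5, 5, 7, 7, 8, 9, 9]
```

In the parallel version each round halves the number of active processes, which is where the log p factor comes from.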
29 Sorting
Each processor has n/p numbers, e.g. 4,3,9 | 1,3,7 | 7,5,8 | 5,2,9
Serial time: n log n
Parallel local sort time: (n/p) log(n/p)
Merge time: (n-1)(n/p) sub-optimally, or n log p with a tree merge
Total parallel time: (n/p) log(n/p) + n log p
30 Sorting
Each processor has n/p numbers.
Overhead: T_o = p t_p - t_s = p [ (n/p) log(n/p) + n log p ] - n log n = np log p - n log p
Isoefficiency: n log n ~ c (n log p) => n ~ p^c
Scalability: M = n/p = p^(c-1). Low for c > 2.
31 Data decomposition 1D : Row-wise R R R
32 Data decomposition 1D : Column-wise C C C
33 Data decomposition 2D : Block-wise B1 B2 B3 B4 B5 B6 B7 B8 B9
34 Data decomposition
Laplace solver: (n x n) mesh with p processors
Time to communicate 1 cell: t_comm_cell = t_s + t_w
Time to evaluate the stencil once: t_comp_cell = 5 t_float
The 5-point stencil at (i, j) uses a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1).
35 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
[Figure: rows split between Proc 0 and Proc 1; the stencil a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1) straddles the boundary, so neighboring boundary rows must be exchanged.]
36 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
Serial time: t_seq = n^2 t_comp_cell
Parallel computation time: t_process = (n^2 / p) t_comp_cell
Ghost communication: t_comm = 2n t_comm_cell (one boundary row to each neighbor)
37 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors
Overhead: T_o = p t_comm = 2pn ~ pn
Isoefficiency: n^2 ~ c n p => n ~ c p
Scalability: M = n^2 / p = c^2 p^2 / p = C p. Memory per processor grows linearly with p: poor scaling.
38 Data decomposition
Laplace solver: 2D block-wise
[Figure: mesh split into blocks among Proc 0..3; each block exchanges its four boundary edges with its neighbors for the stencil a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1).]
39 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors
Serial time: t_seq = n^2 t_comp_cell
Parallel computation time: t_process = (n^2 / p) t_comp_cell
Ghost communication: t_comm = (4n / sqrt(p)) t_comm_cell (four edges of an (n/sqrt(p)) x (n/sqrt(p)) block)
40 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors
Overhead: T_o = p (4n / sqrt(p)) = 4n sqrt(p) ~ n sqrt(p)
Isoefficiency: n^2 ~ c n sqrt(p) => n ~ c sqrt(p)
Scalability: M = n^2 / p = (c sqrt(p))^2 / p = C. Memory per processor stays constant: perfect scaling.
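The contrast between the two decompositions can be made concrete with the cost model (a sketch; ghost-cell words exchanged per iteration per processor, mesh size invented):

```python
import math

def ghost_words_1d(n, p):
    """1D row-wise: two full boundary rows of n cells each."""
    return 2 * n

def ghost_words_2d(n, p):
    """2D block-wise: four edges of an (n/sqrt(p)) x (n/sqrt(p)) block."""
    return 4 * n / math.sqrt(p)

n = 4096
for p in (16, 64, 256):
    print(p, ghost_words_1d(n, p), ghost_words_2d(n, p))
```

The 1D communication volume is independent of p, while the 2D volume shrinks as p grows; that is exactly why 2D maintains efficiency and 1D does not.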
41 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Computation: each processor computes n/p elements of the output, n multiplies + (n-1) adds for each: O(n^2 / p)
Communication: allgather at the end so each processor has a full copy of the output vector:
log p startup steps, and sum over i = 1..log p of 2^(i-1) (n/p) = n(p-1)/p words
42 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Algorithm:
1. Collect the vector using MPI_Allgather
2. Local matrix multiplication to get the output vector
Wastes much memory: every processor stores a full copy of the vector.
43 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Computation: each processor computes n/p elements, n multiplies + (n-1) adds for each: O(n^2 / p)
Communication: allgather so each processor has a full copy of the output vector: t_w n + t_s log p
Overhead: T_o = t_s p log p + t_w n p
44 Data decomposition
Matrix vector multiplication: 1D row-wise decomposition
Speedup: S = p / ( 1 + p (t_s log p + t_w n) / (t_c n^2) )
Isoefficiency: n^2 ~ p log p + n p => n ~ c p
Scalability: M(p) = n^2 / p = c^2 p
Not scalable!
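A serial Python stand-in for the 1D row-wise algorithm (the "ranks" are simulated in a loop, and the MPI_Allgather step becomes a plain concatenation):

```python
def matvec_rowwise(A, x, p):
    """Each of p 'ranks' owns n/p rows and computes its slice of y = A x."""
    n = len(A)
    rows_per = n // p                     # assumes p divides n
    pieces = []
    for rank in range(p):
        block = A[rank * rows_per:(rank + 1) * rows_per]
        pieces.append([sum(a * b for a, b in zip(row, x)) for row in block])
    # the MPI_Allgather step: every rank ends up with the full output vector
    return [v for piece in pieces for v in piece]

A = [[1, 0, 2, 0], [0, 1, 0, 0], [3, 0, 0, 1], [0, 0, 1, 1]]
x = [1, 2, 3, 4]
print(matvec_rowwise(A, x, p=2))   # [7, 2, 7, 7]
```

Any p that divides n gives the same answer as the serial product, which is a useful sanity check when porting to real MPI.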
45 Data decomposition Matrix vector multiplication: 1D column-wise decomposition Serial Computation? Parallel Computation? Proc 0 Proc 1 Proc 2 Overhead? Isoefficiency? Scalability?
46 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm: uses p processors on a sqrt(p) x sqrt(p) grid; block A_ij lives on the process in grid position (i, j), and the vector is split into parts v_0, v_1, v_2:
A00 (proc 0)  A01 (proc 1)  A02 (proc 2)   x   v_0
A10 (proc 3)  A11 (proc 4)  A12 (proc 5)       v_1
A20 (proc 6)  A21 (proc 7)  A22 (proc 8)       v_2
47 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 0: copy part-vector v_i to the diagonal process (i, i)
48 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 1: broadcast the vector part down each column, so process (i, j) holds v_j
49 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 2: local computation on each processor: process (i, j) computes A_ij v_j
50 Data decomposition
Matrix vector multiplication: 2D decomposition
Algorithm step 3: reduce across each row to form u_i = sum over j of A_ij v_j
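Steps 0-3 can be simulated serially in Python: blocks[i][j] plays the role of A_ij on process (i, j), and the column broadcast and row reduce become plain loops (a sketch of the data flow, not MPI code):

```python
def matvec_2d(blocks, vparts):
    """2D algorithm: process (i, j) computes A_ij v_j; each grid row reduces to u_i."""
    q = len(blocks)                        # q x q process grid, q = sqrt(p)
    u = []
    for i in range(q):                     # step 3: reduce across grid row i
        row_sum = None
        for j in range(q):                 # step 1 gave process (i, j) the part v_j
            local = [sum(a * b for a, b in zip(r, vparts[j]))   # step 2
                     for r in blocks[i][j]]
            row_sum = local if row_sum is None else \
                [s + t for s, t in zip(row_sum, local)]
        u.extend(row_sum)
    return u

# 2x2 matrix as 1x1 blocks on a 2x2 grid; v split into two one-element parts
blocks = [[[[1]], [[2]]],
          [[[3]], [[4]]]]
vparts = [[5], [6]]
print(matvec_2d(blocks, vparts))   # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]
```

In real MPI, step 1 is a broadcast over a column sub-communicator and step 3 a reduce over a row sub-communicator, which is where the log(sqrt(p)) factors in the next slides come from.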
51 Data decomposition
Matrix vector multiplication: 2D decomposition
Computation: each processor computes n^2 / p elements: t_c n^2 / p
Communication: (2n / sqrt(p)) log(sqrt(p)) + n / sqrt(p) ~ (n / sqrt(p)) log p
Overhead: T_o = n sqrt(p) log p
52 Data decomposition
Matrix vector multiplication: 2D decomposition
Isoefficiency: n^2 ~ c n sqrt(p) log p => n ~ c sqrt(p) log p
Scalability: M(p) = n^2 / p = c^2 (log p)^2
Scales better than 1D!
53 Let's look at the code
54 Summary
- Know your algorithm!
- Don't expect the unexpected!
- Pay attention to parallel design and implementation right from the outset. It will save you a lot of labor.
More informationTiming Results of a Parallel FFTsynth
Purdue University Purdue e-pubs Department of Computer Science Technical Reports Department of Computer Science 1994 Timing Results of a Parallel FFTsynth Robert E. Lynch Purdue University, rel@cs.purdue.edu
More informationDirect Self-Consistent Field Computations on GPU Clusters
Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd
More informationArnaud Legrand, CNRS, University of Grenoble. November 1, LIG laboratory, Theoretical Parallel Computing. A.
Theoretical Arnaud Legrand, CNRS, University of Grenoble LIG laboratory, arnaudlegrand@imagfr November 1, 2009 1 / 63 Outline Theoretical 1 2 3 2 / 63 Outline Theoretical 1 2 3 3 / 63 Models for Computation
More informationParallel Numerics. Scope: Revise standard numerical methods considering parallel computations!
Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst:
More informationCS 310 Advanced Data Structures and Algorithms
CS 310 Advanced Data Structures and Algorithms Runtime Analysis May 31, 2017 Tong Wang UMass Boston CS 310 May 31, 2017 1 / 37 Topics Weiss chapter 5 What is algorithm analysis Big O, big, big notations
More informationDivisible Load Scheduling
Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute
More informationTheoretical Computer Science
Theoretical Computer Science 412 (2011) 1484 1491 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: wwwelseviercom/locate/tcs Parallel QR processing of Generalized
More informationPetaBricks: Variable Accuracy and Online Learning
PetaBricks: Variable Accuracy and Online Learning Jason Ansel MIT - CSAIL May 4, 2011 Jason Ansel (MIT) PetaBricks May 4, 2011 1 / 40 Outline 1 Motivating Example 2 PetaBricks Language Overview 3 Variable
More informationFinite difference methods. Finite difference methods p. 1
Finite difference methods Finite difference methods p. 1 Overview 1D heat equation u t = κu xx +f(x,t) as a motivating example Quick intro of the finite difference method Recapitulation of parallelization
More informationScalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver
Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,
More informationEfficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method
Efficiently solving large sparse linear systems on a distributed and heterogeneous grid by using the multisplitting-direct method S. Contassot-Vivier, R. Couturier, C. Denis and F. Jézéquel Université
More informationLecture 22: Multithreaded Algorithms CSCI Algorithms I. Andrew Rosenberg
Lecture 22: Multithreaded Algorithms CSCI 700 - Algorithms I Andrew Rosenberg Last Time Open Addressing Hashing Today Multithreading Two Styles of Threading Shared Memory Every thread can access the same
More informationSWE Anatomy of a Parallel Shallow Water Code
SWE Anatomy of a Parallel Shallow Water Code CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering Michael Bader July 8 19, 2013 Computer Simulations in Science and Engineering,
More informationIntroduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra
Introduction to communication avoiding algorithms for direct methods of factorization in Linear Algebra Laura Grigori Abstract Modern, massively parallel computers play a fundamental role in a large and
More informationExtending Parallel Scalability of LAMMPS and Multiscale Reactive Molecular Simulations
September 18, 2012 Extending Parallel Scalability of LAMMPS and Multiscale Reactive Molecular Simulations Yuxing Peng, Chris Knight, Philip Blood, Lonnie Crosby, and Gregory A. Voth Outline Proton Solvation
More informationParallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano
Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic
More informationData Structures and Algorithms CSE 465
Data Structures and Algorithms CSE 465 LECTURE 8 Analyzing Quick Sort Sofya Raskhodnikova and Adam Smith Reminder: QuickSort Quicksort an n-element array: 1. Divide: Partition the array around a pivot
More informationAnalysis of Algorithms CMPSC 565
Analysis of Algorithms CMPSC 565 LECTURES 38-39 Randomized Algorithms II Quickselect Quicksort Running time Adam Smith L1.1 Types of randomized analysis Average-case analysis : Assume data is distributed
More informationdata structures and algorithms lecture 2
data structures and algorithms 2018 09 06 lecture 2 recall: insertion sort Algorithm insertionsort(a, n): for j := 2 to n do key := A[j] i := j 1 while i 1 and A[i] > key do A[i + 1] := A[i] i := i 1 A[i
More informationScalability Metrics for Parallel Molecular Dynamics
Scalability Metrics for arallel Molecular Dynamics arallel Efficiency We define the efficiency of a parallel program running on processors to solve a problem of size W Let T(W, ) be the execution time
More informationParallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors
Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer
More informationVerifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University
Verifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University Goals of Verifiable Computation Provide user with correctness
More informationThe Performance Evolution of the Parallel Ocean Program on the Cray X1
The Performance Evolution of the Parallel Ocean Program on the Cray X1 Patrick H. Worley Oak Ridge National Laboratory John Levesque Cray Inc. 46th Cray User Group Conference May 18, 2003 Knoxville Marriott
More informationParallel Matrix Factorization for Recommender Systems
Under consideration for publication in Knowledge and Information Systems Parallel Matrix Factorization for Recommender Systems Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon Department of
More informationCA-SVM: Communication-Avoiding Support Vector Machines on Distributed System
CA-SVM: Communication-Avoiding Support Vector Machines on Distributed System Yang You 1, James Demmel 1, Kent Czechowski 2, Le Song 2, Richard Vuduc 2 UC Berkeley 1, Georgia Tech 2 Yang You (Speaker) James
More informationModel Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University
Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul
More informationA Nested Dissection Parallel Direct Solver. for Simulations of 3D DC/AC Resistivity. Measurements. Maciej Paszyński (1,2)
A Nested Dissection Parallel Direct Solver for Simulations of 3D DC/AC Resistivity Measurements Maciej Paszyński (1,2) David Pardo (2), Carlos Torres-Verdín (2) (1) Department of Computer Science, AGH
More informationImpression Store: Compressive Sensing-based Storage for. Big Data Analytics
Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in
More information0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA
0-1 Knapsack Problem in parallel Progetto del corso di Calcolo Parallelo AA 2008-09 Salvatore Orlando 1 0-1 Knapsack problem N objects, j=1,..,n Each kind of item j has a value p j and a weight w j (single
More informationERLANGEN REGIONAL COMPUTING CENTER
ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,
More informationLecture 4: Linear Algebra 1
Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation
More information