Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco


1 Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

2 Outline
- Parallel programming: Basic definitions
- Choosing right algorithms: Optimal serial and parallel
- Load balancing
- Rank ordering, domain decomposition
- Blocking vs non-blocking
- Overlap of computation and communication
- MPI-IO and avoiding I/O bottlenecks
- Hybrid programming model: MPI + OpenMP, MPI + accelerators for GPU clusters

3 Choosing right algorithms: How does your serial algorithm scale?
Notation | Name | Example
O(1) | constant | Determining if a number is even or odd; using a hash table
O(log log n) | double log | Finding an item using interpolation search in a sorted array
O((log n)^c), c > 1 | polylog | Searching in a kd-tree
O(n) | linear | Finding an item in an unsorted list or in an unsorted array
O(n log n) | loglinear | Performing a fast Fourier transform; heapsort, quicksort
O(n^2) | quadratic | Naive bubble sort
O(n^c), c > 1 | polynomial | Matrix multiplication, inversion
O(c^n), c > 1 | exponential | Finding the (exact) solution to the travelling salesman problem
O(n!) | factorial | Generating all unrestricted permutations of a poset

4 Parallel programming concepts: Basic definitions

5 Parallel programming concepts: Performance metrics
Speedup: ratio of serial to parallel execution time, S = t_s / t_p
Efficiency: ratio of speedup to the number of processors, E = S / p
Work: product of parallel time and processors, W = p t_p
Parallel overhead: idle time and extra work wasted in parallel execution, T_o = p t_p - t_s
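A minimal sketch of how these metrics can be computed from measured run times; the numbers and variable names below are illustrative only, not taken from the slides:

```c
#include <stdio.h>

/* Compute the basic metrics from measured serial and parallel run times.
   t_s, t_p and p are assumed to come from your own timing runs. */
int main(void)
{
    double t_s = 100.0;   /* serial time (seconds), example value   */
    double t_p = 30.0;    /* parallel time (seconds), example value */
    int    p   = 4;       /* number of processors                   */

    double S  = t_s / t_p;        /* speedup                            */
    double E  = S / p;            /* efficiency                         */
    double W  = p * t_p;          /* work = processors x parallel time  */
    double To = p * t_p - t_s;    /* parallel overhead                  */

    printf("S = %.2f, E = %.2f, W = %.1f s, To = %.1f s\n", S, E, W, To);
    return 0;
}
```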

6 Parallel programming concepts: Analyzing serial vs parallel algorithms
Ask yourself:
- What fraction of the code can you completely parallelize? f?
- How does the problem size scale? Processors scale as p; how does the problem size n scale with p? n(p)?
- How does the parallel overhead grow? T_o(p)?
- Does the problem scale? M(p)?
Serial drawbacks: latency, single core.
Parallel drawbacks: extra computation, communication/idling, contention (memory, network), serial percentage.

7 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. What fraction of the code can you completely parallelize? Call the remaining serial fraction f.
Serial time: t_s
Parallel time: t_p = f t_s + (1 - f) t_s / p
Speedup: S = t_s / t_p = t_s / (f t_s + (1 - f) t_s / p) = 1 / (f + (1 - f)/p)
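Taking the limit of this speedup expression as p grows makes the bound imposed by the serial fraction explicit (a standard consequence of the formula above, worked out here for completeness):

```latex
S(p) = \frac{1}{f + \frac{1-f}{p}}
\qquad\Longrightarrow\qquad
\lim_{p \to \infty} S(p) = \frac{1}{f}
```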

8 Parallel programming concepts: Analyzing serial vs parallel algorithms
Quiz: if the serial fraction is 5%, what is the maximum speedup you can achieve?
Serial time = 100 secs, serial percentage = 5%. Maximum speedup?

9 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's law. Serial time = 100 secs, serial fraction f = 5%.
S = 1 / (f + (1 - f)/p) -> 1/f = 1/0.05 = 20 as p -> infinity.

10 Parallel programming concepts: Analyzing serial vs parallel algorithms
Amdahl's Law approximately suggests: suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve a 90 mph average before reaching the second city.
Gustafson's Law approximately states: suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled.

11 Parallel programming concepts: Analyzing serial vs parallel algorithms
Communication overhead, simplest model:
Transfer time = startup time + hop time (node latency) x hops + message length / bandwidth = t_s + t_h l + t_w m
Send one big message instead of several small messages. Reduce the total number of bytes. Bandwidth depends on the protocol.
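A small sketch of this cost model; the latency, hop and bandwidth constants are made-up illustrative values, not measurements from any particular machine:

```c
#include <stdio.h>

/* Transfer-time model from the slide: t = t_s + t_h*l + t_w*m, where t_s is
   the startup time, t_h the per-hop time, l the number of hops, t_w the
   per-byte time (1/bandwidth) and m the message size in bytes. */
static double transfer_time(double m_bytes, int hops)
{
    const double t_s = 2.0e-6;   /* startup time, seconds (illustrative) */
    const double t_h = 0.5e-6;   /* per-hop time, seconds (illustrative) */
    const double t_w = 1.0e-9;   /* per-byte time, ~1 GB/s (illustrative)*/
    return t_s + t_h * hops + t_w * m_bytes;
}

int main(void)
{
    int hops = 2;
    /* one 1 MB message vs. 1024 messages of 1 KB each */
    double one_big    = transfer_time(1024.0 * 1024.0, hops);
    double many_small = 1024.0 * transfer_time(1024.0, hops);
    printf("one 1MB message : %.6f s\n", one_big);
    printf("1024 x 1KB      : %.6f s\n", many_small);
    return 0;
}
```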

12 Parallel programming concepts: Analyzing serial vs parallel algorithms
Point to point (MPI_Send): t_s + t_w m
Collective overhead:
- All-to-all broadcast (MPI_Allgather): t_s log2 p + (p - 1) t_w m
- All-reduce (MPI_Allreduce): (t_s + t_w m) log2 p
- Scatter and gather (MPI_Scatter): (t_s + t_w m) log2 p
- All-to-all (personalized): (p - 1)(t_s + t_w m)
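These cost formulas can be checked empirically by timing a collective; a minimal sketch using MPI_Wtime (the message size is an arbitrary example):

```c
#include <mpi.h>
#include <stdio.h>

/* Rough timing of one collective, to compare against the
   (t_s + t_w*m) log2 p model from the slide. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    enum { M = 1 << 20 };            /* one million doubles (example size) */
    static double in[M], out[M];
    for (int i = 0; i < M; i++) in[i] = rank + i * 1.0e-6;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Allreduce(in, out, M, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Allreduce of %d doubles on %d ranks: %.6f s\n",
               M, p, t1 - t0);
    MPI_Finalize();
    return 0;
}
```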

13 Parallel programming: Basic definitions
Isoefficiency: can we maintain the efficiency/speedup of the algorithm? How should the problem size scale with p to keep efficiency constant?
E = 1 / (1 + T_o / W)
Maintain the ratio T_o(W, p) / W, overhead to parallel work, constant.

14 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= c T_o(n,p).
Procedure:
1. Get the sequential time T(n,1)
2. Get the parallel time p T(n,p)
3. Calculate the overhead T_o = p T(n,p) - T(n,1)
How does the overhead compare to the useful work being done?

15 Parallel programming: Basic definitions
Isoefficiency relation: to keep efficiency constant you must increase the problem size such that T(n,1) >= c T_o(n,p).
Scalability: do you have enough resources (memory) to scale to that size?
Maintain the ratio T_o(W, p) / W, overhead to parallel work, constant.

16 Parallel programming: Basic definitions
Scalability: do you have enough resources (memory) to scale to that size?
Figure: memory needed per processor as a function of the number of processors, for growth rates C, C p and C p log p; the faster the required memory per processor grows, the harder it is to maintain efficiency.

17 Adding numbers
Each processor has n/p numbers: 4, 3, 9 | 1, 3, 7 | 7, 5, 8 | 5, 2, 9
Serial time: n
Parallel time: n/p to add locally, then log p communications + log p additions to combine.
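A minimal MPI sketch of this scheme: each rank adds its local numbers, then a reduction combines the partial sums in about log2 p steps. The local values are just the example numbers above; run with 4 ranks to match them.

```c
#include <mpi.h>
#include <stdio.h>

/* Each rank adds its n/p local numbers (time ~ n/p), then MPI_Reduce
   combines the partial sums in ~log2(p) steps. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double values[4][3] = { {4, 3, 9}, {1, 3, 7}, {7, 5, 8}, {5, 2, 9} };
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < 3; i++)
        local_sum += values[rank % 4][i];   /* local n/p additions */

    /* the log2(p) combining steps happen inside the reduction */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}
```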

18 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Speedup: S = n / (n/p + 2 log p)
Efficiency: E = n / (n + 2 p log p)
Isoefficiency: if you increase the problem size n as O(p log p), then efficiency can remain constant!

19 Adding numbers
Each processor has n/p numbers. Steps to communicate and add are 2 log p.
Efficiency: E = n / (n + 2 p log p)
Scalability: M(n)/p = n/p = (p log p)/p = log p, so the memory needed per processor grows only as log p.

20 Sorting
Each processor has n/p numbers: 4, 3, 9 | 1, 3, 7 | 7, 5, 8 | 5, 2, 9
Our plan:
1. Split the list into parts
2. Sort the parts individually
3. Merge the lists to get the sorted list

21 Sorting
Each processor has n/p numbers: 4, 3, 9 | 1, 3, 7 | 7, 5, 8 | 5, 2, 9
Background:
- Bubble sort: O(n^2), stable, exchanging
- Selection sort: O(n^2), unstable, selection
- Insertion sort: O(n^2), stable, insertion
- Merge sort: O(n log n), stable, merging
- Quick sort: O(n log n), unstable, partitioning

22 Choosing right algorithms: Optimal serial and parallel
Case study: bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
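A direct C transcription of the slide's compare-and-swap loops (the array contents are arbitrary, since the slide's example values are not recoverable):

```c
#include <stdio.h>

/* Compare-and-swap sort exactly as in the slide's pseudocode:
   for each i, scan j > i and swap whenever A[j] < A[i], so the smallest
   remaining element ends up at position i. O(N^2) comparisons. */
static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

int main(void)
{
    int A[] = { 5, 1, 4, 2 };                 /* example input */
    int n = (int)(sizeof A / sizeof A[0]);

    for (int i = 0; i < n - 1; i++)           /* main loop      */
        for (int j = i + 1; j < n; j++)       /* secondary loop */
            if (A[j] < A[i])
                swap(&A[i], &A[j]);

    for (int i = 0; i < n; i++) printf("%d ", A[i]);
    printf("\n");
    return 0;
}
```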

23 Choosing right algorithms: Optimal serial and parallel
Case study: bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare and swap steps: (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y

24 Choosing right algorithms: Optimal serial and parallel
Case study: bubble sort
Main loop: for i = 1 to length_of(A) - 1
  Secondary loop: for j = i + 1 to length_of(A)
    Compare and swap so the smaller element is to the left:
    if ( A[j] < A[i] ) swap( A[i], A[j] )
Compare and swap steps: (i=1, j=2) Y, (i=1, j=3) N, (i=1, j=4) N, (i=2, j=3) Y, (i=2, j=4) Y, (i=3, j=4) Y
N(N-1)/2 = O(N^2) comparisons

25 Choosing right algorithms: Optimal serial and parallel
Case study: merge sort
Recursively split into lists having one element each, then merge:
Split: [ 1 5 4 2 ] -> [ 1 5 ] [ 4 2 ] -> [ 1 ] [ 5 ] [ 4 ] [ 2 ]
Merge: [ 1 5 ] [ 2 4 ] -> [ 1 2 4 5 ]

26 Choosing right algorithms: Optimal serial and parallel
Case study: merge sort
Recursively split into lists having one element each, then merge:
Split: [ 1 5 4 2 ] -> [ 1 5 ] [ 4 2 ] -> [ 1 ] [ 5 ] [ 4 ] [ 2 ]
Merge: [ 1 5 ] [ 2 4 ] -> [ 1 2 4 5 ]
The best sorting algorithms need O(N log N).
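A compact serial merge sort corresponding to the split/merge picture above (a sketch; the input is the slide's example list):

```c
#include <stdio.h>
#include <string.h>

/* Recursive merge sort: split down to one-element lists, then merge the
   sorted halves back together. O(N log N) comparisons. Sorts a[lo..hi). */
static void merge_sort(int *a, int *tmp, int lo, int hi)
{
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);

    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof *a);
}

int main(void)
{
    int a[] = { 1, 5, 4, 2 };                 /* the slide's example */
    int tmp[4];
    merge_sort(a, tmp, 0, 4);
    for (int i = 0; i < 4; i++) printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```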

27 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Sub-optimal: pop-push merge on 1 processor, O(np). At each of the n steps, pop the smallest of the p list heads and push it onto the output list ([ ] => [ 1 ], [ ] => [ 1 2 ], ..., through step n-1).

28 Choosing right algorithms: Parallel sorting technique
Merging: merge p lists having n/p elements each.
Optimal: recursively, or tree based — merge 2 lists at a time, then 4, then 8, ..., over log p rounds.
The best merge algorithms need O(n log p).

29 Sorting
Each processor has n/p numbers: 4, 3, 9 | 1, 3, 7 | 7, 5, 8 | 5, 2, 9
Serial time: n log n
Parallel (local sort) time: (n/p) log(n/p)
Merge time: n/p + n log p

30 Sorting
Each processor has n/p numbers: 4, 3, 9 | 1, 3, 7 | 7, 5, 8 | 5, 2, 9
Overhead: T_o = p t_p - t_s ~ n log p
Isoefficiency: n log n >= c (n log p) => n >= p^c
Scalability: M(n)/p = n/p = p^(c-1); low for c > 2

31 Data decomposition
1D: row-wise (each process gets a contiguous block of rows)

32 Data decomposition
1D: column-wise (each process gets a contiguous block of columns)

33 Data decomposition
2D: block-wise (the matrix is split into a grid of blocks B1, B2, B3 / B4, B5, B6 / B7, B8, B9, one per process)

34 Data decomposition
Laplace solver: (n x n) mesh with p processors.
Time to communicate 1 cell: t_comm^cell = t_s + t_w
Time to evaluate the stencil once: t_comp^cell = 5 t_float
Five-point stencil at (i, j): uses a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1).
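A serial sketch of one sweep of this stencil; the function name and grid layout are illustrative, with an (n x n) interior surrounded by a one-cell halo:

```c
/* One Jacobi sweep of the 5-point Laplace stencil: each interior cell is
   replaced by the average of its four neighbours, roughly the five
   floating-point operations per cell that the slide counts. */
void jacobi_sweep(int n, double a[n + 2][n + 2], double anew[n + 2][n + 2])
{
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            anew[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                                 a[i][j - 1] + a[i][j + 1]);
}
```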

35 Data decomposition
Laplace solver: 1D row-wise decomposition of the (n x n) mesh with p processors.
Figure: at a row of Proc 0 bordering Proc 1, the stencil needs a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1), so neighbouring processes must exchange ghost rows.

36 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors.
Serial time: t_seq = n^2 t_comp^cell
Parallel computation time: t_process = (n^2 / p) t_comp^cell
Ghost communication: t_comm = 2n t_comm^cell
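A sketch of the ghost-row exchange behind the 2n t_comm^cell term, assuming each rank stores its nrows owned rows of length n plus one halo row above and below (the names and layout are illustrative, not from the slides):

```c
#include <mpi.h>

/* Ghost-row exchange for the 1D row-wise decomposition: each rank sends its
   first and last owned rows to its neighbours and receives their halo rows
   in return (2n words per sweep). 'local' holds rows 0..nrows+1 of length n,
   where rows 1..nrows are owned and rows 0 and nrows+1 are the halos. */
void exchange_ghost_rows(double *local, int nrows, int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first owned row up, receive bottom halo from below */
    MPI_Sendrecv(&local[1 * n],           n, MPI_DOUBLE, up,   0,
                 &local[(nrows + 1) * n], n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* send last owned row down, receive top halo from above */
    MPI_Sendrecv(&local[nrows * n],       n, MPI_DOUBLE, down, 1,
                 &local[0 * n],           n, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);
}
```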

37 Data decomposition
Laplace solver: 1D row-wise, (n x n) mesh with p processors.
Overhead: T_o = p t_p - t_seq = 2 p n t_comm^cell ~ p n
Isoefficiency: n^2 >= c n p => n >= c p
Scalability: M(n)/p = n^2 / p = c^2 p^2 / p = c^2 p — poor scaling.

38 Data decomposition
Laplace solver: 2D block-wise decomposition.
Figure: the mesh is split into blocks owned by Proc 0, Proc 1, Proc 2, Proc 3; at a block boundary the stencil a(i-1, j), a(i+1, j), a(i, j-1), a(i, j+1) requires ghost cells from the neighbouring blocks.

39 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors.
Serial time: t_seq = n^2 t_comp^cell
Parallel computation time: t_process = (n^2 / p) t_comp^cell
Ghost communication: t_comm = (4n / sqrt(p)) t_comm^cell

40 Data decomposition
Laplace solver: 2D block-wise, (n x n) mesh with p processors.
Overhead: T_o = p (4n / sqrt(p)) t_comm^cell ~ n sqrt(p)
Isoefficiency: n^2 >= c n sqrt(p) => n >= c sqrt(p)
Scalability: M(n)/p = n^2 / p = (c sqrt(p))^2 / p = c^2 — constant: perfect scaling.

41 Data decomposition
Matrix-vector multiplication: 1D row-wise decomposition.
Computation: each processor computes n/p elements of the output, each needing n multiplies + (n-1) adds: O(n^2 / p).
Communication: an all-gather at the end so each processor has a full copy of the output vector: log p steps, moving sum_{i=1..log p} 2^(i-1) n/p = n(p-1)/p words, i.e. about log p + n(p-1)/p.

42 Data decomposition
Matrix-vector multiplication: 1D row-wise decomposition.
Algorithm:
1. Collect the vector using MPI_Allgather
2. Local matrix multiplication to get the output vector
Wastes much memory: every process ends up holding a full copy of the vector.
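A sketch of this two-step algorithm in C with MPI; the function and variable names are illustrative, and n is assumed to be divisible by p:

```c
#include <mpi.h>
#include <stdlib.h>

/* 1D row-wise matrix-vector product following the slide's algorithm:
   (1) MPI_Allgather assembles the full input vector on every rank,
   (2) each rank multiplies its n/p local rows by it. */
void matvec_1d(const double *Alocal,  /* (n/p) x n local rows     */
               const double *xlocal,  /* n/p local vector entries */
               double *ylocal,        /* n/p local result entries */
               int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    double *xfull = malloc((size_t)n * sizeof *xfull);
    MPI_Allgather(xlocal, nloc, MPI_DOUBLE,
                  xfull,  nloc, MPI_DOUBLE, comm);  /* full copy everywhere */

    for (int i = 0; i < nloc; i++) {                /* local multiply */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += Alocal[i * n + j] * xfull[j];
        ylocal[i] = s;
    }
    free(xfull);
}
```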

43 Data decomposition
Matrix-vector multiplication: 1D row-wise decomposition.
Computation: each processor computes n/p elements, n multiplies + (n-1) adds each: O(n^2 / p).
Communication: all-gather at the end so each processor has a full copy of the output vector: t_w n + t_s log p
Overhead: T_o = t_s p log p + t_w n p

44 Data decomposition
Matrix-vector multiplication: 1D row-wise decomposition.
Speedup: S = p / (1 + p (t_s log p + t_w n) / (t_c n^2))
Isoefficiency: n^2 >= c (p log p + n p) => n >= c p
Scalability: M(n)/p = n^2 / p = c^2 p — not scalable!

45 Data decomposition
Matrix-vector multiplication: 1D column-wise decomposition (Proc 0, Proc 1, Proc 2 each own a block of columns).
Serial computation? Parallel computation? Overhead? Isoefficiency? Scalability?

46 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Algorithm: uses p processors on a sqrt(p) x sqrt(p) grid (here 9 processors on a 3 x 3 grid):
A00 A01 A02   (proc 0, proc 1, proc 2)       v0
A10 A11 A12   (proc 3, proc 4, proc 5)   X   v1
A20 A21 A22   (proc 6, proc 7, proc 8)       v2

47 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Algorithm, step 0: copy each part-vector v_j to the diagonal processor of its column (procs 0, 4, 8 in the 3 x 3 grid).

48 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Algorithm, step 1: broadcast the vector piece v_j down each column of processors.

49 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Algorithm, step 2: local computation on each processor — proc (i, j) computes A_ij v_j.

50 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Algorithm, step 3: reduce across each row to form u_i = sum_j A_ij v_j.
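A sketch of steps 0-3 using row and column sub-communicators; all function and variable names, and the data layout, are illustrative assumptions rather than the slides' code:

```c
#include <mpi.h>
#include <stdlib.h>

/* 2D matrix-vector product on a q x q = sqrt(p) x sqrt(p) process grid:
   split MPI_COMM_WORLD into row and column communicators, broadcast the
   vector piece (held by the diagonal rank of each column) down the column,
   multiply the local nb x nb block, then reduce the partial results across
   each row onto the diagonal rank. */
void matvec_2d(const double *Ablock, const double *vpiece, double *upiece,
               int nb, int q)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank / q, col = rank % q;

    MPI_Comm rowcomm, colcomm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowcomm);  /* same row    */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colcomm);  /* same column */

    double *v       = malloc((size_t)nb * sizeof *v);
    double *partial = malloc((size_t)nb * sizeof *partial);

    /* Step 0: the diagonal rank of each column starts with its piece of v */
    if (row == col)
        for (int j = 0; j < nb; j++) v[j] = vpiece[j];

    /* Step 1: broadcast v_j down the column (diagonal rank is the root) */
    MPI_Bcast(v, nb, MPI_DOUBLE, col, colcomm);

    /* Step 2: local block product A_ij * v_j */
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            partial[i] += Ablock[i * nb + j] * v[j];
    }

    /* Step 3: reduce across the row; u_i lands on the diagonal rank */
    MPI_Reduce(partial, upiece, nb, MPI_DOUBLE, MPI_SUM, row, rowcomm);

    free(v);
    free(partial);
    MPI_Comm_free(&rowcomm);
    MPI_Comm_free(&colcomm);
}
```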

51 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Computation: each processor computes n^2/p elements: ~ t_c n^2 / p
Communication: (2n / sqrt(p)) log(sqrt(p)) + n / sqrt(p) ~ (n log p) / sqrt(p)
Overhead: T_o ~ n sqrt(p) log p

52 Data decomposition
Matrix-vector multiplication: 2D decomposition.
Isoefficiency: n^2 >= c n sqrt(p) log p => n >= c sqrt(p) log p
Scalability: M(n)/p = n^2 / p = c^2 (log p)^2
Scales better than 1D!

53 Let's look at the code

54 Summary
- Know your algorithm!
- Don't expect the unexpected!
- Pay attention to parallel design and implementation right from the outset. It will save you a lot of labor.
