Models: Amdahl's Law, PRAM, α-β Tal Ben-Nun


1 Models: Amdahl's Law, PRAM, α-β Tal Ben-Nun Design of Parallel and High-Performance Computing Fall 2017

2 DPHPC Overview: cache coherency, memory models

3 Speedup An application runs in time T_1 on one processor and in time T_p on p processors. Measured speedup on p processors: S_p = T_1 / T_p. Can we bound/estimate the speedup without running the application?
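
The measured speedup above comes straight from timing the same computation with 1 and with p workers. A minimal sketch (illustrative workload and thread chunking, not lecture code):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Time the same computation on p threads and report S_p = T_1 / T_p.
    double run(int p, std::size_t n) {
        std::vector<double> partial(p, 0.0);            // one slot per thread
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < p; ++t)
            threads.emplace_back([&, t] {               // thread t works on its chunk of [0, n)
                for (std::size_t i = t * n / p; i < (t + 1) * n / p; ++i)
                    partial[t] += 1.0 / (1.0 + i);
            });
        for (auto& th : threads) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const std::size_t n = 100000000;
        const double t1 = run(1, n);
        for (int p : {2, 4, 8}) {
            const double tp = run(p, n);
            std::printf("p=%d  T_p=%.3f s  S_p = T_1/T_p = %.2f\n", p, tp, t1 / tp);
        }
    }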

4 Scaling Models Intuition Consider the following application (each phase annotated as sequential or parallel):

auto A = InitializeArray(random_seed, important_info);   // Sequential
for (int i = 0; i < 10000; ++i)                           // Parallel
  for (int j = 0; j < 10000; ++j)
    B[i][j] = Compute(A[i][j]);
SequentialDecisionProcess(B);                             // Sequential
for (int i = 0; i < 10000; ++i)                           // Parallel
  for (int j = 0; j < 10000; ++j)
    C[i][j] = Compute(B[i][j]);
WriteResults(C);                                          // Sequential

5 Amdahl's Law Given f ∈ [0, 1], the sequential fraction of an application: T_p ≥ (1 − f) · T_1 / p + f · T_1. (Gene Amdahl)

6 Scaling Models Intuition (revisited) Consider the same application, now annotating each phase with its share of the runtime:

auto A = InitializeArray(random_seed, important_info);   // Sequential: part of f·T_1
for (int i = 0; i < 10000; ++i)                           // Parallel: part of (1 − f)·T_1
  for (int j = 0; j < 10000; ++j)
    B[i][j] = Compute(A[i][j]);
SequentialDecisionProcess(B);                             // Sequential: part of f·T_1
for (int i = 0; i < 10000; ++i)                           // Parallel: part of (1 − f)·T_1
  for (int j = 0; j < 10000; ++j)
    C[i][j] = Compute(B[i][j]);
WriteResults(C);                                          // Sequential: part of f·T_1

With two processors the parallel phases shrink to (1 − f)·T_1 / 2, so T_2 = f·T_1 + (1 − f)·T_1 / 2.

7 Amdahl's Law Given f ∈ [0, 1], the sequential fraction of an application: T_p ≥ (1 − f) · T_1 / p + f · T_1. Speedup can be modeled as: S_p = T_1 / T_p ≤ 1 / ((1 − f)/p + f). And efficiency: E_p = S_p / p ≤ 1 / ((1 − f) + f · p). (Gene Amdahl)
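
These bounds are easy to tabulate. A minimal sketch (helper names are my own, not from the lecture) that evaluates the Amdahl speedup and efficiency bounds for the sequential fraction f = 1/15 used in the speedup plot a few slides later:

    #include <cstdio>

    // Upper bounds from Amdahl's Law for sequential fraction f and p processors.
    double amdahl_speedup(double f, int p)    { return 1.0 / ((1.0 - f) / p + f); }
    double amdahl_efficiency(double f, int p) { return amdahl_speedup(f, p) / p; }

    int main() {
        const double f = 1.0 / 15.0;                     // sequential fraction
        for (int p : {1, 2, 4, 8, 16, 64, 256, 1024})
            std::printf("p=%5d  S_p <= %6.2f  E_p <= %5.3f\n",
                        p, amdahl_speedup(f, p), amdahl_efficiency(f, p));
        // As p grows, S_p approaches 1/f = 15 while E_p drops toward 0.
    }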

8 Amdahl's Law: Runtime [Bar chart: runtime vs. #processors/cores. Each bar splits into a constant sequential part f·T_1 and a parallel part that shrinks with p: (1 − f)·T_1, (1 − f)/2·T_1, (1 − f)/3·T_1, (1 − f)/4·T_1, (1 − f)/5·T_1.]

9 Amdahl's Law: Speedup [Plot: speedup vs. number of cores for different sequential fractions (e.g. f = 1/15), compared against linear speedup.]

10 Amdahl's Law As p → ∞: T_∞ → f · T_1, S_∞ → 1/f, and E_∞ → 0 (if f ≠ 0).

11 Question Is Amdahl's law pessimistic or optimistic?

12 Amdahl's Law is Pessimistic It fixes the problem size, but more processors are typically used to solve larger problems: f is not a constant fraction but a function f(n), and the same applies to T_1, which becomes T_1(n).

13 Case Study: Weather Forecasting Grid-point climate/atmospheric models: GFS, GME, COSMO. The denser the grid, the more accurate the prediction becomes. Important for anticipating natural disasters!

14 Gustafson's Law States: S ≥ (1 − f(n)) · n → n (if f(n) → 0). (John Gustafson) Larger problems can be solved (with more resources) in the same amount of time.
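
For a quick numeric contrast with Amdahl's fixed-size bound, here is a minimal sketch assuming the common Gustafson-Barsis form S_p = f + (1 − f)·p, where f is the sequential fraction measured on the parallel system (this concrete form is my assumption, not taken from the slide):

    #include <cstdio>

    // Scaled speedup under the Gustafson-Barsis form S_p = f + (1 - f) * p.
    int main() {
        const double f = 1.0 / 15.0;                     // sequential fraction
        for (int p : {1, 2, 4, 8, 16, 64, 256, 1024})
            std::printf("p=%5d  scaled speedup = %8.1f\n", p, f + (1.0 - f) * p);
        // Unlike Amdahl's fixed-size bound, which saturates at 1/f = 15,
        // the scaled speedup keeps growing roughly linearly with p.
    }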

15 Strong and Weak Scaling Strong scaling: behavior of S_p(n) for a fixed n as p → ∞ (fixed problem size as p increases). Weak scaling: behavior of S_p(n) as n → ∞ and p → ∞ (increasing problem size as p increases).

16 Pessimism/Optimism Revisited Could both models be optimistic?

17 Runtime With Parallel Overhead [Chart: runtime vs. #processors/cores; the parallel overhead component grows with the number of processors.]

18 Other Issue: Load Balancing Amdahl's Law assumes that the computation divides evenly; processor waiting time is another form of overhead. [Diagram: per-processor timeline showing task started, working time, waiting time, and task completed.]

19 Runtime with Parallel Overhead and Work Imbalance [Chart: runtime vs. #processors/cores, with an additional component showing time lost due to work imbalance.]

20 Real Speedup Graphs Often Look Like This [Plot: measured speedup vs. number of processors, compared against linear speedup.]

21 Additional Resources Assumption in both laws: increasing p does not change other resources (e.g. cache). It is thus possible to achieve super-linear speedup. Example 1: the aggregate cache size grows with p, so a larger working set fits in cache. Example 2: probabilistic convergent algorithms.

22 Amdahl's Law: Summary Strong, simple models for scaling, but they ignore the overhead incurred by parallelization and load imbalance. In reality: S_p(n) = T_1(n) / (T_p(n) + A_p(n)), where A_p(n) is the parallelization overhead. Programs are not always that simple, and f does not always exist.

23 PRAM Model Parallel Random Access Machine: an abstract model for shared-memory parallelism. Any processor can access the shared memory in unit time. What about access conflicts? Exclusive read, exclusive write (EREW); concurrent read, exclusive write (CREW); concurrent read, concurrent write (CRCW). [Diagram: processors Proc. 1, Proc. 2, …, Proc. p connected to a shared memory.]

24 PRAM Model: Program Representation Programs are represented as DAGs (directed acyclic graphs). Nodes: unit-time operations. Edges: dependencies. We can now define W(n) = #nodes (the work), D(n) = longest path from input to output (the depth), and the average parallelism W(n)/D(n). [Diagram: example DAG from inputs to outputs.]
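
Both quantities are easy to compute for a concrete DAG. A minimal sketch (my own example, not from the lecture) that derives W, D and the average parallelism for a small binary-tree reduction DAG, assuming the nodes are numbered in topological order:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Binary-tree reduction of 4 inputs: nodes 0..3 are inputs,
        // nodes 4 and 5 are partial sums, node 6 is the result.
        std::vector<std::vector<int>> succ = {{4}, {4}, {5}, {5}, {6}, {6}, {}};
        const int W = static_cast<int>(succ.size());     // work: number of nodes
        std::vector<int> depth(W, 1);                    // depth counted in nodes
        for (int u = 0; u < W; ++u)                      // nodes are in topological order
            for (int v : succ[u])
                depth[v] = std::max(depth[v], depth[u] + 1);
        const int D = *std::max_element(depth.begin(), depth.end());
        std::printf("W = %d, D = %d, average parallelism = %.2f\n", W, D, double(W) / D);
    }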

25 PRAM Model: Advantages A powerful tool for analyzing parallel algorithms: no need to manage communication and synchronization, so we can focus on algorithmic aspects. The time-unit and processor abstraction is robust w.r.t. architectures.

26 PRAM Model: Examples Consider reduction: (x_1 + x_2 + … + x_n). Sequential chain: W(n) = Θ(n), D(n) = Θ(n), parallelism = O(1). Binary tree: W(n) = Θ(n), D(n) = Θ(log n), parallelism = O(n / log n).
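
The tree variant can be organized as ⌈log₂ n⌉ rounds of independent pairwise additions; with enough processors each round takes unit time, which gives the Θ(log n) depth. A minimal sketch of that schedule (illustrative, not lecture code; the inner loop is where a real implementation would parallelize):

    #include <cstdio>
    #include <vector>

    // Tree reduction: all additions within one round are independent.
    double tree_reduce(std::vector<double> x) {
        for (std::size_t stride = 1; stride < x.size(); stride *= 2)     // one round per level
            for (std::size_t i = 0; i + stride < x.size(); i += 2 * stride)
                x[i] += x[i + stride];                                   // independent within the round
        return x.empty() ? 0.0 : x[0];
    }

    int main() {
        std::vector<double> x(1000, 1.0);
        std::printf("sum = %.1f\n", tree_reduce(x));                     // prints 1000.0
    }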

27 PRAM Model: Examples Merge sort:

List MergeSort(List L) {
  if (L.length() == 1)
    return L;
  int n = L.length();
  auto L1 = MergeSort(L.sublist(0, n / 2));      // first half  (sublist(i, j): elements i..j-1)
  auto L2 = MergeSort(L.sublist(n / 2, n));      // second half
  return merge(L1, L2);                          // sequential merge: Θ(n)
}

W(n) = Θ(n log n); D(n) = D(n/2) + Θ(n) = Θ(n). Average parallelism: Θ(log n). (Animation: u/morolin)

28 PRAM Model: Examples Scan (exclusive prefix sum): Input: L = (x_1, …, x_n). Output: (0, x_1, x_1 + x_2, …, x_1 + … + x_{n−1}).

List Scan(List L) {
  if (L.length() == 1)
    return List { 0 };
  List sums(L.length() / 2);
  for (int i = 0; i < L.length() / 2; ++i)       // pairwise sums (independent)
    sums[i] = L[2 * i] + L[2 * i + 1];
  List evens = Scan(sums);                       // recurse on the half-length list
  List odds(ceil(L.length() / 2.0));
  for (int i = 0; i < odds.length(); ++i)        // fill odd positions (independent)
    odds[i] = evens[i] + L[2 * i];
  return Interleave(evens, odds);
}

(Images: The PRAM Model and Algorithms, Prof. Robert van Engelen)

29 PRAM Model: Examples Scan (continued, same code as above): W(n) = W(n/2) + Θ(n) = Θ(n); D(n) = D(n/2) + Θ(1) = Θ(log n); parallelism: O(n / log n).
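
To sanity-check the recursion, here is a runnable translation of the scan above (my own, using std::vector and assuming the input length is a power of two), verified against std::exclusive_scan:

    #include <cstdio>
    #include <numeric>
    #include <vector>

    std::vector<int> Scan(const std::vector<int>& L) {            // exclusive prefix sum
        if (L.size() == 1) return {0};
        std::vector<int> sums(L.size() / 2);
        for (std::size_t i = 0; i < sums.size(); ++i)
            sums[i] = L[2 * i] + L[2 * i + 1];                     // pairwise sums
        std::vector<int> evens = Scan(sums);                       // recurse on half the length
        std::vector<int> result(L.size());
        for (std::size_t i = 0; i < sums.size(); ++i) {            // interleave evens and odds
            result[2 * i]     = evens[i];
            result[2 * i + 1] = evens[i] + L[2 * i];
        }
        return result;
    }

    int main() {
        std::vector<int> x = {3, 1, 4, 1, 5, 9, 2, 6};
        std::vector<int> ref(x.size());
        std::exclusive_scan(x.begin(), x.end(), ref.begin(), 0);   // C++17 reference result
        std::printf("match: %s\n", Scan(x) == ref ? "yes" : "no"); // prints "yes"
    }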

30 Reasoning in PRAM Given a DAG with W(n) nodes and depth D(n): sequential runtime T_1(n) = W(n); time on infinitely many processors T_∞(n) = D(n); time on p processors T_p(n) = ? We know T_p(n) ≥ D(n) and T_p(n) ≥ W(n)/p. Can we bound T_p from above?

31 Brent's Theorem T_p(n) ≤ D(n) + (W(n) − D(n)) / p. Intuition: bound the parallelism available at each level of the DAG. [Diagram: DAG partitioned into levels from inputs to outputs.]

32 Brent's Theorem T_p breakdown: level 1 has s_1 operations and takes ⌈s_1/p⌉ time steps, level 2 has s_2 operations and takes ⌈s_2/p⌉ steps, …, level D has s_D operations and takes ⌈s_D/p⌉ steps, with Σ_{i=1}^{D} s_i = W. [Diagram: the DAG partitioned into levels with their operation counts.]

33 Brent's Theorem: Proof Observe that a DAG of depth D can be partitioned into D levels such that there are no dependencies between nodes in the same level. For each level i, define T_p^i as the time for level i with p processors and let s_i (= T_1^i) denote the number of nodes in level i. Thus: T_p^i = ⌈s_i / p⌉ ≤ (s_i + p − 1) / p. It follows that: T_p(n) ≤ Σ_{i=1}^{D} T_p^i ≤ Σ_{i=1}^{D} (s_i + p − 1) / p.

34 Brent's Theorem: Proof T_p(n) ≤ Σ_{i=1}^{D} (s_i + p − 1)/p = (1/p)·W(n) + ((p − 1)/p)·D(n) = D(n) + (W(n) − D(n))/p ≤ D(n) + W(n)/p. To conclude: since T_1/p = W(n)/p, we get T_p(n) ≤ W(n)/p + D(n) = T_1/p + T_∞.
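
As a concrete illustration, the bound can be evaluated for the binary-tree reduction from earlier, with W = n − 1 additions and D = ⌈log₂ n⌉ levels (a sketch; the numbers are only the model's prediction):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 1 << 20;                         // problem size
        const double W = n - 1;                           // work of a tree reduction
        const double D = std::ceil(std::log2(n));         // depth of a tree reduction
        for (int p : {1, 16, 256, 4096, 65536}) {
            const double bound = D + (W - D) / p;         // Brent's upper bound on T_p
            std::printf("p=%6d  T_p <= %10.1f   (lower bounds: W/p = %9.1f, D = %2.0f)\n",
                        p, bound, W / p, D);
        }
    }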

35 Speedups S_p(n) = T_1(n) / T_p(n); S_p(n) ≤ p; by Brent's theorem, S_p(n) ≥ W(n) / (D(n) + (W(n) − D(n))/p); and S_∞(n) = W(n) / D(n). For fixed n, speedup is limited; for n → ∞, speedup can be unbounded. Example: tree reduction.

36 Concurrency: Communication Models, the α-β Model Gist: α = latency (cycles), β = bandwidth (units/cycle). [Diagram: node A sends to node B over a link with latency α and bandwidth β.] How long does it take to send a message of size n? [Timeline: α cycles of latency followed by n · (1/β) cycles of transfer.]

37 Concurrency: α-β Model Time to send a message of size n: T(n) = α + n · (1/β) = α + n/β.
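
A minimal sketch that evaluates T(n) = α + n/β for a few message sizes (the α and β values are illustrative placeholders, not measurements from the lecture):

    #include <cstdio>

    int main() {
        const double alpha = 1000.0;                      // latency in cycles (assumed)
        const double beta  = 8.0;                         // bandwidth in bytes/cycle (assumed)
        for (double n : {1.0, 64.0, 4096.0, 262144.0, 16777216.0}) {
            const double t = alpha + n / beta;            // T(n) = alpha + n / beta
            std::printf("n = %10.0f bytes  T = %12.0f cycles  (%s-dominated)\n",
                        n, t, alpha > n / beta ? "latency" : "bandwidth");
        }
    }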

38 Little's Law: Primer Problem 1: In Tannenbar, on average 2 students arrive and 2 leave every minute, and each student spends 8 minutes inside. Q: How many students are in Tannenbar at a given time? A: 16. Problem 2: In your fridge you have 150 bottles of Vivi Kola (on average), and you drink and buy on average 25 bottles per week. Q: How many weeks does each bottle last? A: 6.

39 Little's Law Queueing theory: in a stationary system (input rate = output rate), the number of customers N in the queue is given by N = λW, where λ = arrival rate and W = average time spent in the system. (John Little) A simple rule, yet powerful: N is independent of the I/O distribution. Connection: in the α-β model, the arrival rate is β and the time spent in the system is α cycles.

40 Concurrency: Little's Law and the α-β Model [Diagram: β units arrive per cycle at the communication medium and β units leave per cycle; each unit spends α cycles in the system.] Concurrent units in transit: N = α · β.

41 Example: Memory Systems Latency × Throughput = Concurrency. Latency is reduced by half every 9 years, while throughput doubles every 3 years, so parallel processing is beneficial! [Diagram: Core 1 and Core 2 connected through a cache to memory.] Real-world statistics: Intel Core 2 (2006): α = 100 cycles, β = 2 bytes/cycle, N = 200. Intel Haswell (2014): α = 63 cycles, β = 23 bytes/cycle, N = 1449.
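
The N values follow directly from N = α · β. A minimal sketch that reproduces them (the closing remark about 64-byte cache lines is my own observation, not from the slide):

    #include <cstdio>

    struct System { const char* name; double alpha_cycles; double beta_bytes_per_cycle; };

    int main() {
        const System systems[] = {
            {"Intel Core 2 (2006)",  100.0,  2.0},
            {"Intel Haswell (2014)",  63.0, 23.0},
        };
        for (const auto& s : systems)
            std::printf("%-22s N = alpha * beta = %5.0f bytes in flight\n",
                        s.name, s.alpha_cycles * s.beta_bytes_per_cycle);
        // With 64-byte cache lines, Haswell's ~1449 bytes correspond to roughly
        // 23 outstanding misses needed to saturate memory bandwidth.
    }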
