Models: Amdahl's Law, PRAM, α-β Tal Ben-Nun


1 Models: Amdahl's Law, PRAM, α-β Tal Ben-Nun Design of Parallel and High-Performance Computing Fall 2017

2 DPHPC Overview: cache coherency, memory models

3 Speedup An application runs in time T_1 on one processor and in time T_p on p processors. Measured speedup on p processors: S_p = T_1 / T_p. Can we bound/estimate the speedup without running the application?
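
The measured speedup above comes straight from timing the same computation with 1 and with p workers. A minimal sketch (illustrative workload and thread chunking, not lecture code):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Time the same computation on p threads and report S_p = T_1 / T_p.
    double run(int p, std::size_t n) {
        std::vector<double> partial(p, 0.0);            // one slot per thread
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < p; ++t)
            threads.emplace_back([&, t] {               // thread t works on its chunk of [0, n)
                for (std::size_t i = t * n / p; i < (t + 1) * n / p; ++i)
                    partial[t] += 1.0 / (1.0 + i);
            });
        for (auto& th : threads) th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const std::size_t n = 100000000;
        const double t1 = run(1, n);
        for (int p : {2, 4, 8}) {
            const double tp = run(p, n);
            std::printf("p=%d  T_p=%.3f s  S_p = T_1/T_p = %.2f\n", p, tp, t1 / tp);
        }
    }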

4 Scaling Models Intuition Consider the following application (each phase annotated as sequential or parallel):

auto A = InitializeArray(random_seed, important_info);   // Sequential
for (int i = 0; i < 10000; ++i)                           // Parallel
  for (int j = 0; j < 10000; ++j)
    B[i][j] = Compute(A[i][j]);
SequentialDecisionProcess(B);                             // Sequential
for (int i = 0; i < 10000; ++i)                           // Parallel
  for (int j = 0; j < 10000; ++j)
    C[i][j] = Compute(B[i][j]);
WriteResults(C);                                          // Sequential

5 Amdahl's Law Given f ∈ [0, 1], the sequential fraction of an application: T_p ≥ (1 − f) · T_1 / p + f · T_1. (Gene Amdahl)

6 Scaling Models Intuition (revisited) Consider the same application, now annotating each phase with its share of the runtime:

auto A = InitializeArray(random_seed, important_info);   // Sequential: part of f·T_1
for (int i = 0; i < 10000; ++i)                           // Parallel: part of (1 − f)·T_1
  for (int j = 0; j < 10000; ++j)
    B[i][j] = Compute(A[i][j]);
SequentialDecisionProcess(B);                             // Sequential: part of f·T_1
for (int i = 0; i < 10000; ++i)                           // Parallel: part of (1 − f)·T_1
  for (int j = 0; j < 10000; ++j)
    C[i][j] = Compute(B[i][j]);
WriteResults(C);                                          // Sequential: part of f·T_1

With two processors the parallel phases shrink to (1 − f)·T_1 / 2, so T_2 = f·T_1 + (1 − f)·T_1 / 2.

7 Amdahl's Law Given f ∈ [0, 1], the sequential fraction of an application: T_p ≥ (1 − f) · T_1 / p + f · T_1. Speedup can be modeled as: S_p = T_1 / T_p ≤ 1 / ((1 − f)/p + f). And efficiency: E_p = S_p / p ≤ 1 / ((1 − f) + f · p). (Gene Amdahl)
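
These bounds are easy to tabulate. A minimal sketch (helper names are my own, not from the lecture) that evaluates the Amdahl speedup and efficiency bounds for the sequential fraction f = 1/15 used in the speedup plot a few slides later:

    #include <cstdio>

    // Upper bounds from Amdahl's Law for sequential fraction f and p processors.
    double amdahl_speedup(double f, int p)    { return 1.0 / ((1.0 - f) / p + f); }
    double amdahl_efficiency(double f, int p) { return amdahl_speedup(f, p) / p; }

    int main() {
        const double f = 1.0 / 15.0;                     // sequential fraction
        for (int p : {1, 2, 4, 8, 16, 64, 256, 1024})
            std::printf("p=%5d  S_p <= %6.2f  E_p <= %5.3f\n",
                        p, amdahl_speedup(f, p), amdahl_efficiency(f, p));
        // As p grows, S_p approaches 1/f = 15 while E_p drops toward 0.
    }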

8 Amdahl's Law: Runtime [Bar chart: runtime vs. #processors/cores. Each bar splits into a constant sequential part f·T_1 and a parallel part that shrinks with p: (1 − f)·T_1, (1 − f)/2·T_1, (1 − f)/3·T_1, (1 − f)/4·T_1, (1 − f)/5·T_1.]

9 Amdahl's Law: Speedup [Plot: speedup vs. number of cores for different sequential fractions (e.g. f = 1/15), compared against linear speedup.]

10 Amdahl's Law As p → ∞: T_∞ → f · T_1, S_∞ → 1/f, and E_∞ → 0 (if f ≠ 0).

11 Question Is Amdahl's law pessimistic or optimistic?

12 Amdahl's Law is Pessimistic It fixes the problem size, but more processors are typically used to solve larger problems: f is not a constant fraction but a function f(n), and the same applies to T_1, which becomes T_1(n).

13 Case Study: Weather Forecasting Grid-point climate/atmospheric models: GFS, GME, COSMO. The denser the grid, the more accurate the prediction becomes. Important for anticipating natural disasters!

14 Gustafson's Law States: S ≥ (1 − f(n)) · n → n (if f(n) → 0). (John Gustafson) Larger problems can be solved (with more resources) in the same amount of time.
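
For a quick numeric contrast with Amdahl's fixed-size bound, here is a minimal sketch assuming the common Gustafson-Barsis form S_p = f + (1 − f)·p, where f is the sequential fraction measured on the parallel system (this concrete form is my assumption, not taken from the slide):

    #include <cstdio>

    // Scaled speedup under the Gustafson-Barsis form S_p = f + (1 - f) * p.
    int main() {
        const double f = 1.0 / 15.0;                     // sequential fraction
        for (int p : {1, 2, 4, 8, 16, 64, 256, 1024})
            std::printf("p=%5d  scaled speedup = %8.1f\n", p, f + (1.0 - f) * p);
        // Unlike Amdahl's fixed-size bound, which saturates at 1/f = 15,
        // the scaled speedup keeps growing roughly linearly with p.
    }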

15 Strong and Weak Scaling Strong scaling: behavior of S_p(n) for a fixed n as p → ∞ (fixed problem size as p increases). Weak scaling: behavior of S_p(n) as n → ∞ and p → ∞ (increasing problem size as p increases).

16 Pessimism/Optimism Revisited Could both models be optimistic?

17 Runtime With Parallel Overhead [Chart: runtime vs. #processors/cores; the parallel overhead component grows with the number of processors.]

18 Other Issue: Load Balancing Amdahl's Law assumes that the computation divides evenly; processor waiting time is another form of overhead. [Diagram: per-processor timeline showing task started, working time, waiting time, and task completed.]

19 Runtime with Parallel Overhead and Work Imbalance [Chart: runtime vs. #processors/cores, with an additional component showing time lost due to work imbalance.]

20 Real Speedup Graphs Often Look Like This [Plot: measured speedup vs. number of processors, compared against linear speedup.]

21 Additional Resources Assumption in both laws: increasing p does not change other resources (e.g. cache). It is thus possible to achieve super-linear speedup. Example 1: the aggregate cache size grows with p, so a larger working set fits in cache. Example 2: probabilistic convergent algorithms.

22 Amdahl's Law: Summary Strong, simple models for scaling, but they ignore the overhead incurred by parallelization and load imbalance. In reality: S_p(n) = T_1(n) / (T_p(n) + A_p(n)), where A_p(n) is the parallelization overhead. Programs are not always that simple, and f does not always exist.

23 PRAM Model Parallel Random Access Machine: an abstract model for shared-memory parallelism. Any processor can access the shared memory in unit time. What about access conflicts? Exclusive read, exclusive write (EREW); concurrent read, exclusive write (CREW); concurrent read, concurrent write (CRCW). [Diagram: processors Proc. 1, Proc. 2, …, Proc. p connected to a shared memory.]

24 PRAM Model: Program Representation Programs are represented as DAGs (directed acyclic graphs). Nodes: unit-time operations. Edges: dependencies. We can now define W(n) = #nodes (the work), D(n) = longest path from input to output (the depth), and the average parallelism W(n)/D(n). [Diagram: example DAG from inputs to outputs.]
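
Both quantities are easy to compute for a concrete DAG. A minimal sketch (my own example, not from the lecture) that derives W, D and the average parallelism for a small binary-tree reduction DAG, assuming the nodes are numbered in topological order:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main() {
        // Binary-tree reduction of 4 inputs: nodes 0..3 are inputs,
        // nodes 4 and 5 are partial sums, node 6 is the result.
        std::vector<std::vector<int>> succ = {{4}, {4}, {5}, {5}, {6}, {6}, {}};
        const int W = static_cast<int>(succ.size());     // work: number of nodes
        std::vector<int> depth(W, 1);                    // depth counted in nodes
        for (int u = 0; u < W; ++u)                      // nodes are in topological order
            for (int v : succ[u])
                depth[v] = std::max(depth[v], depth[u] + 1);
        const int D = *std::max_element(depth.begin(), depth.end());
        std::printf("W = %d, D = %d, average parallelism = %.2f\n", W, D, double(W) / D);
    }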

25 PRAM Model: Advantages A powerful tool for analyzing parallel algorithms: no need to manage communication and synchronization, so we can focus on algorithmic aspects. The time-unit and processor abstraction is robust w.r.t. architectures.

26 PRAM Model: Examples Consider reduction: (x_1 + x_2 + … + x_n). Sequential chain: W(n) = Θ(n), D(n) = Θ(n), parallelism = O(1). Binary tree: W(n) = Θ(n), D(n) = Θ(log n), parallelism = O(n / log n).
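
The tree variant can be organized as ⌈log₂ n⌉ rounds of independent pairwise additions; with enough processors each round takes unit time, which gives the Θ(log n) depth. A minimal sketch of that schedule (illustrative, not lecture code; the inner loop is where a real implementation would parallelize):

    #include <cstdio>
    #include <vector>

    // Tree reduction: all additions within one round are independent.
    double tree_reduce(std::vector<double> x) {
        for (std::size_t stride = 1; stride < x.size(); stride *= 2)     // one round per level
            for (std::size_t i = 0; i + stride < x.size(); i += 2 * stride)
                x[i] += x[i + stride];                                   // independent within the round
        return x.empty() ? 0.0 : x[0];
    }

    int main() {
        std::vector<double> x(1000, 1.0);
        std::printf("sum = %.1f\n", tree_reduce(x));                     // prints 1000.0
    }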

27 PRAM Model: Examples Merge sort:

List MergeSort(List L) {
  if (L.length() == 1)
    return L;
  int n = L.length();
  auto L1 = MergeSort(L.sublist(0, n / 2));      // first half  (sublist(i, j): elements i..j-1)
  auto L2 = MergeSort(L.sublist(n / 2, n));      // second half
  return merge(L1, L2);                          // sequential merge: Θ(n)
}

W(n) = Θ(n log n); D(n) = D(n/2) + Θ(n) = Θ(n). Average parallelism: Θ(log n). (Animation: u/morolin)

28 PRAM Model: Examples Scan (exclusive prefix sum): Input: L = (x_1, …, x_n). Output: (0, x_1, x_1 + x_2, …, x_1 + … + x_{n−1}).

List Scan(List L) {
  if (L.length() == 1)
    return List { 0 };
  List sums(L.length() / 2);
  for (int i = 0; i < L.length() / 2; ++i)       // pairwise sums (independent)
    sums[i] = L[2 * i] + L[2 * i + 1];
  List evens = Scan(sums);                       // recurse on the half-length list
  List odds(ceil(L.length() / 2.0));
  for (int i = 0; i < odds.length(); ++i)        // fill odd positions (independent)
    odds[i] = evens[i] + L[2 * i];
  return Interleave(evens, odds);
}

(Images: The PRAM Model and Algorithms, Prof. Robert van Engelen)

29 PRAM Model: Examples Scan (continued, same code as above): W(n) = W(n/2) + Θ(n) = Θ(n); D(n) = D(n/2) + Θ(1) = Θ(log n); parallelism: O(n / log n).
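
To sanity-check the recursion, here is a runnable translation of the scan above (my own, using std::vector and assuming the input length is a power of two), verified against std::exclusive_scan:

    #include <cstdio>
    #include <numeric>
    #include <vector>

    std::vector<int> Scan(const std::vector<int>& L) {            // exclusive prefix sum
        if (L.size() == 1) return {0};
        std::vector<int> sums(L.size() / 2);
        for (std::size_t i = 0; i < sums.size(); ++i)
            sums[i] = L[2 * i] + L[2 * i + 1];                     // pairwise sums
        std::vector<int> evens = Scan(sums);                       // recurse on half the length
        std::vector<int> result(L.size());
        for (std::size_t i = 0; i < sums.size(); ++i) {            // interleave evens and odds
            result[2 * i]     = evens[i];
            result[2 * i + 1] = evens[i] + L[2 * i];
        }
        return result;
    }

    int main() {
        std::vector<int> x = {3, 1, 4, 1, 5, 9, 2, 6};
        std::vector<int> ref(x.size());
        std::exclusive_scan(x.begin(), x.end(), ref.begin(), 0);   // C++17 reference result
        std::printf("match: %s\n", Scan(x) == ref ? "yes" : "no"); // prints "yes"
    }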

30 Reasoning in PRAM Given a DAG with W(n) nodes and depth D(n): sequential runtime T_1(n) = W(n); time on infinitely many processors T_∞(n) = D(n); time on p processors T_p(n) = ? We know T_p(n) ≥ D(n) and T_p(n) ≥ W(n)/p. Can we bound T_p from above?

31 Brent's Theorem T_p(n) ≤ D(n) + (W(n) − D(n)) / p. Intuition: bound the parallelism available at each level of the DAG. [Diagram: DAG partitioned into levels from inputs to outputs.]

32 Brent's Theorem T_p breakdown: level 1 has s_1 operations and takes ⌈s_1/p⌉ time steps, level 2 has s_2 operations and takes ⌈s_2/p⌉ steps, …, level D has s_D operations and takes ⌈s_D/p⌉ steps, with Σ_{i=1}^{D} s_i = W. [Diagram: the DAG partitioned into levels with their operation counts.]

33 Brent's Theorem: Proof Observe that a DAG of depth D can be partitioned into D levels such that there are no dependencies between nodes in the same level. For each level i, define T_p^i as the time for level i with p processors and let s_i (= T_1^i) denote the number of nodes in level i. Thus: T_p^i = ⌈s_i / p⌉ ≤ (s_i + p − 1) / p. It follows that: T_p(n) ≤ Σ_{i=1}^{D} T_p^i ≤ Σ_{i=1}^{D} (s_i + p − 1) / p.

34 Brent's Theorem: Proof T_p(n) ≤ Σ_{i=1}^{D} (s_i + p − 1)/p = (1/p)·W(n) + ((p − 1)/p)·D(n) = D(n) + (W(n) − D(n))/p ≤ D(n) + W(n)/p. To conclude: since T_1/p = W(n)/p, we get T_p(n) ≤ W(n)/p + D(n) = T_1/p + T_∞.
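
As a concrete illustration, the bound can be evaluated for the binary-tree reduction from earlier, with W = n − 1 additions and D = ⌈log₂ n⌉ levels (a sketch; the numbers are only the model's prediction):

    #include <cmath>
    #include <cstdio>

    int main() {
        const double n = 1 << 20;                         // problem size
        const double W = n - 1;                           // work of a tree reduction
        const double D = std::ceil(std::log2(n));         // depth of a tree reduction
        for (int p : {1, 16, 256, 4096, 65536}) {
            const double bound = D + (W - D) / p;         // Brent's upper bound on T_p
            std::printf("p=%6d  T_p <= %10.1f   (lower bounds: W/p = %9.1f, D = %2.0f)\n",
                        p, bound, W / p, D);
        }
    }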

35 Speedups S_p(n) = T_1(n) / T_p(n); S_p(n) ≤ p; by Brent's theorem, S_p(n) ≥ W(n) / (D(n) + (W(n) − D(n))/p); and S_∞(n) = W(n) / D(n). For fixed n, speedup is limited; for n → ∞, speedup can be unbounded. Example: tree reduction.

36 Concurrency: Communication Models, the α-β Model Gist: α = latency (cycles), β = bandwidth (units/cycle). [Diagram: node A sends to node B over a link with latency α and bandwidth β.] How long does it take to send a message of size n? [Timeline: α cycles of latency followed by n · (1/β) cycles of transfer.]

37 Concurrency: α-β Model Time to send a message of size n: T(n) = α + n · (1/β) = α + n/β.
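
A minimal sketch that evaluates T(n) = α + n/β for a few message sizes (the α and β values are illustrative placeholders, not measurements from the lecture):

    #include <cstdio>

    int main() {
        const double alpha = 1000.0;                      // latency in cycles (assumed)
        const double beta  = 8.0;                         // bandwidth in bytes/cycle (assumed)
        for (double n : {1.0, 64.0, 4096.0, 262144.0, 16777216.0}) {
            const double t = alpha + n / beta;            // T(n) = alpha + n / beta
            std::printf("n = %10.0f bytes  T = %12.0f cycles  (%s-dominated)\n",
                        n, t, alpha > n / beta ? "latency" : "bandwidth");
        }
    }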

38 Little's Law: Primer Problem 1: In Tannenbar, on average 2 students arrive and 2 leave every minute, and each student spends 8 minutes inside. Q: How many students are in Tannenbar at a given time? A: 16. Problem 2: In your fridge you have 150 bottles of Vivi Kola (on average), and you drink and buy on average 25 bottles per week. Q: How many weeks does each bottle last? A: 6.

39 Little's Law Queueing theory: in a stationary system (input rate = output rate), the number of customers N in the queue is given by N = λW, where λ = arrival rate and W = average time spent in the system. (John Little) A simple rule, yet powerful: N is independent of the I/O distribution. Connection: in the α-β model, the arrival rate is β and the time spent in the system is α cycles.

40 Concurrency: Little's Law and the α-β Model [Diagram: β units arrive per cycle at the communication medium and β units leave per cycle; each unit spends α cycles in the system.] Concurrent units in transit: N = α · β.

41 Example: Memory Systems Latency × Throughput = Concurrency. Latency is reduced by half every 9 years, while throughput doubles every 3 years, so parallel processing is beneficial! [Diagram: Core 1 and Core 2 connected through a cache to memory.] Real-world statistics: Intel Core 2 (2006): α = 100 cycles, β = 2 bytes/cycle, N = 200. Intel Haswell (2014): α = 63 cycles, β = 23 bytes/cycle, N = 1449.
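
The N values follow directly from N = α · β. A minimal sketch that reproduces them (the closing remark about 64-byte cache lines is my own observation, not from the slide):

    #include <cstdio>

    struct System { const char* name; double alpha_cycles; double beta_bytes_per_cycle; };

    int main() {
        const System systems[] = {
            {"Intel Core 2 (2006)",  100.0,  2.0},
            {"Intel Haswell (2014)",  63.0, 23.0},
        };
        for (const auto& s : systems)
            std::printf("%-22s N = alpha * beta = %5.0f bytes in flight\n",
                        s.name, s.alpha_cycles * s.beta_bytes_per_cycle);
        // With 64-byte cache lines, Haswell's ~1449 bytes correspond to roughly
        // 23 outstanding misses needed to saturate memory bandwidth.
    }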
