Analysis of Multithreaded Algorithms


Analysis of Multithreaded Algorithms Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 (Moreno Maza) Analysis of Multithreaded Algorithms CS 4435 - CS 9624 1 / 47

Plan

1 Review of Complexity Notions
2 Divide-and-Conquer Recurrences
3 Matrix Multiplication
4 Merge Sort
5 Tableau Construction

1 Review of Complexity Notions

Orders of magnitude

Let $f$, $g$ and $h$ be functions from $\mathbb{N}$ to $\mathbb{R}$.

We say that $g(n)$ is in the order of magnitude of $f(n)$, and we write $f(n) \in \Theta(g(n))$, if there exist two strictly positive constants $c_1$ and $c_2$ such that for $n$ big enough we have
$$0 \le c_1\, g(n) \le f(n) \le c_2\, g(n). \qquad (1)$$

We say that $g(n)$ is an asymptotic upper bound of $f(n)$, and we write $f(n) \in O(g(n))$, if there exists a strictly positive constant $c_2$ such that for $n$ big enough we have
$$0 \le f(n) \le c_2\, g(n). \qquad (2)$$

We say that $g(n)$ is an asymptotic lower bound of $f(n)$, and we write $f(n) \in \Omega(g(n))$, if there exists a strictly positive constant $c_1$ such that for $n$ big enough we have
$$0 \le c_1\, g(n) \le f(n). \qquad (3)$$

Examples

With $f(n) = \frac{1}{2} n^2 - 3n$ and $g(n) = n^2$ we have $f(n) \in \Theta(g(n))$. Indeed we have
$$c_1\, n^2 \le \tfrac{1}{2} n^2 - 3n \le c_2\, n^2 \qquad (4)$$
for $n \ge 12$ with $c_1 = \frac{1}{4}$ and $c_2 = \frac{1}{2}$.

Assume that there exists a positive integer $n_0$ such that $f(n) > 0$ and $g(n) > 0$ for every $n \ge n_0$. Then we have
$$\max(f(n), g(n)) \in \Theta(f(n) + g(n)). \qquad (5)$$
Indeed we have
$$\tfrac{1}{2}\,(f(n) + g(n)) \le \max(f(n), g(n)) \le f(n) + g(n). \qquad (6)$$

Assume $a$ and $b$ are positive real constants. Then we have
$$(n + a)^b \in \Theta(n^b). \qquad (7)$$
Indeed for $n \ge a$ we have
$$0 \le n^b \le (n + a)^b \le (2n)^b. \qquad (8)$$
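
The constants in (4) can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and both sides of (4) are multiplied by 4 so that everything stays in integer arithmetic.

```cpp
#include <cassert>

// 4 * f(n) for f(n) = n^2/2 - 3n, kept as an integer.
long long four_f(long long n) { return 2 * n * n - 12 * n; }

// True when c1*g(n) <= f(n) <= c2*g(n) with g(n) = n^2, c1 = 1/4, c2 = 1/2.
// After multiplying by 4 this reads n^2 <= 4 f(n) <= 2 n^2.
bool theta_bounds_hold(long long n) {
    assert(n >= 0);
    const long long g = n * n;
    return g <= four_f(n) && four_f(n) <= 2 * g;
}
```

Calling `theta_bounds_hold(12)` succeeds while `theta_bounds_hold(11)` fails, which matches the threshold $n \ge 12$ stated above.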

Properties

$f(n) \in \Theta(g(n))$ holds iff $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$ hold together.

Each of the predicates $f(n) \in \Theta(g(n))$, $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$ defines a reflexive and transitive binary relation among the $\mathbb{N}$-to-$\mathbb{R}$ functions. Moreover $f(n) \in \Theta(g(n))$ is symmetric.

We have the following transposition formula:
$$f(n) \in O(g(n)) \iff g(n) \in \Omega(f(n)). \qquad (9)$$

In practice $\in$ is replaced by $=$ in each of the expressions $f(n) \in \Theta(g(n))$, $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$. Hence, the following
$$f(n) = h(n) + \Theta(g(n)) \qquad (10)$$
means
$$f(n) - h(n) \in \Theta(g(n)). \qquad (11)$$

Another example

Let us give another fundamental example. Let $p(n)$ be a (univariate) polynomial with degree $d > 0$. Let $a_d$ be its leading coefficient and assume $a_d > 0$. Then we have
(1) if $k \ge d$ then $p(n) \in O(n^k)$,
(2) if $k \le d$ then $p(n) \in \Omega(n^k)$,
(3) if $k = d$ then $p(n) \in \Theta(n^k)$.

Exercise: Prove the following
$$\sum_{k=1}^{n} k \in \Theta(n^2). \qquad (12)$$
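
One possible route through the exercise, offered as a sketch rather than as the intended solution, uses the closed form of the sum and exhibits explicit constants $c_1 = 1/2$ and $c_2 = 1$:

```latex
\[
\sum_{k=1}^{n} k \;=\; \frac{n(n+1)}{2},
\qquad
\frac{1}{2}\, n^2 \;\le\; \frac{n(n+1)}{2} \;\le\; n^2
\quad \text{for all } n \ge 1 .
\]
```

The right-hand inequality is equivalent to $n + 1 \le 2n$, which holds as soon as $n \ge 1$.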

2 Divide-and-Conquer Recurrences

Divide-and-Conquer Algorithms

Divide-and-conquer algorithms proceed as follows.
- Divide the input problem into sub-problems.
- Conquer the sub-problems by solving them directly if they are small enough, or recursively otherwise.
- Combine the solutions of the sub-problems to obtain the solution of the input problem.

Equation satisfied by $T(n)$. Assume that the size of the input problem increases with an integer $n$. Let $T(n)$ be the time complexity of a divide-and-conquer algorithm solving this problem. Then $T(n)$ satisfies an equation of the form
$$T(n) = a\, T(n/b) + f(n), \qquad (13)$$
where $f(n)$ is the cost of the combine part, $a \ge 1$ is the number of recursive calls and $n/b$, with $b > 1$, is the size of a sub-problem.

Tree associated with a divide-and-conquer recurrence

Labeled tree associated with the equation. Assume $n$ is a power of $b$, say $n = b^p$. To solve the equation
$$T(n) = a\, T(n/b) + f(n)$$
we can associate a labeled tree $A(n)$ with it as follows.
(1) If $n = 1$, then $A(n)$ is reduced to a single leaf labeled $T(1)$.
(2) If $n > 1$, then the root of $A(n)$ is labeled by $f(n)$ and $A(n)$ has $a$ labeled sub-trees, all equal to $A(n/b)$.

The labeled tree $A(n)$ associated with $T(n) = a\, T(n/b) + f(n)$ has height $p + 1$. Moreover the sum of its labels is $T(n)$.
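
The statement that the labels of $A(n)$ sum to $T(n)$ can be illustrated by walking the tree recursively. The following is a minimal sketch, assuming integer costs and $n$ an exact power of $b$; the name `solve_recurrence` and its signature are hypothetical and not part of the slides.

```cpp
#include <cassert>
#include <functional>

// Sums the labels of the tree A(n) for T(n) = a*T(n/b) + f(n), T(1) = t1.
// Assumes n is an exact power of b.
long long solve_recurrence(long long a, long long b, long long n,
                           const std::function<long long(long long)> &f,
                           long long t1) {
    assert(a >= 1 && b > 1 && n >= 1);
    if (n == 1) return t1;   // a single leaf, labeled T(1)
    assert(n % b == 0);      // otherwise n is not a power of b
    // a identical sub-trees A(n/b), plus the root label f(n)
    return a * solve_recurrence(a, b, n / b, f, t1) + f(n);
}
```

For instance, with $a = b = 2$, $f(n) = n$ and $T(1) = 1$, the call `solve_recurrence(2, 2, 8, [](long long n) { return n; }, 1)` adds up $8 + 8 + 8$ for the three internal levels and $8$ for the leaves.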

Solving divide-and-conquer recurrences (1/2)

[Figure: successive expansions of the recursion tree for $T(n) = a\, T(n/b) + f(n)$. The root $f(n)$ has $a$ children $f(n/b)$, each of which has $a$ children $f(n/b^2)$, and so on down to the leaves $T(1)$.]

Solving divide-and-conquer recurrences (2/2)

[Figure: the fully expanded recursion tree, of height $h = \log_b n$. The level sums are $f(n)$, $a\, f(n/b)$, $a^2 f(n/b^2)$, ..., and the leaf level contributes $a^{\log_b n}\, T(1) = \Theta(n^{\log_b a})$.]

IDEA: Compare $n^{\log_b a}$ with $f(n)$.

Master Theorem: case $n^{\log_b a} \gg f(n)$

[Figure: recursion tree of height $h = \log_b n$. The level sums $f(n)$, $a\, f(n/b)$, $a^2 f(n/b^2)$, ... are GEOMETRICALLY INCREASING, and the leaf level contributes $a^{\log_b n}\, T(1) = \Theta(n^{\log_b a})$.]

Specifically, $f(n) = O(n^{\log_b a - \varepsilon})$ for some constant $\varepsilon > 0$. Then $T(n) = \Theta(n^{\log_b a})$.

Master Theorem: case $f(n) \in \Theta(n^{\log_b a} \log^k n)$

[Figure: recursion tree of height $h = \log_b n$. The level sums $f(n)$, $a\, f(n/b)$, $a^2 f(n/b^2)$, ... are ARITHMETICALLY INCREASING, and the leaf level contributes $a^{\log_b n}\, T(1) = \Theta(n^{\log_b a})$.]

Specifically, $f(n) = \Theta(n^{\log_b a} \lg^k n)$ for some constant $k \ge 0$. Then $T(n) = \Theta(n^{\log_b a} \lg^{k+1} n)$.

Master Theorem: case where $f(n) \gg n^{\log_b a}$

[Figure: recursion tree of height $h = \log_b n$. The level sums $f(n)$, $a\, f(n/b)$, $a^2 f(n/b^2)$, ... are GEOMETRICALLY DECREASING, and the leaf level contributes $a^{\log_b n}\, T(1) = \Theta(n^{\log_b a})$.]

Specifically, $f(n) = \Omega(n^{\log_b a + \varepsilon})$ for some constant $\varepsilon > 0$, and $f(n)$ satisfies the regularity condition that $a\, f(n/b) \le c\, f(n)$ for some constant $c < 1$. Then $T(n) = \Theta(f(n))$.

More examples

Consider the relation:
$$T(n) = 2\, T(n/2) + n^2. \qquad (14)$$
Writing $n = 2^p$, we obtain:
$$T(n) = n^2 + \frac{n^2}{2} + \frac{n^2}{4} + \frac{n^2}{8} + \cdots + \frac{n^2}{2^{p-1}} + n\, T(1). \qquad (15)$$
Hence we have:
$$T(n) \in \Theta(n^2). \qquad (16)$$

Consider the relation:
$$T(n) = 3\, T(n/3) + n. \qquad (17)$$
We obtain:
$$T(n) \in \Theta(n \log_3 n). \qquad (18)$$

Master Theorem when b = 2

Let $a > 0$ be an integer and let $f, T : \mathbb{N} \to \mathbb{R}_+$ be functions such that
(i) $f(2n) \ge 2 f(n)$ and $f(n) \ge n$,
(ii) if $n = 2^p$ then $T(n) \le a\, T(n/2) + f(n)$.

Then for $n = 2^p$ we have
(1) if $a = 1$ then
$$T(n) \le (2 - 2/n)\, f(n) + T(1) \in O(f(n)), \qquad (19)$$
(2) if $a = 2$ then
$$T(n) \le f(n) \log_2(n) + T(1)\, n \in O(\log_2(n)\, f(n)), \qquad (20)$$
(3) if $a \ge 3$ then
$$T(n) \le \frac{2}{a-2} \left( n^{\log_2(a) - 1} - 1 \right) f(n) + T(1)\, n^{\log_2(a)} \in O(f(n)\, n^{\log_2(a) - 1}). \qquad (21)$$

Master Theorem when b = 2

Indeed
$$\begin{aligned}
T(2^p) &\le a\, T(2^{p-1}) + f(2^p) \\
&\le a \left[ a\, T(2^{p-2}) + f(2^{p-1}) \right] + f(2^p) \\
&= a^2\, T(2^{p-2}) + a\, f(2^{p-1}) + f(2^p) \\
&\le a^2 \left[ a\, T(2^{p-3}) + f(2^{p-2}) \right] + a\, f(2^{p-1}) + f(2^p) \\
&= a^3\, T(2^{p-3}) + a^2 f(2^{p-2}) + a\, f(2^{p-1}) + f(2^p) \\
&\le a^p\, T(1) + \sum_{j=0}^{p-1} a^j f(2^{p-j}).
\end{aligned} \qquad (22)$$

Master Theorem when b = 2

Moreover
$$f(2^p) \ge 2 f(2^{p-1}), \quad f(2^p) \ge 2^2 f(2^{p-2}), \quad \ldots, \quad f(2^p) \ge 2^j f(2^{p-j}). \qquad (23)$$
Thus
$$\sum_{j=0}^{p-1} a^j f(2^{p-j}) \le f(2^p) \sum_{j=0}^{p-1} \left( \frac{a}{2} \right)^j. \qquad (24)$$

Master Theorem when b = 2

Hence
$$T(2^p) \le a^p\, T(1) + f(2^p) \sum_{j=0}^{p-1} \left( \frac{a}{2} \right)^j. \qquad (25)$$
For $a = 1$ we obtain
$$T(2^p) \le T(1) + f(2^p) \sum_{j=0}^{p-1} \left( \frac{1}{2} \right)^j = T(1) + f(2^p)\, \frac{\frac{1}{2^p} - 1}{\frac{1}{2} - 1} = T(1) + f(n)\, (2 - 2/n). \qquad (26)$$
For $a = 2$ we obtain
$$T(2^p) \le 2^p\, T(1) + f(2^p)\, p = n\, T(1) + f(n) \log_2(n). \qquad (27)$$
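
The slides stop at $a = 2$. The remaining case $a \ge 3$ of (21) follows from (25) in the same way. The derivation below is a sketch filled in for completeness, using only $n = 2^p$ and the geometric sum:

```latex
\[
\sum_{j=0}^{p-1} \left( \frac{a}{2} \right)^{j}
  = \frac{(a/2)^{p} - 1}{a/2 - 1}
  = \frac{2}{a-2} \left( n^{\log_2(a) - 1} - 1 \right),
\qquad \text{since } \left( \frac{a}{2} \right)^{p} = \frac{a^{p}}{2^{p}} = \frac{n^{\log_2 a}}{n},
\]
\[
T(2^{p}) \;\le\; a^{p}\, T(1) + f(2^{p})\, \frac{2}{a-2} \left( n^{\log_2(a) - 1} - 1 \right)
  \;=\; n^{\log_2 a}\, T(1) + \frac{2}{a-2} \left( n^{\log_2(a) - 1} - 1 \right) f(n).
\]
```

The hypothesis $f(n) \ge n$ then lets the term $n^{\log_2 a}\, T(1)$ be absorbed into $O(f(n)\, n^{\log_2(a) - 1})$.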

Master Theorem cheat sheet

For $a \ge 1$ and $b > 1$, consider again the equation
$$T(n) = a\, T(n/b) + f(n). \qquad (28)$$
For any $\varepsilon > 0$ we have:
$$f(n) \in O(n^{\log_b a - \varepsilon}) \implies T(n) \in \Theta(n^{\log_b a}). \qquad (29)$$
For any $k \ge 0$ we have:
$$f(n) \in \Theta(n^{\log_b a} \log^k n) \implies T(n) \in \Theta(n^{\log_b a} \log^{k+1} n). \qquad (30)$$
For any $\varepsilon > 0$, and provided $f$ satisfies the regularity condition $a\, f(n/b) \le c\, f(n)$ for some constant $c < 1$, we have:
$$f(n) \in \Omega(n^{\log_b a + \varepsilon}) \implies T(n) \in \Theta(f(n)). \qquad (31)$$

Master Theorem quiz!

- $T(n) = 4\, T(n/2) + n$
- $T(n) = 4\, T(n/2) + n^2$
- $T(n) = 4\, T(n/2) + n^3$
- $T(n) = 4\, T(n/2) + n^2 / \log n$
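
The slides leave the quiz unanswered. As a worked sketch, in all four recurrences $a = 4$ and $b = 2$, so the quantity to compare with $f(n)$ is $n^{\log_2 4} = n^2$:

```latex
\begin{align*}
T(n) &= 4T(n/2) + n:
  & n &\in O(n^{2-\varepsilon}) \text{ with } \varepsilon = 1
  & &\Rightarrow\; T(n) \in \Theta(n^2), \\
T(n) &= 4T(n/2) + n^2:
  & n^2 &\in \Theta(n^2 \log^0 n)
  & &\Rightarrow\; T(n) \in \Theta(n^2 \log n), \\
T(n) &= 4T(n/2) + n^3:
  & n^3 &\in \Omega(n^{2+\varepsilon}) \text{ with } \varepsilon = 1,\;
          4 (n/2)^3 = \tfrac{1}{2} n^3
  & &\Rightarrow\; T(n) \in \Theta(n^3).
\end{align*}
```

The fourth recurrence falls outside the three cases: $n^2 / \log n$ is not $O(n^{2-\varepsilon})$ for any $\varepsilon > 0$, and it is not $\Theta(n^2 \log^k n)$ for any $k \ge 0$. Unrolling the recurrence with $n = 2^p$ gives $n^2 \sum_{j=1}^{p} 1/j$ plus the leaf term, so $T(n) \in \Theta(n^2 \log \log n)$.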

3 Matrix Multiplication

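Matrix multiplication

$$\underbrace{\begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix}}_{C} = \underbrace{\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}}_{A} \cdot \underbrace{\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{pmatrix}}_{B}$$

We will study three approaches:
- a naive and iterative one,
- a divide-and-conquer one,
- a divide-and-conquer one with memory management consideration.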

Naive iterative matrix multiplication

    cilk_for (int i=0; i<n; ++i) {
        cilk_for (int j=0; j<n; ++j) {
            for (int k=0; k<n; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

Work: ? Span: ? Parallelism: ?

Naive iterative matrix multiplication: analysis

For the loop nest above:
- Work: $\Theta(n^3)$
- Span: $\Theta(n)$
- Parallelism: $\Theta(n^2)$
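
For readers without a Cilk++ compiler, the loop nest can be tried out serially. The sketch below is an assumption-laden stand-in, not the slides' code: the `Matrix` alias and the name `naive_mmult` are made up here, and the two `cilk_for` loops are replaced by plain `for` loops, so it shows the $\Theta(n^3)$ work but none of the parallelism.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Serial stand-in for the loop nest: C += A * B for n x n matrices.
void naive_mmult(Matrix &C, const Matrix &A, const Matrix &B) {
    const std::size_t n = A.size();
    assert(B.size() == n && C.size() == n);
    for (std::size_t i = 0; i < n; ++i)          // cilk_for in the slides
        for (std::size_t j = 0; j < n; ++j)      // cilk_for in the slides
            for (std::size_t k = 0; k < n; ++k)  // serial: all k update C[i][j]
                C[i][j] += A[i][k] * B[k][j];
}
```

The innermost loop stays serial even in the parallel version because its iterations all accumulate into the same entry `C[i][j]`; this serial loop of length $n$ is what yields the $\Theta(n)$ span.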

Matrix multiplication based on block decomposition

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \cdot \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} & A_{11} B_{12} \\ A_{21} B_{11} & A_{21} B_{12} \end{pmatrix} + \begin{pmatrix} A_{12} B_{21} & A_{12} B_{22} \\ A_{22} B_{21} & A_{22} B_{22} \end{pmatrix}$$

The divide-and-conquer approach is simply the one based on blocking, presented in the first lecture.

Divide-and-conquer matrix multiplication

    // C <- C + A * B
    template <typename T>
    void MMult(T *C, T *A, T *B, int n, int size) {
        T *D = new T[n*n]();  // temporary, zero-initialized since MMult accumulates
        // base case & partition matrices
        cilk_spawn MMult(C11, A11, B11, n/2, size);
        cilk_spawn MMult(C12, A11, B12, n/2, size);
        cilk_spawn MMult(C22, A21, B12, n/2, size);
        cilk_spawn MMult(C21, A21, B11, n/2, size);
        cilk_spawn MMult(D11, A12, B21, n/2, size);
        cilk_spawn MMult(D12, A12, B22, n/2, size);
        cilk_spawn MMult(D22, A22, B22, n/2, size);
        MMult(D21, A22, B21, n/2, size);
        cilk_sync;
        MAdd(C, D, n, size);  // C += D;
        delete[] D;
    }

Work? Span? Parallelism?

Divide-and-conquer matrix multiplication: analysis

Let $A_p(n)$ and $M_p(n)$ be the times on $p$ processors for the $n \times n$ Add (MAdd) and Mult (MMult above).

$$A_1(n) = 4 A_1(n/2) + \Theta(1) = \Theta(n^2)$$
$$A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\lg n)$$
$$M_1(n) = 8 M_1(n/2) + A_1(n) = 8 M_1(n/2) + \Theta(n^2) = \Theta(n^3)$$
$$M_\infty(n) = M_\infty(n/2) + \Theta(\lg n) = \Theta(\lg^2 n)$$
$$M_1(n) / M_\infty(n) = \Theta(n^3 / \lg^2 n)$$

Divide-and-conquer matrix multiplication: No temporaries!

    template <typename T>
    void MMult2(T *C, T *A, T *B, int n, int size) {
        // base case & partition matrices
        cilk_spawn MMult2(C11, A11, B11, n/2, size);
        cilk_spawn MMult2(C12, A11, B12, n/2, size);
        cilk_spawn MMult2(C22, A21, B12, n/2, size);
        MMult2(C21, A21, B11, n/2, size);
        cilk_sync;
        cilk_spawn MMult2(C11, A12, B21, n/2, size);
        cilk_spawn MMult2(C12, A12, B22, n/2, size);
        cilk_spawn MMult2(C22, A22, B22, n/2, size);
        MMult2(C21, A22, B21, n/2, size);
        cilk_sync;
    }

Work? Span? Parallelism?

Divide-and-conquer matrix multiplication: No temporaries! (analysis)

Let $MA_p(n)$ be the time on $p$ processors for the $n \times n$ Mult-Add (MMult2 above).

$$MA_1(n) = \Theta(n^3)$$
$$MA_\infty(n) = 2\, MA_\infty(n/2) + \Theta(1) = \Theta(n)$$
$$MA_1(n) / MA_\infty(n) = \Theta(n^2)$$

Besides, saving space often saves time due to hierarchical memory.
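
The slides elide the "base case & partition matrices" step. One way it could be filled in is sketched below, under stated assumptions: matrices are stored row-major, `size` is the row length of the enclosing arrays, $n$ is a power of two, and the base case is a single scalar multiply-add. The name `MMult2Serial` is hypothetical, and the spawns are replaced by ordinary calls so that the sketch compiles as standard C++.

```cpp
#include <cassert>

// C += A * B on n x n blocks embedded row-major in arrays with row length `size`.
template <typename T>
void MMult2Serial(T *C, const T *A, const T *B, int n, int size) {
    assert(n >= 1 && (n & (n - 1)) == 0 && size >= n);
    if (n == 1) {                 // assumed base case: scalar multiply-add
        C[0] += A[0] * B[0];
        return;
    }
    const int h = n / 2;          // side of a quadrant
    const int down = h * size;    // offset from an upper to a lower quadrant
    T *C11 = C, *C12 = C + h, *C21 = C + down, *C22 = C + down + h;
    const T *A11 = A, *A12 = A + h, *A21 = A + down, *A22 = A + down + h;
    const T *B11 = B, *B12 = B + h, *B21 = B + down, *B22 = B + down + h;

    // First group: four products, each into a distinct quadrant of C.
    MMult2Serial(C11, A11, B11, h, size);
    MMult2Serial(C12, A11, B12, h, size);
    MMult2Serial(C22, A21, B12, h, size);
    MMult2Serial(C21, A21, B11, h, size);
    // The parallel code syncs here: the second group reuses the same quadrants.
    MMult2Serial(C11, A12, B21, h, size);
    MMult2Serial(C12, A12, B22, h, size);
    MMult2Serial(C22, A22, B22, h, size);
    MMult2Serial(C21, A22, B21, h, size);
}
```

The two groups write to the same four quadrants of `C`, which is why the parallel version needs the intermediate `cilk_sync` and why its span obeys $MA_\infty(n) = 2\, MA_\infty(n/2) + \Theta(1)$.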

4 Merge Sort

Merging two sorted arrays

    template <typename T>
    void Merge(T *C, T *A, T *B, int na, int nb) {
        while (na>0 && nb>0) {
            if (*A <= *B) {
                *C++ = *A++; na--;
            } else {
                *C++ = *B++; nb--;
            }
        }
        while (na>0) {
            *C++ = *A++; na--;
        }
        while (nb>0) {
            *C++ = *B++; nb--;
        }
    }

[Figure: merging the sorted arrays 3 12 19 46 and 4 14 21 23.]

Time for merging $n$ elements is $\Theta(n)$.

Merge sort

[Figure: merge sort on eight elements, read from bottom to top; each level is obtained from the one below by merging adjacent runs.]

    3  4 12 14 19 21 33 46      (merge)
    3 12 19 46 |  4 14 21 33    (merge)
    3 19 | 12 46 |  4 33 | 14 21    (merge)
    19  3 12 46 33  4 21 14

Parallel merge sort with serial merge

    template <typename T>
    void MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T C[n];  // temporary buffer of n elements
            cilk_spawn MergeSort(C, A, n/2);
            MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            Merge(B, C, C+n/2, n/2, n-n/2);
        }
    }

Work? Span?

Parallel merge sort with serial merge: analysis

For MergeSort above:

$T_1(n) = 2\, T_1(n/2) + \Theta(n)$, thus $T_1(n) = \Theta(n \lg n)$.
$T_\infty(n) = T_\infty(n/2) + \Theta(n)$, thus $T_\infty(n) = \Theta(n)$.
$T_1(n) / T_\infty(n) = \Theta(\lg n)$. Puny parallelism!

We need to parallelize the merge!

Parallel merge

[Figure: array A, of length na, is split at its median index ma = na/2. A binary search locates in array B, of length nb, the index mb such that the elements B[0..mb-1] are at most A[ma] and the elements B[mb..nb-1] are at least A[ma]. One recursive P_Merge handles the elements at most A[ma], the other handles the elements at least A[ma].]

Idea: if the total number of elements to be merged is $n = n_a + n_b$ (with $n_a \ge n_b$), then the number of elements in either of the two recursive merges is at most $3n/4$.

Parallel merge

    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
        if (na < nb) {
            P_Merge(C, B, A, nb, na);
        } else if (na==0) {
            return;
        } else {
            int ma = na/2;
            int mb = BinarySearch(A[ma], B, nb);
            C[ma+mb] = A[ma];
            cilk_spawn P_Merge(C, A, B, ma, mb);
            P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
            cilk_sync;
        }
    }

One should coarsen the base case for efficiency.

Work? Span?

Parallel merge: analysis

Let $PM_p(n)$ be the $p$-processor running time of P_Merge (code above).

In the worst case, the span of P_Merge satisfies
$$PM_\infty(n) \le PM_\infty(3n/4) + \Theta(\lg n) = \Theta(\lg^2 n).$$

The worst-case work of P_Merge satisfies the recurrence
$$PM_1(n) \le PM_1(\alpha n) + PM_1((1 - \alpha) n) + \Theta(\lg n).$$
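
The slides call `BinarySearch` without defining it. The sketch below supplies one plausible definition, which is an assumption: it returns the number of elements of the sorted range `B[0..nb)` that are strictly smaller than the key, computed with `std::lower_bound`. The merge itself follows the P_Merge code, with the spawn replaced by an ordinary call; the name `P_MergeSerial` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed helper: count of elements in sorted B[0..nb) that are < key.
template <typename T>
int BinarySearch(const T &key, const T *B, int nb) {
    return static_cast<int>(std::lower_bound(B, B + nb, key) - B);
}

// Serial rendering of P_Merge: merges sorted A[0..na) and B[0..nb) into C.
template <typename T>
void P_MergeSerial(T *C, const T *A, const T *B, int na, int nb) {
    assert(na >= 0 && nb >= 0);
    if (na < nb) {
        P_MergeSerial(C, B, A, nb, na);   // keep the longer array first
    } else if (na == 0) {
        return;                           // both inputs are empty
    } else {
        const int ma = na / 2;            // median index of A
        const int mb = BinarySearch(A[ma], B, nb);
        C[ma + mb] = A[ma];               // final position of the median
        P_MergeSerial(C, A, B, ma, mb);   // cilk_spawn in the slides
        P_MergeSerial(C + ma + mb + 1, A + ma + 1, B + mb,
                      na - ma - 1, nb - mb);
    }
}
```

Each call performs one $\Theta(\lg n)$ binary search and places exactly one element, which is where the $\Theta(\lg n)$ term of both recurrences comes from.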

Analyzing parallel merge

Recall $PM_1(n) \le PM_1(\alpha n) + PM_1((1 - \alpha) n) + \Theta(\lg n)$ for some $1/4 \le \alpha \le 3/4$.

To solve this hairy equation we use the substitution method. We assume there exist some constants $a, b > 0$ such that $PM_1(n) \le a n - b \lg n$ holds for all $1/4 \le \alpha \le 3/4$.

After substitution, this hypothesis implies: $PM_1(n) \le a n - b \lg n - b \lg n + \Theta(\lg n)$.

We can pick $b$ large enough such that we have $PM_1(n) \le a n - b \lg n$ for all $1/4 \le \alpha \le 3/4$ and all n > 1/

Then pick $a$ large enough to satisfy the base conditions.

Finally we have $PM_1(n) = \Theta(n)$.
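
The substitution step can be spelled out as follows. This is a sketch of the standard argument, applying the hypothesis to the two sub-problems of sizes $\alpha n$ and $(1 - \alpha) n$; the constant term $b \lg(\alpha (1 - \alpha))$ is the one the slide absorbs into $\Theta(\lg n)$.

```latex
\begin{align*}
PM_1(n)
 &\le \bigl( a \alpha n - b \lg(\alpha n) \bigr)
    + \bigl( a (1 - \alpha) n - b \lg((1 - \alpha) n) \bigr)
    + \Theta(\lg n) \\
 &= a n - b \lg n
    - \Bigl( b \lg n + b \lg\bigl( \alpha (1 - \alpha) \bigr) - \Theta(\lg n) \Bigr) \\
 &\le a n - b \lg n
 \qquad \text{as soon as } b \lg n + b \lg\bigl( \alpha (1 - \alpha) \bigr) \ge \Theta(\lg n).
\end{align*}
```

Since $\alpha (1 - \alpha) \ge 3/16$ on the interval $1/4 \le \alpha \le 3/4$, the term $b \lg(\alpha (1 - \alpha))$ is bounded below by a constant multiple of $b$, so a large enough $b$ makes the bracket nonnegative for all sufficiently large $n$.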

Parallel merge sort with parallel merge

    template <typename T>
    void P_MergeSort(T *B, T *A, int n) {
        if (n==1) {
            B[0] = A[0];
        } else {
            T C[n];
            cilk_spawn P_MergeSort(C, A, n/2);
            P_MergeSort(C+n/2, A+n/2, n-n/2);
            cilk_sync;
            P_Merge(B, C, C+n/2, n/2, n-n/2);
        }
    }

Work? Span?

Parallel merge sort with parallel merge: analysis

For P_MergeSort above:

The work satisfies $T_1(n) = 2\, T_1(n/2) + \Theta(n)$ (as usual) and we have $T_1(n) = \Theta(n \lg n)$.

The worst-case critical-path length of the merge sort now satisfies
$$T_\infty(n) = T_\infty(n/2) + \Theta(\lg^2 n) = \Theta(\lg^3 n).$$

The parallelism is now $\Theta(n \lg n) / \Theta(\lg^3 n) = \Theta(n / \lg^2 n)$.
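
The span bound is an instance of the second case of the Master Theorem with $a = 1$, $b = 2$ and $k = 2$, since $n^{\log_2 1} = 1$. It can also be seen by unrolling the recurrence, as in this sketch for $n$ a power of two:

```latex
\[
T_\infty(n)
 = \sum_{i=0}^{\lg n - 1} \Theta\!\left( \lg^2 \frac{n}{2^i} \right) + \Theta(1)
 = \Theta\!\left( \sum_{j=1}^{\lg n} j^2 \right)
 = \Theta(\lg^3 n),
\qquad \text{using } \sum_{j=1}^{m} j^2 = \frac{m (m+1) (2m+1)}{6} .
\]
```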

5 Tableau Construction

Tableau construction

    00 01 02 03 04 05 06 07
    10 11 12 13 14 15 16 17
    20 21 22 23 24 25 26 27
    30 31 32 33 34 35 36 37
    40 41 42 43 44 45 46 47
    50 51 52 53 54 55 56 57
    60 61 62 63 64 65 66 67
    70 71 72 73 74 75 76 77

Constructing a tableau $A$ satisfying a relation of the form:
$$A[i, j] = R(A[i-1, j],\ A[i-1, j-1],\ A[i, j-1]). \qquad (32)$$
The work is $\Theta(n^2)$.

Recursive construction

[Figure: an $n \times n$ tableau split into four quadrants: I (top left), II (top right), III (bottom left), IV (bottom right).]

Parallel code:

    I;
    cilk_spawn II;
    III;
    cilk_sync;
    IV;

$T_1(n) = 4\, T_1(n/2) + \Theta(1)$, thus $T_1(n) = \Theta(n^2)$.
$T_\infty(n) = 3\, T_\infty(n/2) + \Theta(1)$, thus $T_\infty(n) = \Theta(n^{\log_2 3})$.
Parallelism: $\Theta(n^{2 - \log_2 3}) = \Omega(n^{0.41})$.
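
The four-way order can be tried on a concrete relation. The sketch below makes several assumptions that are not in the slides: $R$ is taken to be the sum of its three arguments, the tableau carries an extra border row and column of ones to supply the missing neighbours, $n$ is a power of two, and the spawn is replaced by an ordinary call. The names `Tableau`, `R`, `fill_block` and `build_tableau` are hypothetical.

```cpp
#include <cassert>
#include <vector>

using Tableau = std::vector<std::vector<long long>>;

// Assumed relation R: the sum of the upper, diagonal and left neighbours.
long long R(long long up, long long diag, long long left) {
    return up + diag + left;
}

// Fills the n x n block with top-left cell (r0, c0) in the order I, II, III, IV.
void fill_block(Tableau &A, int r0, int c0, int n) {
    assert(n >= 1 && (n & (n - 1)) == 0 && r0 >= 1 && c0 >= 1);
    if (n == 1) {
        A[r0][c0] = R(A[r0 - 1][c0], A[r0 - 1][c0 - 1], A[r0][c0 - 1]);
        return;
    }
    const int h = n / 2;
    fill_block(A, r0, c0, h);          // I
    fill_block(A, r0, c0 + h, h);      // II  (cilk_spawn in the slides)
    fill_block(A, r0 + h, c0, h);      // III (runs in parallel with II)
    fill_block(A, r0 + h, c0 + h, h);  // IV  (after the cilk_sync)
}

// Builds an (n+1) x (n+1) tableau whose border row and column hold ones.
Tableau build_tableau(int n) {
    Tableau A(n + 1, std::vector<long long>(n + 1, 1));
    fill_block(A, 1, 1, n);
    return A;
}
```

Quadrants II and III depend only on quadrant I and on cells outside the block, which is why they may run in parallel, whereas IV needs all three. The span recurrence therefore counts three sub-problems on the critical path: I, then II or III, then IV.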

A more parallel construction

[Figure: an $n \times n$ tableau split into nine blocks.]

    I    II    IV
    III  V     VII
    VI   VIII  IX

Parallel code:

    I;
    cilk_spawn II;
    III;
    cilk_sync;
    cilk_spawn IV;
    cilk_spawn V;
    VI;
    cilk_sync;
    cilk_spawn VII;
    VIII;
    cilk_sync;
    IX;

$T_1(n) = 9\, T_1(n/3) + \Theta(1)$, thus $T_1(n) = \Theta(n^2)$.
$T_\infty(n) = 5\, T_\infty(n/3) + \Theta(1)$, thus $T_\infty(n) = \Theta(n^{\log_3 5})$.
Parallelism: $\Theta(n^{2 - \log_3 5}) = \Omega(n^{0.53})$.

This nine-way divide-and-conquer has more parallelism than the four-way one but exhibits more cache complexity (more on this later).

Acknowledgements

- Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
- Matteo Frigo (Intel) for supporting the work of my team with Cilk++ and offering us the next lecture.
- Yuzhen Xie (UWO) for helping me with the images used in these slides.
- Liyun Li (UWO) for generating the experimental data.