Communication Lower Bounds for Programs that Access Arrays


Nicholas Knight, Michael Christ, James Demmel, Thomas Scanlon, Katherine Yelick
UC Berkeley Scientific Computing and Matrix Computations Seminar, Apr. 24, 2013

We acknowledge funding from Microsoft (award #024263) and Intel (award #024894), and matching funding by UC Discovery (award #DIG ), with additional support from ParLab affiliates National Instruments, Nokia, NVIDIA, Oracle, and Samsung, and support from MathWorks. We also acknowledge the support of the US DOE (grants DE-SC , DE-SC , DE-SC , DE-SC , DE-AC02-05CH11231, DE-FC02-06ER25753, and DE-FC02-07ER25799), DARPA (award #HR ), and NSF (grant DMS ).

Communication is expensive!

Communication means moving data:
Serial communication = moving data across the memory hierarchy.
Parallel communication = moving data across the network.

Communication usually dominates runtime and energy cost. Avoid communication to save time and energy!

How much can you avoid? → Lower bound on data movement.
Attain the lower bound → Communication-optimal algorithm.

Outline

1. Avoiding Communication in Linear Algebra
   Lower bounds for matrix multiplication, attainable by blocking.
   Lower bounds for linear algebra (... attainable? see [BDHS11]).

2. Beyond Linear Algebra: Affine Array References, e.g., A(i + 2j, 3k + 4)
   Extends previous lower bounds to a larger class of programs.
   Lower bounds are computable.
   Matching upper bounds (i.e., optimal algorithms) in a special case: when array references pick a subset of the loop indices (e.g., A(i, k), B(k, j)).
   Ongoing work addresses attainability in the general case.

First Lower Bound: Matrix Multiplication ("Matmul")

A, B, C are N-by-N matrices; C := C + A·B.

for i = 1 : N, for j = 1 : N, for k = 1 : N,
  C(i, j) += A(i, k) * B(k, j)

Theorem ([HK81])
Consider computing C + A·B as above (in serial), with any order on the N^3 iterations. A processor must move

  # Words Moved = Ω(#iterations / (fast memory size)^{1/2}) = Ω(N^3 / M^{1/2})

words between slow memory (of unbounded capacity) and fast memory (of size M words).

First Lower Bound: Geometric Intuition [ITT04] (1/2)

for i = 1 : N, for j = 1 : N, for k = 1 : N,
  C(i, j) += A(i, k) * B(k, j)

Idea: Bound the volume of a set S of iterations by the areas of the shadows S casts.

[Figure: a set S = {(i, j, k)} inside the N×N×N iteration cube, with its shadows S_A, S_B, S_C on the three coordinate planes; for a box with side lengths x, y, z, the shadows have areas xz, zy, and yx.]

For a box: |S| = x·y·z = (xz · zy · yx)^{1/2} = |S_A|^{1/2} · |S_B|^{1/2} · |S_C|^{1/2}.
In general: |S| ≤ |S_A|^{1/2} · |S_B|^{1/2} · |S_C|^{1/2}, by the Loomis–Whitney inequality [LW49].

First Lower Bound: Geometric Intuition [ITT04] (2/2)

Idea: Bound the volume of a set S of iterations by the areas of the shadows S casts.
Idea: An upper bound on data reuse gives a lower bound on data movement.

Upper bound on the number of operands, M:
  max(|S_A|, |S_B|, |S_C|) ≤ |S_A| + |S_B| + |S_C| ≤ M.

Upper bound on the number of iterations doable given M operands:
  |S| ≤ |S_A|^{1/2} · |S_B|^{1/2} · |S_C|^{1/2} ≤ M^{3/2}.

Data reuse = # iterations / # operands = O(M^{1/2}). (See [BDHS11] for the precise argument.)

# words moved ≥ total # iterations / max data reuse = Ω(N^3 / M^{1/2}).
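To make the shadow bound concrete, here is a minimal numerical check (my own illustration, not from the talk) that the Loomis–Whitney inequality holds for an arbitrary finite iteration set; the random set S is hypothetical:

  # Check |S| <= |S_A|^(1/2) * |S_B|^(1/2) * |S_C|^(1/2) for a random
  # finite set S of lattice points, where S_A, S_B, S_C are the shadows
  # of S on the planes indexed by A(i,k), B(k,j), C(i,j).
  import random

  random.seed(0)
  S = {(random.randrange(8), random.randrange(8), random.randrange(8))
       for _ in range(100)}

  S_A = {(i, k) for (i, j, k) in S}  # entries of A touched
  S_B = {(k, j) for (i, j, k) in S}  # entries of B touched
  S_C = {(i, j) for (i, j, k) in S}  # entries of C touched

  bound = (len(S_A) * len(S_B) * len(S_C)) ** 0.5
  print(len(S), "<=", round(bound, 1))
  assert len(S) <= bound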

Attaining Lower Bounds: Blocking Matmul (1/3)

for i = 1 : N, for j = 1 : N, for k = 1 : N,
  C(i, j) += A(i, k) * B(k, j)

[Figure: C = A · B, with row i of A and column j of B combining into entry (i, j) of C.]

Attaining Lower Bounds: Blocking Matmul (1/3, continued)

Suppose M < N.

for i = 1 : N,
  for j = 1 : N,
    Load C(i, j)                    ... N^2 loads total
    for k = 1 : N,
      Load A(i, k) and B(k, j)      ... 2N^3 loads total
      C(i, j) += A(i, k) * B(k, j)
    Store C(i, j)                   ... N^2 stores total

# Words Moved = N^2 + 2N^3 + N^2 = O(N^3), which is suboptimal.

Attaining Lower Bounds: Blocking Matmul (2/3)

[Figure: blocked view of C = A · B — a block of C is computed from a block row of A and a block column of B.]

Attaining Lower Bounds: Blocking Matmul (3/3)

Suppose M < N, and let the block size b satisfy 3b^2 ≤ M; assume b divides N.

for i = 1 : N/b,
  for j = 1 : N/b,
    Load block C(i, j)                    ... (N/b)^2 block loads total
    for k = 1 : N/b,
      Load blocks A(i, k) and B(k, j)     ... 2(N/b)^3 block loads total
      C(i, j) += A(i, k) * B(k, j)        // block multiply-accumulate
    Store block C(i, j)                   ... (N/b)^2 block stores total

For the choice b = (M/3)^{1/2} (assume integer),

  # Words Moved = ((N/b)^2 + 2(N/b)^3 + (N/b)^2) · b^2 = O(N^3 / M^{1/2}),

which is asymptotically optimal.
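As a concrete companion to the pseudocode above, here is a short Python sketch of the same blocked algorithm (an illustration under the stated assumptions that b divides N and 3b^2 ≤ M; numpy and the name blocked_matmul are mine, not from the talk):

  import numpy as np

  def blocked_matmul(A, B, C, b):
      # C += A * B, computed block-by-block; three b-by-b blocks fit in fast memory
      N = C.shape[0]
      assert N % b == 0, "assume b divides N, as on the slide"
      for i in range(0, N, b):
          for j in range(0, N, b):
              Cij = C[i:i+b, j:j+b].copy()          # load block C(i, j)
              for k in range(0, N, b):
                  # load blocks A(i, k) and B(k, j); multiply-accumulate
                  Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
              C[i:i+b, j:j+b] = Cij                 # store block C(i, j)

  N, b = 8, 4   # e.g., fast memory of M >= 3*b*b = 48 words
  A, B = np.random.rand(N, N), np.random.rand(N, N)
  C = np.zeros((N, N))
  blocked_matmul(A, B, C, b)
  assert np.allclose(C, A @ B)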

Lower Bounds for Linear Algebra

Theorem ([BDHS11])
Suppose we are given an index set Z ⊆ Z^3. Then the Matmul-like program

for (i, j, k) ∈ Z,
  C(i, j) = C(i, j) +_(ij) A(i, k) *_(ijk) B(k, j)

must move Ω(|Z| / M^{1/2}) words.

Under some technical assumptions, this yields lower bounds for:
BLAS 3, e.g., A·B, A^{-1}·B;
one-sided factorizations, e.g., LU, Cholesky, LDL^T, ILU(t);
orthogonal factorizations, e.g., Gram–Schmidt, QR, eigenvalue/singular value problems;
tensor contractions, some graph algorithms;
and sequences of these operations, interleaved arbitrarily.

Outline (revisited)

1. Avoiding Communication in Linear Algebra
   Lower bounds for matrix multiplication, attainable by blocking.
   Lower bounds for linear algebra (... attainable? see [BDHS11]).

2. Beyond Linear Algebra: Affine Array References, e.g., A(i + 2j, 3k + 4)
   Extends previous lower bounds to a larger class of programs.
   Lower bounds are computable.
   Matching upper bounds (i.e., optimal algorithms) in a special case: when array references pick a subset of the loop indices (e.g., A(i, k), B(k, j)).
   Ongoing work addresses attainability in the general case.

Generalization: Affine Array References

Given loop iterations indexed by i = (i_1, ..., i_d) ∈ Z^d, and parameters:

Param.    Description                                           Example: Matmul
d         dimension of the iteration space (the rank of Z^d)    3
Z         iteration space, a subset of Z^d                      {1, ..., N}^3
A_j, d_j  each d_j-dimensional array A_j is subscripted         {A, B, C}, {2, 2, 2}
          by Z^{d_j}
φ_j       subscripts φ_j : Z^d → Z^{d_j} (affine                {(i, k), (k, j), (i, j)}
          combinations of loop indices)
m         number of arrays (injections into memory locations)   3

for i = (i_1, ..., i_d) ∈ Z ⊆ Z^d,
  inner_loop_i(A_1(φ_1(i)), ..., A_m(φ_m(i)))
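To make the notation concrete, here is a small Python illustration (mine, not from the talk) of the subscripts φ_j as integer matrices acting on an iteration vector; for Matmul each φ_j just projects a subset of (i, j, k), while an affine reference like A(i + 2j, 3k + 4) adds a constant offset:

  import numpy as np

  # Matmul subscripts as linear maps Z^3 -> Z^2
  phi_A = np.array([[1, 0, 0],    # A(i, k): select i ...
                    [0, 0, 1]])   # ... and k
  phi_B = np.array([[0, 0, 1],    # B(k, j)
                    [0, 1, 0]])
  phi_C = np.array([[1, 0, 0],    # C(i, j)
                    [0, 1, 0]])

  it = np.array([2, 5, 7])        # one iteration (i, j, k) = (2, 5, 7)
  print(phi_A @ it, phi_B @ it, phi_C @ it)   # -> [2 7] [7 5] [2 5]

  # the affine example A(i + 2j, 3k + 4): linear part plus an offset
  phi = np.array([[1, 2, 0],
                  [0, 0, 3]])
  offset = np.array([0, 4])
  print(phi @ it + offset)                    # -> [12 25]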

Lower Bound Strategy for Affine Array References

Proof strategy:
1. For any (finite) set E ⊆ Z of loop iterations that accesses O(M) operands (i.e., a "cache block"), find σ such that |E| = O(M^σ). Matmul: σ = 3/2.
2. Since data reuse = O(M^{σ-1}), conclude that the program must move Ω(|Z| / M^{σ-1}) words. Matmul: Ω(N^3 / M^{1/2}) words.

Upper Bounds via Hölder–Brascamp–Lieb (HBL) Theory

Theorem (Extension of [BCCT10, Theorem 2.4])
For j ∈ {1, ..., m}, let φ_j : Z^d → Z^{d_j} be a group homomorphism and s_j be a nonnegative number. Then

  rank(H) ≤ Σ_{j=1}^m s_j · rank(φ_j(H))  for all subgroups H of Z^d,

if and only if

  |E| ≤ Π_{j=1}^m |φ_j(E)|^{s_j}  for all finite subsets E of Z^d.

Lower Bounds for Affine Array References

Theorem (Communication Lower Bound)
A program of the form

for i = (i_1, ..., i_d) ∈ Z,
  inner_loop_i(A_1(φ_1(i)), ..., A_m(φ_m(i)))

must move Ω(|Z| / M^{(Σ_j s_j) - 1}) words, where s = (s_1, ..., s_m) satisfies rank(H) ≤ Σ_j s_j · rank(φ_j(H)) for all subgroups H of Z^d.

Proof sketch: Let E be any "cache block"; bound the data reuse (# iterations / # operands).
1. Operands: max_j |A_j(φ_j(E))| = max_j |φ_j(E)| ≤ M.
2. Iterations: |E| ≤ Π_j |φ_j(E)|^{s_j} ≤ (max_j |φ_j(E)|)^{Σ_j s_j}.
3. Data reuse: O(M^{(Σ_j s_j) - 1}), so # words moved: Ω(|Z| / M^{(Σ_j s_j) - 1}).
(See the paper for the precise argument.)

A Linear Program to Compute σ (1/2)

We can write the set of inequalities (for subgroups H_1, ..., H_i, ... of Z^d)

  rank(H_1) ≤ Σ_{j=1}^m s_j · rank(φ_j(H_1))
  ...
  rank(H_i) ≤ Σ_{j=1}^m s_j · rank(φ_j(H_i))
  ...

as a system of inequalities

  [ rank(φ_1(H_1)) ··· rank(φ_m(H_1)) ] [ s_1 ]   [ rank(H_1) ]
  [       ⋮                 ⋮         ] [  ⋮  ] ≥ [     ⋮     ]
  [ rank(φ_1(H_i)) ··· rank(φ_m(H_i)) ] [ s_m ]   [ rank(H_i) ]
  [       ⋮                 ⋮         ]           [     ⋮     ]

or, more succinctly, as Δ s ≥ r.

A Linear Program to Compute σ (2/2)

Observation
Any s ∈ [0, ∞)^m that satisfies Δ s ≥ r leads to a valid upper bound M^σ, with σ = σ(s) = Σ_j s_j = 1^T s. The smallest σ(s) leads to the tightest upper bound M^{σ(s)}, and thus the tightest lower bound Ω(|Z| / M^{σ(s) - 1}).

Definition (HBL LP)
  minimize σ(s) = 1^T s  subject to  Δ s ≥ r
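As an illustration of the LP (a hedged sketch: the sampled subgroups and the use of scipy are my choices, not from the talk), here is the HBL LP instantiated for Matmul with four subgroups of Z^3, the three coordinate axes and Z^3 itself:

  import numpy as np
  from scipy.optimize import linprog

  # rows: subgroups H; columns: arrays (A, B, C); entry = rank(phi_j(H))
  Delta = np.array([[1, 0, 1],   # H = <e_i>
                    [0, 1, 1],   # H = <e_j>
                    [1, 1, 0],   # H = <e_k>
                    [2, 2, 2]])  # H = Z^3
  r = np.array([1, 1, 1, 3])     # rank(H) for each row

  # HBL LP: minimize 1^T s  s.t.  Delta s >= r, s >= 0
  res = linprog(c=np.ones(3), A_ub=-Delta, b_ub=-r, bounds=(0, None))
  print(res.x, "sigma =", res.fun)   # -> s = [0.5 0.5 0.5], sigma = 1.5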

(Un-)Decidability (1/4)

How do we write down the infinitely many constraints Δ s ≥ r?

Observation
The entries of Δ and r lie in {0, 1, ..., d}, so there are at most (d + 1)^{m+1} unique constraints.

For each of the (d + 1)^{m+1} possible constraints, is it a row of (Δ, r)?

Question (Version 1/3)
Given natural numbers r_0, r_1, ..., r_m, is there a subgroup H of Z^d with rank(H) = r_0 and rank(φ_j(H)) = r_j for each j?

We can also treat each φ_j as a d-by-d integer matrix, and ask:

Question (Version 2/3)
... is there a d-by-d integer matrix H with rank(H) = r_0 and rank(φ_j · H) = r_j for each j?

(Un-)Decidability (2/4)

A matrix has rank r iff all (r + 1)-by-(r + 1) minors are zero and at least one r-by-r minor is nonzero. Each minor of H, φ_1·H, ..., φ_m·H is a polynomial whose variables are the d^2 entries of H and whose coefficients are computed from the d^2 entries of φ_j.

[Figure: a matrix of entries a_11, ..., a_76 with a square minor highlighted.]

Eliminating quantifiers, we obtain a system S of polynomial equations.

Question (Version 3/3)
Does S have an integer solution?

(Un-)Decidability (3/4)

Question (Hilbert's Tenth Problem for Z)
Given any system of integer polynomial equations, can we decide whether it has an integer solution?

No: there are undecidable systems (the MRDP Theorem).

Is our problem really this hard?

Observation
If a rational matrix H has the desired rank properties, we can obtain an integer matrix with the same properties.

(Un-)Decidability (4/4)

Question (Hilbert's Tenth Problem for Q)
Given any system of rational polynomial equations, can we decide whether it has a rational solution? A longstanding open question.

Theorem
Our question is decidable if and only if the answer to HTP(Q) is "yes."

Proof sketch: (⇐) Solve the polynomial system. (⇒) Encode any polynomial system as a set of matrices φ_1, ..., φ_m with certain properties...

An Equivalent Linear Program (1/2)

Observation
The feasible region of the HBL LP is a convex polytope P:
P is the intersection of the half-spaces defined by the rows of Δ s ≥ r.
There may be many pairs (Δ, r) that have the same feasible region, at least one of which has a finite number of rows.

An Equivalent Linear Program (2/2)

Let H_1, H_2, ... be a sequence of all subgroups of Z^d, and let P be the induced polytope. Each subsequence H_1, ..., H_N induces a polytope P_N.

Lemma: For each N, the corresponding polytope P_N ⊇ P.
Lemma: There is a finite M such that for all N ≥ M, P_N = P.
Lemma: There is a (decidable) algorithm to check whether P_N = P.

Theorem: P is computable.

An Approximate Linear Program

It is also possible to obtain an approximation of (Δ, r):

1. Compute a superpolytope of P: sample subgroups H_1, ..., H_N; the resulting σ̌ ≤ σ. The communication lower bound Ω(|Z| / M^{σ̌ - 1}) is possibly invalid (too tight), or attainable.

2. Compute a subpolytope of P: embed into R, i.e., φ_j : R^d → R^{d_j}. HTP(R) is Tarski-decidable; the resulting σ̂ ≥ σ. The communication lower bound Ω(|Z| / M^{σ̂ - 1}) is valid, but possibly unattainable.

Special Case: Subscripts are Subsets of Indices
e.g., A(i, k), B(k, j), C(i, j): each subscript is a subset of {i, j, k}.

Theorem
Suppose every φ_j projects onto a subset of the loop indices i_1, ..., i_d. Let Δ_e be the d rows of Δ corresponding to the subgroups H_i = <e_i>, for i = 1 to d. Then the linear program

  minimize 1^T s  subject to  Δ_e s ≥ 1

yields the same optimum σ(s) as the HBL LP, and furthermore the dual linear program

  maximize 1^T x  subject to  Δ_e^T x ≤ 1

gives the optimal block size M^{x_i} for each loop.

Note: In practice, we only need to solve the dual: 1^T x = 1^T s = σ(s).
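In this special case the LP data is just array-index membership, so solving the dual is mechanical. A minimal sketch with scipy (assumed tooling; the matrix layout with one row per array is mine), using Matmul's subscripts A(i, k), B(k, j), C(i, j):

  import numpy as np
  from scipy.optimize import linprog

  # Delta_e^T: one row per array, one column per loop index (i, j, k);
  # entry is 1 iff the index appears in that array's subscript
  E = np.array([[1, 0, 1],   # A(i, k)
                [0, 1, 1],   # B(k, j)
                [1, 1, 0]])  # C(i, j)

  # dual LP: maximize 1^T x  s.t.  E x <= 1, x >= 0  (linprog minimizes)
  res = linprog(c=-np.ones(3), A_ub=E, b_ub=np.ones(3), bounds=(0, None))
  print(res.x, "sigma =", -res.fun)   # -> x = [0.5 0.5 0.5], sigma = 1.5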

Special Case, Example 1/3: Matmul

Original code:
for i = 1 : N, for j = 1 : N, for k = 1 : N,
  C(i, j) += A(i, k) * B(k, j)

Now we write down and solve the linear program (rows: arrays; columns: loop indices i, j, k):

  A    1  0  1
  B    0  1  1        x = (1/2, 1/2, 1/2),  σ(s) = 3/2
  C    1  1  0

Corollary
For any execution of the code, the number of words moved is Ω(|Z| / M^{σ(s) - 1}) = Ω(N^3 / M^{1/2}), and this is attained by blocks of size M^{1/2}-by-M^{1/2}-by-M^{1/2} in the following code (b = M^{1/2}):

for i_1 = 1 : b : N, for j_1 = 1 : b : N, for k_1 = 1 : b : N,
  for i_2 = 0 : b - 1, for j_2 = 0 : b - 1, for k_2 = 0 : b - 1,
    (i, j, k) = (i_1, j_1, k_1) + (i_2, j_2, k_2)
    ... // inner loop with index (i, j, k)

The block sizes may have to be smaller by a constant factor, e.g., b_i = (M/m)^{x_i} = (M/3)^{1/2}, so the blocks fit in cache simultaneously.

Special Case, Example 2/3: N-body

Original code:
for i = 1 : N, for j = 1 : N,
  F(i) += compute_force(P(i), P(j))

Now we write down and solve the linear program (rows: array references; columns: loop indices i, j):

  F(i)    1  0
  P(i)    1  0        x = (1, 1),  σ(s) = 2
  P(j)    0  1

Corollary
For any execution of the code, the number of words moved is Ω(|Z| / M^{σ(s) - 1}) = Ω(N^2 / M), and this is attained by blocks of size M-by-M in the following code (b = M):

for i_1 = 1 : b : N, for j_1 = 1 : b : N,
  for i_2 = 0 : b - 1, for j_2 = 0 : b - 1,
    (i, j) = (i_1, j_1) + (i_2, j_2)
    ... // inner loop with index (i, j)

The block sizes may have to be smaller by a constant factor, e.g., b_i = (M/m)^{x_i} = M/3, so the blocks fit in cache simultaneously.
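A minimal Python sketch of this blocked N-body loop (my illustration; the toy compute_force kernel is a hypothetical placeholder, not the talk's):

  import numpy as np

  def compute_force(p, q):
      # hypothetical 1D pairwise force, standing in for the real kernel
      r = p - q
      return r / (abs(r)**3 + 1e-9)

  def blocked_nbody(P, b):
      # b = Theta(M): the blocks P[i1:i1+b], P[j1:j1+b], F[i1:i1+b] form the
      # O(M)-word working set, attaining the Omega(N^2 / M) bound
      N = len(P)
      F = np.zeros(N)
      for i1 in range(0, N, b):
          for j1 in range(0, N, b):
              for i in range(i1, min(i1 + b, N)):
                  for j in range(j1, min(j1 + b, N)):
                      F[i] += compute_force(P[i], P[j])
      return F

  P = np.random.rand(100)
  # blocking changes only the summation order, not the result
  assert np.allclose(blocked_nbody(P, 16), blocked_nbody(P, 100))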

Special Case, Example 3/3: Complicated Code

Original code:
for i_1 = 1 : N, for i_2 = 1 : N, ..., for i_6 = 1 : N,
  A_1(i_1, i_3, i_6) += func_1(A_2(i_1, i_2, i_4), A_3(i_2, i_3, i_5), A_4(i_3, i_4, i_6))
  A_5(i_2, i_6) += func_2(A_6(i_1, i_4, i_5), A_3(i_3, i_4, i_6))

Now we write down and solve the linear program (rows: array references; columns: loop indices i_1, ..., i_6; note that A_4(i_3, i_4, i_6) and the second reference to A_3 share a row):

  A_1         1  0  1  0  0  1
  A_2         1  1  0  1  0  0
  A_3         0  1  1  0  1  0        x = (2/7, 3/7, 1/7, 2/7, 3/7, 4/7)
  A_4, A_3    0  0  1  1  0  1        σ(s) = 15/7
  A_6         1  0  0  1  1  0
  A_5         0  1  0  0  0  1

Corollary
For any execution of the code, the number of words moved is Ω(|Z| / M^{σ(s) - 1}) = Ω(N^6 / M^{8/7}), and this is attained by blocks of size M^{2/7}-by-M^{3/7}-by-M^{1/7}-by-M^{2/7}-by-M^{3/7}-by-M^{4/7} in the following code (b_i = M^{x_i}):

for i_{1,1} = 1 : b_1 : N, ..., for i_{1,6} = 1 : b_6 : N,
  for i_{2,1} = 0 : b_1 - 1, ..., for i_{2,6} = 0 : b_6 - 1,
    (i_1, ..., i_6) = (i_{1,1}, ..., i_{1,6}) + (i_{2,1}, ..., i_{2,6})
    ... // inner loop with index (i_1, ..., i_6)

The block sizes may have to be smaller by a constant factor, e.g., b_i = (M/m)^{x_i} = (M/6)^{x_i}, so the blocks fit in cache simultaneously.
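The same dual-LP machinery verifies the exponents claimed above; a hedged sketch (the scipy usage is my assumption), with one row per distinct array reference:

  import numpy as np
  from scipy.optimize import linprog

  # columns i_1..i_6; rows: A_1, A_2, A_3 (1st ref), A_4 = A_3 (2nd ref), A_6, A_5
  E = np.array([[1, 0, 1, 0, 0, 1],
                [1, 1, 0, 1, 0, 0],
                [0, 1, 1, 0, 1, 0],
                [0, 0, 1, 1, 0, 1],
                [1, 0, 0, 1, 1, 0],
                [0, 1, 0, 0, 0, 1]])

  # dual LP: maximize 1^T x  s.t.  E x <= 1, x >= 0
  res = linprog(c=-np.ones(6), A_ub=E, b_ub=np.ones(6), bounds=(0, None))
  print(np.round(res.x * 7), "sigma =", -res.fun)
  # -> x = (2/7, 3/7, 1/7, 2/7, 3/7, 4/7), sigma = 15/7 ~ 2.143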

Extending to Other Machine Models

Our lower bounds extend to more complicated machines:
Multiple levels of memory: apply the lower bound to each pair of adjacent levels.
Homogeneous parallel processors: |Z|/P work per processor.
Hierarchical parallel processors.
Heterogeneous machines: an optimization problem to balance the |Z| work.

[Figures: a two-level slow/fast memory; a multilevel hierarchy (fast levels 1, 2, 3 above slow memory); a network of processors, each with its own local memory.]

Optimal Parallel Algorithms

Sequential blocking gives parallel working sets.
Theory suggests replicating inputs/outputs (e.g., 3D Matmul); generalize 2.5D algorithms: N-body, tensor operations, etc.
We obtain optimal algorithms in the special case, i.e., Ω((|Z|/P) / M^{σ-1}) is attainable. Ongoing work addresses the general case...

Ongoing Work: Lower Bounds

We want to compute lower bounds more efficiently.

Claim
Rather than enumerating all subgroups H_1, H_2, ... of Z^d, we only need to check a subset: the lattice L generated by the kernels of φ_1, ..., φ_m.

L may be finite, e.g., in the subsets-of-indices case, or when m ≤ 3.
We may only need to enumerate a finite semilattice L' ⊆ L, e.g., when all subscripts φ_j have rank in {0, 1, d - 1, d}, or when d ≤ 4.

Ongoing Work: Optimal Algorithms

Claim
For any s on the surface of the polytope P, there is a sequence of subsets E of Z^d, of increasing cardinality, which satisfy |E| = Θ(Π_j |φ_j(E)|^{s_j}).

Question
Are there sets which satisfy |E| = Θ((max_j |φ_j(E)|)^{min_{s ∈ P} Σ_j s_j})?

Question
How big are the hidden constants?

Ongoing Work: Data Dependencies

A partial order on Z, encoded as a DAG, means some sets E are inadmissible (cannot be blocked).

Question
Can we tighten our bound |E| ≤ Π_j |φ_j(E)|^{s_j} for admissible sets?

Question
Can we extend our parallel theory to expose tradeoffs between concurrency, efficiency, memory, communication, ...?

Ongoing Work: Cost Model

So far we have only discussed communication volume (bandwidth cost). We are extending the model to address:

# messages/synchronizations (latency cost): with message size 1 ≤ w ≤ M words, a latency lower bound is # words moved / w = Ω(|Z| / (w · M^{σ-1})) = Ω(|Z| / M^σ) messages.
Energy/power costs.
Network topology, congestion.

Conclusions

Communication is slowing you down!
Lower bounds motivate new/improved algorithms.
Previous work: Matmul [HK81, ITT04], linear algebra [BDHS11].
This work: programs with affine array references.
Goal: a compiler that generates communication-optimal code.

Tech report to appear soon at bebop.cs.berkeley.edu

Thank You

References

[BCCT10] J. Bennett, A. Carbery, M. Christ, and T. Tao. Finite bounds for Hölder–Brascamp–Lieb multilinear inequalities. Mathematical Research Letters, 17(4):647–666, 2010.

[BDHS11] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM Journal on Matrix Analysis and Applications, 32(3):866–901, 2011.

[HK81] J.-W. Hong and H.T. Kung. I/O complexity: the red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, pages 326–333. ACM, 1981.

[ITT04] D. Irony, S. Toledo, and A. Tiskin. Communication lower bounds for distributed-memory matrix multiplication. Journal of Parallel and Distributed Computing, 64(9):1017–1026, 2004.

[LW49] L.H. Loomis and H. Whitney. An inequality related to the isoperimetric inequality. Bulletin of the American Mathematical Society, 55:961–962, 1949.


More information

COMBINATORIAL CURVE NEIGHBORHOODS FOR THE AFFINE FLAG MANIFOLD OF TYPE A 1 1

COMBINATORIAL CURVE NEIGHBORHOODS FOR THE AFFINE FLAG MANIFOLD OF TYPE A 1 1 COMBINATORIAL CURVE NEIGHBORHOODS FOR THE AFFINE FLAG MANIFOLD OF TYPE A 1 1 LEONARDO C. MIHALCEA AND TREVOR NORTON Abstract. Let X be the affine flag manifold of Lie type A 1 1. Its moment graph encodes

More information

6.895 PCP and Hardness of Approximation MIT, Fall Lecture 3: Coding Theory

6.895 PCP and Hardness of Approximation MIT, Fall Lecture 3: Coding Theory 6895 PCP and Hardness of Approximation MIT, Fall 2010 Lecture 3: Coding Theory Lecturer: Dana Moshkovitz Scribe: Michael Forbes and Dana Moshkovitz 1 Motivation In the course we will make heavy use of

More information

DM559 Linear and Integer Programming. Lecture 2 Systems of Linear Equations. Marco Chiarandini

DM559 Linear and Integer Programming. Lecture 2 Systems of Linear Equations. Marco Chiarandini DM559 Linear and Integer Programming Lecture Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. Outline 1. 3 A Motivating Example You are organizing

More information

Deciding Emptiness of the Gomory-Chvátal Closure is NP-Complete, Even for a Rational Polyhedron Containing No Integer Point

Deciding Emptiness of the Gomory-Chvátal Closure is NP-Complete, Even for a Rational Polyhedron Containing No Integer Point Deciding Emptiness of the Gomory-Chvátal Closure is NP-Complete, Even for a Rational Polyhedron Containing No Integer Point Gérard Cornuéjols 1 and Yanjun Li 2 1 Tepper School of Business, Carnegie Mellon

More information

LP Duality: outline. Duality theory for Linear Programming. alternatives. optimization I Idea: polyhedra

LP Duality: outline. Duality theory for Linear Programming. alternatives. optimization I Idea: polyhedra LP Duality: outline I Motivation and definition of a dual LP I Weak duality I Separating hyperplane theorem and theorems of the alternatives I Strong duality and complementary slackness I Using duality

More information

CSC Design and Analysis of Algorithms. LP Shader Electronics Example

CSC Design and Analysis of Algorithms. LP Shader Electronics Example CSC 80- Design and Analysis of Algorithms Lecture (LP) LP Shader Electronics Example The Shader Electronics Company produces two products:.eclipse, a portable touchscreen digital player; it takes hours

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh June 2009 1 Linear independence These problems both appeared in a course of Benny Sudakov at Princeton, but the links to Olympiad problems are due to Yufei

More information

Divisible Load Scheduling

Divisible Load Scheduling Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute

More information

An algebraic perspective on integer sparse recovery

An algebraic perspective on integer sparse recovery An algebraic perspective on integer sparse recovery Lenny Fukshansky Claremont McKenna College (joint work with Deanna Needell and Benny Sudakov) Combinatorics Seminar USC October 31, 2018 From Wikipedia:

More information

Data Dependences and Parallelization. Stanford University CS243 Winter 2006 Wei Li 1

Data Dependences and Parallelization. Stanford University CS243 Winter 2006 Wei Li 1 Data Dependences and Parallelization Wei Li 1 Agenda Introduction Single Loop Nested Loops Data Dependence Analysis 2 Motivation DOALL loops: loops whose iterations can execute in parallel for i = 11,

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 7: More on Householder Reflectors; Least Squares Problems Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 15 Outline

More information

Semidefinite programs and combinatorial optimization

Semidefinite programs and combinatorial optimization Semidefinite programs and combinatorial optimization Lecture notes by L. Lovász Microsoft Research Redmond, WA 98052 lovasz@microsoft.com http://www.research.microsoft.com/ lovasz Contents 1 Introduction

More information

Algebraic Methods in Combinatorics

Algebraic Methods in Combinatorics Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets

More information

III. Linear Programming

III. Linear Programming III. Linear Programming Thomas Sauerwald Easter 2017 Outline Introduction Standard and Slack Forms Formulating Problems as Linear Programs Simplex Algorithm Finding an Initial Solution III. Linear Programming

More information

CSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming

CSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming CSCI 1951-G Optimization Methods in Finance Part 01: Linear Programming January 26, 2018 1 / 38 Liability/asset cash-flow matching problem Recall the formulation of the problem: max w c 1 + p 1 e 1 = 150

More information

Chapter 2: Linear Programming Basics. (Bertsimas & Tsitsiklis, Chapter 1)

Chapter 2: Linear Programming Basics. (Bertsimas & Tsitsiklis, Chapter 1) Chapter 2: Linear Programming Basics (Bertsimas & Tsitsiklis, Chapter 1) 33 Example of a Linear Program Remarks. minimize 2x 1 x 2 + 4x 3 subject to x 1 + x 2 + x 4 2 3x 2 x 3 = 5 x 3 + x 4 3 x 1 0 x 3

More information

(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ).

(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ). CMPSCI611: Verifying Polynomial Identities Lecture 13 Here is a problem that has a polynomial-time randomized solution, but so far no poly-time deterministic solution. Let F be any field and let Q(x 1,...,

More information

The use of predicates to state constraints in Constraint Satisfaction is explained. If, as an

The use of predicates to state constraints in Constraint Satisfaction is explained. If, as an of Constraint Satisfaction in Integer Programming H.P. Williams Hong Yan Faculty of Mathematical Studies, University of Southampton, Southampton, UK Department of Management, The Hong Kong Polytechnic

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

CO 250 Final Exam Guide

CO 250 Final Exam Guide Spring 2017 CO 250 Final Exam Guide TABLE OF CONTENTS richardwu.ca CO 250 Final Exam Guide Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4,

More information

Applied Linear Algebra in Geoscience Using MATLAB

Applied Linear Algebra in Geoscience Using MATLAB Applied Linear Algebra in Geoscience Using MATLAB Contents Getting Started Creating Arrays Mathematical Operations with Arrays Using Script Files and Managing Data Two-Dimensional Plots Programming in

More information