Fundamental Tradeoff between Computation and Communication in Distributed Computing

Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr
Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA
Nokia Bell Labs, Holmdel, NJ, USA

Abstract. We introduce a general distributed computing framework, motivated by commonly used structures like MapReduce, and formulate an information-theoretic tradeoff between computation and communication in such a framework. We characterize the optimal tradeoff to within a constant factor, for all system parameters. In particular, we propose a coded scheme, namely Coded MapReduce (CMR), which creates and exploits coding opportunities in data shuffling for distributed computing, reducing the communication load by a factor that is linearly proportional to the computation load. We then prove a lower bound on the minimum communication load, and demonstrate that CMR achieves this lower bound to within a constant factor. This result reveals a fundamental connection between computation and communication in distributed computing: the two are inverse-linearly proportional to each other.

I. INTRODUCTION

We consider a general distributed computing framework, motivated by commonly used structures like MapReduce [1] and Spark [2], in which the computation is broken into two stages: Map and Reduce. In such a framework, distributed computing nodes first map parts of the input data to some intermediate values according to their designed Map functions. Next, they exchange the results of the computed Map functions among each other (a.k.a. data shuffling), in order to calculate the final output results distributedly using their designed Reduce functions. In this framework, communication (or data shuffling) is a key component, which often limits the performance of applications like tera-sort [1], ranked-inverted-index [3] and machine learning algorithms [4]. In Facebook's Hadoop cluster, for example, it is observed that 33% of the job run time is spent on data shuffling [4]. Thus motivated, the main objective of this paper is to formulate and characterize the information-theoretic tradeoff between computation and communication in this framework, and to demonstrate the significant impact of coding in reducing the communication load.

More specifically, we consider a distributed computing framework to compute $Q$ arbitrary output functions from $N$ input files, using $K$ distributed computing nodes interconnected through a shared link. As mentioned earlier, the framework decomposes the overall computation into computing a set of Map and Reduce functions distributedly across the nodes. We define the computation load $r$ of this framework as the normalized total number of computed Map functions at the nodes, and the communication load $L$ as the normalized total amount of information exchanged between the nodes in order to calculate the Reduce functions. Based on this formulation, we then ask the following fundamental question: given a computation load $r$, what is the minimum communication load $L^*(r)$ needed to compute the final output functions? We answer this question by characterizing $L^*(r)$ to within a constant factor for all system parameters. In other words, we approximately characterize the optimal tradeoff between computation and communication in the above framework.
Our result yields a fundamental relationship between computation and communication in distributed computing: the two are inverse-linearly proportional to each other, meaning that increasing the computation load by a factor of $x$ can reduce the communication load by the same factor. Quite interestingly, this tradeoff is only achieved by utilizing coding, and as we illustrate, an uncoded scheme is substantially sub-optimal. To show this result, we propose a general coded scheme, namely Coded MapReduce (CMR), which specifies a strategy to assign the computations of the Map and Reduce functions across the computing nodes, enabling a novel coding approach for data shuffling. In particular, CMR takes advantage of a carefully designed repetitive mapping of data blocks at distinct nodes, creating coded multicast opportunities that deliver data simultaneously to multiple nodes in one channel use. Perhaps surprisingly, compared with an uncoded data shuffling scheme, CMR reduces the communication load by exactly a factor of the computation load $r$. We also prove a lower bound on the minimum communication load $L^*(r)$, and demonstrate that CMR achieves this lower bound to within a constant factor, indicating that the communication load attained by CMR indeed characterizes the fundamental tradeoff between computation and communication in the considered framework. To prove the lower bound, we apply the cut-set bound on systems extended to many viable assignments of the Reduce functions (i.e., which node computes which Reduce function), and rely on the fact that the communication load is independent of the assignment.

Other related works. The problem of characterizing the minimum communication for distributed computing has been considered previously in several settings in both the computer science and information theory communities. In [5], a basic computing model is proposed, where two parties have inputs $x$ and $y$ and aim to compute a Boolean function $f(x, y)$ with the minimum number of bits exchanged between them. Also, the problem of characterizing the minimum communication required for computing the modulo-two sum of distributed binary sources with symmetric joint distribution was introduced in [6]. Following these two seminal works, a wide range of communication problems in the scope of distributed computing have been studied (cf. [7]-[11]). The key differences distinguishing the setting in this paper from most of the prior ones are: 1) we focus on the flow of communication in a general framework motivated by MapReduce, rather than on the structures of the functions or the input distributions; 2) we consider arbitrarily large numbers of output results, input data files and computing nodes; 3) we do not assume any special property (e.g., linearity) of the computed functions. The idea of efficiently creating and exploiting coded multicasting was initially proposed in the context of cache networks in [12], [13], and extended in [14], [15], where caches pre-fetch part of the content in a way that enables coding during the content delivery. On top of such coding opportunities, the proposed CMR scheme also exploits the naive multicasting opportunities due to the common interest in the same data at different nodes, further reducing the communication load.

II. PROBLEM FORMULATION

In this section, we propose a general distributed computing framework motivated by MapReduce, and define the function characterizing the tradeoff between computation and communication in such a framework. For some system parameters $Q, N, K \in \mathbb{N}$, we consider the problem of computing $Q$ output functions from $N$ input files using a cluster of $K$ distributed computing nodes (servers). More specifically, given $N$ input files $w_1, \ldots, w_N \in \mathbb{F}_{2^F}$, for some $F \in \mathbb{N}$, the goal is to compute $Q$ output functions $\phi_1, \ldots, \phi_Q$, where $\phi_q : (\mathbb{F}_{2^F})^N \to \mathbb{F}_{2^B}$, $q \in \{1, \ldots, Q\}$, maps all input files to a length-$B$ binary stream $u_q = \phi_q(w_1, \ldots, w_N) \in \mathbb{F}_{2^B}$. Motivated by MapReduce, we assume that, as illustrated in Fig. 1, the computation of the output function $\phi_q$, $q \in \{1, \ldots, Q\}$, can be decomposed as follows:

$$\phi_q(w_1, \ldots, w_N) = h_q(g_{q,1}(w_1), \ldots, g_{q,N}(w_N)), \qquad (1)$$

where

- The Map functions $\vec{g}_n = (g_{1,n}, \ldots, g_{Q,n}) : \mathbb{F}_{2^F} \to (\mathbb{F}_{2^T})^Q$, $n \in \{1, \ldots, N\}$, map the input file $w_n$ into $Q$ length-$T$ intermediate values $v_{q,n} = g_{q,n}(w_n) \in \mathbb{F}_{2^T}$, $q \in \{1, \ldots, Q\}$, for some $T \in \mathbb{N}$.
- The Reduce functions $h_q : (\mathbb{F}_{2^T})^N \to \mathbb{F}_{2^B}$, $q \in \{1, \ldots, Q\}$, map the intermediate values of the output function $\phi_q$ in all input files into the output value $u_q = h_q(v_{q,1}, \ldots, v_{q,N})$.

Fig. 1: Illustration of a two-stage distributed computing framework. The overall computation is decomposed into computing a set of Map and Reduce functions.

Remark 1. Note that for every set of output functions $\phi_1, \ldots, \phi_Q$ such a Map-Reduce decomposition exists (e.g., setting the $g_{q,n}$ to identity functions and $h_q$ to $\phi_q$). However, such a decomposition is not unique, and in the distributed computing literature there has been quite some work on developing appropriate decompositions of computations like join, sorting and matrix multiplication (cf. [1], [16]), which are suitable for efficient distributed computing.
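To make the decomposition in (1) concrete, here is a minimal Python sketch (our illustration, not part of the paper; the files, words, and function names are made up) of a word-count job: the Map function $g_{q,n}$ touches only file $w_n$, and the Reduce function $h_q$ combines the per-file intermediate values into $u_q$.

```python
# A minimal sketch of the Map-Reduce decomposition in (1), using word
# counting as the job: output function phi_q counts word q across all files.
# File contents and names are illustrative only.

files = {1: "the cat sat", 2: "the dog sat", 3: "the cat ran"}   # w_1..w_N
words = {1: "the", 2: "cat", 3: "sat"}                           # phi_1..phi_Q

def g(q, n):
    """Map function g_{q,n}: intermediate value v_{q,n} depends on w_n only."""
    return files[n].split().count(words[q])

def h(q, values):
    """Reduce function h_q: combines v_{q,1},...,v_{q,N} into the output u_q."""
    return sum(values)

# phi_q(w_1,...,w_N) = h_q(g_{q,1}(w_1),...,g_{q,N}(w_N))
for q in words:
    u_q = h(q, [g(q, n) for n in files])
    print(f"count of '{words[q]}':", u_q)
```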
Here we do not impose any constraint on how the Map and Reduce functions are chosen (for example, they can be arbitrary linear or non-linear functions). The above computation is carried out by $K$ distributed computing nodes, labelled Node 1, ..., Node $K$. They are interconnected through a shared and error-free link, such that the messages sent by one node are received by all the other nodes. Following the above decomposition, the computation proceeds in three phases: Map, Shuffle and Reduce.

Map Phase: Node $k$, $k \in \{1, \ldots, K\}$, computes the Map functions of a set of files $\mathcal{M}_k$, where $\mathcal{M}_k \subseteq \{w_1, \ldots, w_N\}$ is a design parameter. For each file $w_n$ in $\mathcal{M}_k$, Node $k$ computes $\vec{g}_n(w_n) = (v_{1,n}, \ldots, v_{Q,n})$. We assume that each file is mapped by at least one node, i.e., $\bigcup_{k=1,\ldots,K} \mathcal{M}_k = \{w_1, \ldots, w_N\}$.

Definition 1 (Computation Load). We define the computation load, denoted by $r$, as the total number of Map functions computed across the $K$ nodes, normalized by the number of files $N$, i.e., $r \triangleq \frac{\sum_{k=1}^{K} |\mathcal{M}_k|}{N}$. The computation load $r$ can be interpreted as the average number of nodes that map each file; here for simplicity we assume that $r \in \{1, \ldots, K\}$.

Shuffle Phase: Node $k$, $k \in \{1, \ldots, K\}$, is responsible for computing a subset of the output functions, whose indices are denoted by a set $\mathcal{W}_k \subseteq \{1, \ldots, Q\}$. The computations of the output functions are assigned uniformly and disjointly across the $K$ nodes, such that 1) $|\mathcal{W}_1| = \cdots = |\mathcal{W}_K| = \frac{Q}{K}$, and 2) $\mathcal{W}_j \cap \mathcal{W}_k = \emptyset$ for $j \neq k$. To compute the output value $u_q$ for some $q \in \mathcal{W}_k$, Node $k$ needs the intermediate values that are not computed locally in the Map phase, i.e., $\{v_{q,n} : q \in \mathcal{W}_k, w_n \notin \mathcal{M}_k\}$. Nodes communicate via the shared link to exchange the needed intermediate values. We formally define a shuffling scheme as follows: each node $k$, $k \in \{1, \ldots, K\}$, creates a message $X_k$ as a function of the intermediate values computed locally during the Map phase, i.e., $X_k = \psi_k(\{\vec{g}_n : w_n \in \mathcal{M}_k\})$, and broadcasts it to all other nodes.

Definition 2 (Communication Load). We define the communication load, denoted by $L$, as the number of bits communicated during the Shuffle phase, normalized by $QNT$, which is the total number of bits in all intermediate values $\{v_{q,n} : q \in \{1, \ldots, Q\}, n \in \{1, \ldots, N\}\}$.

Reduce Phase: Node $k$, $k \in \{1, \ldots, K\}$, uses the local results from the Map phase $\{\vec{g}_n : w_n \in \mathcal{M}_k\}$ and the messages $X_1, \ldots, X_K$ received in the Shuffle phase to construct the inputs to the corresponding Reduce functions of $\mathcal{W}_k$, and calculates $u_q = h_q(v_{q,1}, \ldots, v_{q,N})$ for all $q \in \mathcal{W}_k$.

We say that a computation-communication pair $(r, L) \in \mathbb{N} \times \mathbb{R}$ is feasible if there exist $\mathcal{M}_1, \ldots, \mathcal{M}_K$, $\mathcal{W}_1, \ldots, \mathcal{W}_K$ and a shuffling scheme such that Node $k$ can successfully compute all the output functions whose indices are in $\mathcal{W}_k$, for all $k \in \{1, \ldots, K\}$.

Definition 3. We define the computation-communication function of the distributed computing framework as

$$L^*(r) \triangleq \inf\{L : (r, L) \text{ is feasible}\}. \qquad (2)$$

$L^*(r)$ characterizes the optimal tradeoff between computation and communication in this framework.

Example (Uncoded Scheme). In the Shuffle phase of a simple uncoded scheme, each node receives the needed intermediate values sent uncodedly by some other nodes. A total of $QN$ intermediate values are needed, and $rN \cdot \frac{Q}{K} = \frac{rQN}{K}$ of them are already available after the Map phase. Hence, the communication load achieved by the uncoded scheme is

$$L_{\text{uncoded}}(r) = 1 - \frac{r}{K}. \qquad (3)$$

Remark 2. After the Map phase, each node knows the intermediate values of all output functions in the files it has mapped. Therefore, the data requirements of all valid assignments of the Reduce functions can be satisfied with identical communication loads (the shuffling schemes are different though). In other words, the communication load is independent of the assignment of the Reduce functions.
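The accounting behind (3) can be checked mechanically. The sketch below (our illustration, with example parameters $K = 4$, $N = 8$, $Q = 4$, $r = 2$ and a cyclic file placement) counts the intermediate values that must be shuffled uncoded and recovers $1 - r/K$.

```python
# A sketch (our illustration, with example parameters) verifying the uncoded
# load in (3): count the intermediate values each node still needs for its
# Reduce functions when every file is mapped by r nodes, chosen cyclically.
from itertools import cycle

K, N, Q, r = 4, 8, 4, 2
M = {k: set() for k in range(K)}              # M_k: files mapped by Node k
node_iter = cycle(range(K))
for n in range(N):                            # each file goes to r distinct nodes
    for _ in range(r):
        M[next(node_iter)].add(n)

# Disjoint uniform Reduce assignment: Node k computes Q/K output functions.
W = {k: set(range(k * Q // K, (k + 1) * Q // K)) for k in range(K)}

# Values v_{q,n} with q in W_k and w_n not in M_k must be shuffled uncoded.
needed = sum(len(W[k]) * (N - len(M[k])) for k in range(K))
print(needed / (Q * N))                       # 0.5
print(1 - r / K)                              # L_uncoded(r) = 1 - r/K = 0.5
```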
In this paper, we also consider a generalization of the above framework, which we call the cascaded distributed computing framework, where after the Map phase each Reduce function is computed by more than one node, in particular by $s$ nodes, for some $s \in \{1, \ldots, K\}$. This generalized model is motivated by the fact that many distributed computing jobs require multiple rounds of Map and Reduce computations, where the Reduce results of the previous round serve as the inputs to the Map functions of the next round. Computing each Reduce function at more than one node admits data redundancy for the subsequent Map-function computations, which can help to improve the fault-tolerance and reduce the communication load of the next-round data shuffling. A feasible computation-communication triple $(r, s, L)$ is defined similarly as before. We define the computation-communication function of the cascaded distributed computing framework as

$$L^*(r, s) \triangleq \inf\{L : (r, s, L) \text{ is feasible}\}. \qquad (4)$$

III. MAIN RESULTS

Theorem 1. The computation-communication function of the distributed computing framework, $L^*(r)$, is bounded as

$$\frac{1}{3 + \sqrt{5}} \cdot \frac{1}{r}\left(1 - \frac{r}{K}\right) < L^*(r) \leq \frac{1}{r}\left(1 - \frac{r}{K}\right), \qquad (5)$$

for sufficiently large $N$, and $r \in \{1, \ldots, K\}$.

To prove the upper bound, we propose a coded scheme in the next section, namely Coded MapReduce, which achieves the communication load $L_c(r) \triangleq \frac{1}{r}(1 - \frac{r}{K})$. We demonstrate that no other scheme can achieve a communication load smaller than $\frac{L_c(r)}{3+\sqrt{5}}$ by proving the lower bound in Section V.

Remark 3. Theorem 1 characterizes the computation-communication function of the distributed computing framework to within a constant multiplicative gap of $3 + \sqrt{5} < 5.24$.

Fig. 2: Comparison of the communication load achieved by the proposed coded scheme in Theorem 1 with that of the uncoded scheme in (3), for $Q = 10$ output functions, $N = 2520$ input files and $K = 10$ computing nodes.

Remark 4. The communication load achieved in Theorem 1 is less than that of the uncoded scheme in (3) by a multiplicative factor of $r$, which equals the computation load and can grow unboundedly as the number of nodes increases if, e.g., $r = \Theta(K)$. As illustrated in Fig. 2, while the communication load of the uncoded scheme decreases linearly as the computation load increases, the load $L_c(r)$ achieved in Theorem 1 is inverse-linearly proportional to the computation load.

Remark 5. While increasing the computation load $r$ causes a longer Map phase, the coded scheme of Theorem 1 maximizes the reduction of the communication load obtained from the extra computations. Therefore, Theorem 1 provides an analytical framework to optimally allocate the computation and communication resources, minimizing the job execution time.

Theorem 2. The computation-communication function of the cascaded distributed computing framework, $L^*(r, s)$, is upper bounded as

$$L^*(r, s) \leq \sum_{\ell = \max\{r+1,\, s\}}^{\min\{r+s,\, K\}} \frac{\ell \binom{K}{\ell} \binom{\ell-2}{r-1} \binom{r}{\ell-s}}{r \binom{K}{r} \binom{K}{s}}, \qquad (6)$$

for sufficiently large $Q$ and $N$, and $r, s \in \{1, \ldots, K\}$.

Remark 6. A preliminary part of this result, in particular the achievability for the special case of $s = 1$, i.e., the achievable scheme of Theorem 1, was presented in [17]. We note that when $s = 1$, Theorem 2 provides the same upper bound as Theorem 1, i.e., $L^*(r, 1) \leq \frac{1}{r}(1 - \frac{r}{K})$.

Fig. 3: Communication load achieved in Theorem 2 for $s = 1, 2, 3$, for $Q = 360$ output functions, $N = 2520$ input files and $K = 10$ computing nodes.

Remark 7. For any fixed $s$ (the number of nodes that compute each Reduce function), as illustrated in Fig. 3, the communication load achieved in Theorem 2 outperforms the linear relationship between computation and communication, i.e., it decreases super-linearly with respect to the computation load $r$.
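The bounds above are easy to evaluate numerically. The following sketch (our illustration, using Python's math.comb) computes $L_c(r)$ and the right-hand side of (6), checks that (6) reduces to the Theorem 1 bound when $s = 1$, and evaluates (6) at $K = 4$, $r = s = 2$, giving $4/9$ (the load of the running example in the next section).

```python
# A short sketch (our illustration) evaluating the achievable loads in
# Theorems 1 and 2, e.g. for the K = 10 setting of Figs. 2 and 3.
from math import comb

def L_uncoded(r, K):
    return 1 - r / K                                     # eq. (3)

def L_coded(r, K):
    return (1 / r) * (1 - r / K)                         # Theorem 1 upper bound

def L_cascaded(r, s, K):
    """Right-hand side of (6); reduces to L_coded when s = 1."""
    total = 0
    for l in range(max(r + 1, s), min(r + s, K) + 1):
        total += l * comb(K, l) * comb(l - 2, r - 1) * comb(r, l - s)
    return total / (r * comb(K, r) * comb(K, s))

K = 10
for r in range(1, K):
    assert abs(L_cascaded(r, 1, K) - L_coded(r, K)) < 1e-12
print([round(L_coded(r, K), 3) for r in range(1, 6)])    # inverse-linear decay
print(round(L_cascaded(2, 2, 4), 4))                     # 4/9 = 0.4444
```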

IV. ACHIEVABILITY: CODED MAPREDUCE

In this section, we prove the upper bounds in Theorems 1 and 2 by presenting and analyzing a coded scheme, which we call Coded MapReduce (CMR). We focus on the more general case considered in Theorem 2 with $s \geq 1$; the scheme for Theorem 1 simply follows by setting $s = 1$. When $r = K$, every node can map all the input files and compute all the output functions locally, so no communication is needed and $L^*(K) = 0$. In what follows, we focus on the case $r < K$. Along with the presentation of the general CMR scheme, we use an illustrative example to demonstrate the key ingredients of the scheme.

Example (Coded MapReduce). We illustrate the key ideas of CMR using an example with $Q = 6$ output functions, $N = 6$ input files, and $K = 4$ nodes. We consider the case where the computation load is $r = 2$, and each Reduce function is computed by $s = 2$ nodes.

A. Map Phase Design

We assume that the number of input files $N$ is sufficiently large such that $N = \binom{K}{r} \eta_1$ for some $\eta_1 \in \mathbb{N}$. In the Map phase, the $N$ input files are evenly partitioned into $\binom{K}{r}$ disjoint batches of size $\eta_1$, each corresponding to a subset $\mathcal{T} \subseteq \{1, \ldots, K\}$ of size $r$:

$$\{w_1, \ldots, w_N\} = \bigcup_{\mathcal{T} \subseteq \{1,\ldots,K\},\, |\mathcal{T}| = r} \mathcal{B}_{\mathcal{T}}, \qquad (7)$$

where $\mathcal{B}_{\mathcal{T}}$ denotes the batch of $\eta_1$ files corresponding to the subset $\mathcal{T}$. Given this partition, Node $k$, $k \in \{1, \ldots, K\}$, computes the Map functions of the files in $\mathcal{B}_{\mathcal{T}}$ if $k \in \mathcal{T}$, or equivalently, $\mathcal{B}_{\mathcal{T}} \subseteq \mathcal{M}_k$ if $k \in \mathcal{T}$. Since each node is in $\binom{K-1}{r-1}$ subsets of size $r$, each node computes $\binom{K-1}{r-1} \eta_1 = \frac{rN}{K}$ Map functions, i.e., $|\mathcal{M}_k| = \frac{rN}{K}$ for all $k \in \{1, \ldots, K\}$. After the Map phase, Node $k$, $k \in \{1, \ldots, K\}$, knows the intermediate values of all $Q$ output functions in the files in $\mathcal{M}_k$, i.e., $\{v_{q,n} : q \in \{1, \ldots, Q\}, w_n \in \mathcal{M}_k\}$.

Example (Coded MapReduce: Map Phase Design). In the example, the Map phase is designed such that every $r = 2$ nodes map a common batch of files. More precisely, as illustrated in Fig. 4, the sets of files mapped by the 4 nodes are $\mathcal{M}_1 = \{w_1, w_2, w_3\}$, $\mathcal{M}_2 = \{w_1, w_4, w_5\}$, $\mathcal{M}_3 = \{w_2, w_4, w_6\}$, and $\mathcal{M}_4 = \{w_3, w_5, w_6\}$. After the Map phase, Node $k$, $k \in \{1, 2, 3, 4\}$, knows the intermediate values of all $Q = 6$ output functions in the files in $\mathcal{M}_k$, i.e., $\{v_{q,n} : q \in \{1, \ldots, 6\}, w_n \in \mathcal{M}_k\}$.

Fig. 4: Illustration of the CMR scheme to compute $Q = 6$ output functions from $N = 6$ input files distributedly at $K = 4$ computing nodes. Each file is mapped by $r = 2$ nodes and each output function is computed by $s = 2$ nodes. After the Map phase, every node knows 6 intermediate values, one for each output function, in each file it has mapped. In the Shuffle phase, Node $k$, $k \in \{1, 2, 3, 4\}$, splits each locally computed intermediate value $v_{q,n}$ evenly into 2 segments, $v_{q,n} = (v_{q,n}^{(1)}, v_{q,n}^{(2)})$, and multicasts 2 random linear combinations $C_1^k(\cdot,\cdot,\cdot)$, $C_2^k(\cdot,\cdot,\cdot)$ of the segments associated with it.
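The file placement in (7) is straightforward to generate. The sketch below (our illustration, with $\eta_1 = 1$) enumerates the $r$-subsets of nodes with itertools.combinations and reproduces the placement $\mathcal{M}_1, \ldots, \mathcal{M}_4$ of the example.

```python
# A sketch (our illustration) of the Map phase design in (7): files are
# batched by r-subsets of nodes; here eta_1 = 1, so each batch B_T is one file.
from itertools import combinations

K, r = 4, 2
subsets = list(combinations(range(1, K + 1), r))   # each T with |T| = r
M = {k: [] for k in range(1, K + 1)}               # M_k: files mapped by Node k
for n, T in enumerate(subsets, start=1):           # B_T = {w_n}, assigned to T
    for k in T:
        M[k].append(f"w{n}")

print(M)
# {1: ['w1','w2','w3'], 2: ['w1','w4','w5'], 3: ['w2','w4','w6'], 4: ['w3','w5','w6']}
```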

B. Coded Data Shuffling

We assume that the number of output functions $Q$ is sufficiently large such that $Q = \binom{K}{s} \eta_2$ for some $\eta_2 \in \mathbb{N}$. The assignment of the computations of the Reduce functions is performed similarly to that of the Map functions: the $Q$ Reduce functions are first evenly partitioned into $\binom{K}{s}$ disjoint batches of size $\eta_2$, each corresponding to a unique subset $\mathcal{P}$ of $s$ nodes:

$$\{1, \ldots, Q\} = \bigcup_{\mathcal{P} \subseteq \{1,\ldots,K\},\, |\mathcal{P}| = s} \mathcal{D}_{\mathcal{P}}, \qquad (8)$$

where $\mathcal{D}_{\mathcal{P}}$ denotes the indices of the batch of $\eta_2$ Reduce functions corresponding to the subset $\mathcal{P}$. Given this partition, Node $k$, $k \in \{1, \ldots, K\}$, computes the Reduce functions whose indices are in $\mathcal{D}_{\mathcal{P}}$ if $k \in \mathcal{P}$, or equivalently, $\mathcal{D}_{\mathcal{P}} \subseteq \mathcal{W}_k$ if $k \in \mathcal{P}$. As a result, each node computes $\binom{K-1}{s-1} \eta_2 = \frac{sQ}{K}$ Reduce functions, i.e., $|\mathcal{W}_k| = \frac{sQ}{K}$ for all $k \in \{1, \ldots, K\}$.

Example (Coded MapReduce: Reduce-Function Assignment). For our running example, the Reduce functions are assigned such that every $s = 2$ nodes compute a common Reduce function. More specifically, as shown in Fig. 4, the sets of indices of the Reduce functions computed by the $K = 4$ nodes are $\mathcal{W}_1 = \{1, 2, 3\}$, $\mathcal{W}_2 = \{1, 4, 5\}$, $\mathcal{W}_3 = \{2, 4, 6\}$, and $\mathcal{W}_4 = \{3, 5, 6\}$. Therefore, for example, Node 1 still needs the intermediate values $\{v_{q,n} : q \in \{1, 2, 3\}, n \in \{4, 5, 6\}\}$ through data shuffling to compute its assigned Reduce functions $h_1, h_2, h_3$.

For a subset $\mathcal{S}$ of $\{1, \ldots, K\}$ and $\mathcal{K}_1 \subseteq \mathcal{S}$ with $|\mathcal{K}_1| = r$, we denote by $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ the set of intermediate values needed by all nodes in $\mathcal{S} \backslash \mathcal{K}_1$, needed by no node outside $\mathcal{S}$, and known exclusively by the nodes in $\mathcal{K}_1$. More formally:

$$\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1} \triangleq \{v_{q,n} : q \in \mathcal{W}_k \ \forall k \in \mathcal{S} \backslash \mathcal{K}_1,\ q \notin \mathcal{W}_k \ \forall k \notin \mathcal{S},\ w_n \in \mathcal{M}_k \ \forall k \in \mathcal{K}_1,\ w_n \notin \mathcal{M}_k \ \forall k \notin \mathcal{K}_1\}. \qquad (9)$$

The shuffling scheme of CMR consists of multiple rounds, each corresponding to all the subsets of the $K$ nodes with a particular size. Within each subset, the needed intermediate values are split into segments that are uniformly associated with a subset of the computing nodes. Then each node multicasts random linear combinations of the segments that are associated with it. More specifically, for all subsets $\mathcal{S} \subseteq \{1, \ldots, K\}$ of size $\max\{r+1, s\} \leq |\mathcal{S}| \leq \min\{r+s, K\}$ and all $\mathcal{K}_1 \subseteq \mathcal{S}$ of size $|\mathcal{K}_1| = r$:

1) For each intermediate value $v_{q,n}$ in $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$, we evenly split it into $r$ segments of $\frac{T}{r}$ bits, $v_{q,n}^{(1)}, \ldots, v_{q,n}^{(r)}$, and associate each segment with a distinct node in $\mathcal{K}_1$. By doing so, we evenly split $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ into $r$ disjoint partitions $\{\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1, j} : j \in \mathcal{K}_1\}$.

2) Each Node $j \in \mathcal{S}$ sends $\binom{|\mathcal{S}|-2}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$ random linear combinations of all the segments associated with it, which can be formally expressed as

$$\mathcal{U}^{\mathcal{S}}_j \triangleq \bigcup_{\mathcal{K}_1 \subseteq \mathcal{S} :\, |\mathcal{K}_1| = r,\, j \in \mathcal{K}_1} \mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1, j}. \qquad (10)$$

Example (Coded MapReduce: Coded Data Shuffling). Applying the above shuffling scheme to the running example results in two rounds of coded data exchange, which are illustrated in Fig. 4. In the first round, intermediate values are communicated within each subset of 3 nodes. In the subset $\mathcal{S} = \{1, 2, 3\}$, we have, as defined in (9), $\mathcal{V}^{\{1,2,3\}}_{\{1\}} = \{v_{1,4}, v_{2,4}\}$, $\mathcal{V}^{\{1,2,3\}}_{\{2\}} = \{v_{1,2}, v_{4,2}\}$, and $\mathcal{V}^{\{1,2,3\}}_{\{3\}} = \{v_{2,1}, v_{4,1}\}$. We then associate the intermediate values with the nodes such that Node 1 is responsible for sending $\{v_{1,2}, v_{2,1}\}$, Node 2 is responsible for sending $\{v_{4,1}, v_{1,4}\}$, and Node 3 is responsible for sending $\{v_{4,2}, v_{2,4}\}$. During the data shuffling, each node in the set $\{1, 2, 3\}$ multicasts the bit-wise XOR, denoted by $\oplus$, of its associated intermediate values. Since Node 2 knows $v_{2,1}$ and Node 3 knows $v_{1,2}$ from their locally computed Map functions, they can respectively decode $v_{1,2}$ and $v_{2,1}$ from the coded message $v_{1,2} \oplus v_{2,1}$. We employ the similar shuffling scheme in the other 3 subsets of 3 nodes. After the first round of shuffling:

- Node 1 decodes $(v_{1,4}, v_{1,5})$, $(v_{2,4}, v_{2,6})$ and $(v_{3,5}, v_{3,6})$,
- Node 2 decodes $(v_{1,2}, v_{1,3})$, $(v_{4,2}, v_{4,6})$ and $(v_{5,3}, v_{5,6})$,
- Node 3 decodes $(v_{2,1}, v_{2,3})$, $(v_{4,1}, v_{4,5})$ and $(v_{6,3}, v_{6,5})$,
- Node 4 decodes $(v_{3,1}, v_{3,2})$, $(v_{5,1}, v_{5,4})$ and $(v_{6,2}, v_{6,4})$.

Every node needs one more intermediate value (delivered in the second round of data shuffling) to compute each of its assigned Reduce functions.
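The first-round exchange is a plain XOR multicast. The toy sketch below (our illustration; the byte strings are made up) checks that the single coded message $v_{1,2} \oplus v_{2,1}$ sent by Node 1 serves Nodes 2 and 3 simultaneously.

```python
# A toy sketch (our illustration, made-up byte strings) of the first-round
# XOR multicast in S = {1, 2, 3}: Node 1 sends v_{1,2} XOR v_{2,1}; Node 2
# cancels v_{2,1} (it mapped w_1) and Node 3 cancels v_{1,2} (it mapped w_2).
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

v_1_2 = b"\x0a\x17\x2c\x41"      # v_{1,2}: needed by Node 2, known to Nodes 1, 3
v_2_1 = b"\x9e\x03\x55\x7f"      # v_{2,1}: needed by Node 3, known to Nodes 1, 2

coded = xor(v_1_2, v_2_1)        # Node 1 multicasts one T-bit message

assert xor(coded, v_2_1) == v_1_2   # Node 2 decodes v_{1,2}
assert xor(coded, v_1_2) == v_2_1   # Node 3 decodes v_{2,1}
print("one multicast served two nodes")
```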
In the second round, we consider the set of all 4 nodes. We split each intermediate value $v_{q,n}$ in $\{\mathcal{V}^{\{1,2,3,4\}}_{\{1,2,3,4\} \backslash \mathcal{K}_1} : |\mathcal{K}_1| = 2\}$ into $r = 2$ segments $v_{q,n}^{(1)}$ and $v_{q,n}^{(2)}$, and associate them with the nodes such that Node 1 is responsible for sending $\{v_{4,3}^{(1)}, v_{5,2}^{(1)}, v_{6,1}^{(1)}\}$, Node 2 is responsible for sending $\{v_{2,5}^{(1)}, v_{3,4}^{(1)}, v_{6,1}^{(2)}\}$, Node 3 is responsible for sending $\{v_{1,6}^{(1)}, v_{3,4}^{(2)}, v_{5,2}^{(2)}\}$, and Node 4 is responsible for sending $\{v_{1,6}^{(2)}, v_{2,5}^{(2)}, v_{4,3}^{(2)}\}$. That is, as defined in (10):

$$\mathcal{U}^{\{1,2,3,4\}}_1 = \{v_{4,3}^{(1)}, v_{5,2}^{(1)}, v_{6,1}^{(1)}\}, \quad \mathcal{U}^{\{1,2,3,4\}}_2 = \{v_{2,5}^{(1)}, v_{3,4}^{(1)}, v_{6,1}^{(2)}\},$$
$$\mathcal{U}^{\{1,2,3,4\}}_3 = \{v_{1,6}^{(1)}, v_{3,4}^{(2)}, v_{5,2}^{(2)}\}, \quad \mathcal{U}^{\{1,2,3,4\}}_4 = \{v_{1,6}^{(2)}, v_{2,5}^{(2)}, v_{4,3}^{(2)}\}.$$

Then, for each $k \in \{1, 2, 3, 4\}$, Node $k$ multicasts 2 random linear combinations $C_1^k(\cdot,\cdot,\cdot)$, $C_2^k(\cdot,\cdot,\cdot)$ of the 3 data segments in $\mathcal{U}^{\{1,2,3,4\}}_k$, which are $C_i^1(v_{4,3}^{(1)}, v_{5,2}^{(1)}, v_{6,1}^{(1)})$, $C_i^2(v_{2,5}^{(1)}, v_{3,4}^{(1)}, v_{6,1}^{(2)})$, $C_i^3(v_{1,6}^{(1)}, v_{3,4}^{(2)}, v_{5,2}^{(2)})$, $C_i^4(v_{1,6}^{(2)}, v_{2,5}^{(2)}, v_{4,3}^{(2)})$, $i = 1, 2$. Having received the 2 coded segments $C_i^1(v_{4,3}^{(1)}, v_{5,2}^{(1)}, v_{6,1}^{(1)})$, $i = 1, 2$, Node 2 decodes $\{v_{4,3}^{(1)}, v_{5,2}^{(1)}\}$ by canceling $v_{6,1}^{(1)}$, which is known locally. Similarly, Node 3 decodes $\{v_{4,3}^{(1)}, v_{6,1}^{(1)}\}$, and Node 4 decodes $\{v_{5,2}^{(1)}, v_{6,1}^{(1)}\}$.
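To see concretely why 2 linear combinations of 3 segments suffice at each receiver, the sketch below (our illustration) mimics Node 2's decoding: it works over the prime field GF(257) with fixed coefficients standing in for CMR's random ones (the paper does not pin down the field), cancels the known segment, and solves a 2-by-2 system.

```python
# A sketch (our illustration) of the second-round decoding at Node 2: it
# receives two linear combinations of (v43, v52, v61) from Node 1, cancels
# the locally known segment v61, and solves 2 equations in 2 unknowns.
# We work over GF(p), p = 257; coefficients are fixed for reproducibility.
p = 257
v43, v52, v61 = 100, 200, 50                    # segment values (made up)

A = [[3, 1, 4], [1, 5, 9]]                      # combination coefficients
c = [(a[0]*v43 + a[1]*v52 + a[2]*v61) % p for a in A]   # Node 1's 2 multicasts

b = [(c[i] - A[i][2]*v61) % p for i in range(2)]        # cancel known v61
det = (A[0][0]*A[1][1] - A[0][1]*A[1][0]) % p           # 3*5 - 1*1 = 14
inv_det = pow(det, p - 2, p)                            # inverse mod prime p
v43_hat = ((A[1][1]*b[0] - A[0][1]*b[1]) * inv_det) % p
v52_hat = ((A[0][0]*b[1] - A[1][0]*b[0]) * inv_det) % p
assert (v43_hat, v52_hat) == (v43, v52)
print("Node 2 recovered both unknown segments")
```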

After the second round of data shuffling:

- Node 1 decodes $v_{1,6}$, $v_{2,5}$ and $v_{3,4}$,
- Node 2 decodes $v_{1,6}$, $v_{4,3}$ and $v_{5,2}$,
- Node 3 decodes $v_{2,5}$, $v_{4,3}$ and $v_{6,1}$,
- Node 4 decodes $v_{3,4}$, $v_{5,2}$ and $v_{6,1}$.

After the two rounds of coded data shuffling, it is not difficult to verify that every node decodes all the required inputs for its assigned Reduce functions from the received coded messages, with the help of its locally computed Map functions.

C. Communication Load

In Step 2 of the above shuffling scheme, since each random linear combination of segments has $\frac{T}{r}$ bits, a total of $|\mathcal{S}| \binom{|\mathcal{S}|-2}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2 \frac{T}{r}$ bits are communicated within $\mathcal{S}$. This is true for all subsets $\mathcal{S} \subseteq \{1, \ldots, K\}$ of size $\max\{r+1, s\} \leq |\mathcal{S}| \leq \min\{r+s, K\}$, and thus the communication load achieved by the CMR scheme is

$$L_{\text{CMR}}(r, s) = \frac{\sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \binom{K}{\ell} \, \ell \binom{\ell-2}{r-1} \binom{r}{\ell-s} \eta_1 \eta_2 \frac{T}{r}}{QNT} = \sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \frac{\ell \binom{K}{\ell} \binom{\ell-2}{r-1} \binom{r}{\ell-s}}{r \binom{K}{r} \binom{K}{s}}.$$

Example (Coded MapReduce: Communication Load). The shuffling scheme of our running example has two rounds, the first for the subsets of 3 nodes and the second for the set of all 4 nodes. In the first round, each node sends 3 XORed intermediate values, each of $T$ bits. In the second round, each node sends 2 random linear combinations of the segments, each of $\frac{T}{2}$ bits. Therefore, the communication load achieved by the CMR scheme is $\frac{4(3T + 2 \cdot \frac{T}{2})}{QNT} = \frac{16T}{36T} = \frac{4}{9}$, which matches the upper bound in Theorem 2 for $r = s = 2$.

D. Correctness of CMR

We demonstrate the correctness of the CMR scheme by showing: 1) each node can decode its needed intermediate values from the received random linear combinations of the segments; 2) after the Shuffle phase, every node has all the needed intermediate values for its assigned output functions. We start by making the following observations:

- For each $\mathcal{S}$ and $\mathcal{K}_1 \subseteq \mathcal{S}$ with $|\mathcal{K}_1| = r$, the set $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ defined in (9) contains intermediate values of $\binom{r}{|\mathcal{S}|-s} \eta_2$ output functions. This is because the output functions whose intermediate values are included in $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ should be computed exclusively by the nodes in $\mathcal{S} \backslash \mathcal{K}_1$ together with a subset of $s - (|\mathcal{S}| - r)$ nodes in $\mathcal{K}_1$. Therefore, $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ contains the intermediate values of a total of $\binom{r}{s-(|\mathcal{S}|-r)} \eta_2 = \binom{r}{|\mathcal{S}|-s} \eta_2$ output functions. Since every $r$ nodes map a unique batch of $\eta_1$ files, $|\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}| = \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2 T$ bits.
- For each $\mathcal{S}$ and Node $j \in \mathcal{S}$, since there are $\binom{|\mathcal{S}|-1}{r-1}$ subsets $\mathcal{K}_1 \subseteq \mathcal{S}$ of size $r$ that contain $j$, and every intermediate value in $\{\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1} : \mathcal{K}_1 \subseteq \mathcal{S}, |\mathcal{K}_1| = r, j \in \mathcal{K}_1\}$ has a segment associated with Node $j$, $\mathcal{U}^{\mathcal{S}}_j$ defined in (10) contains $\binom{|\mathcal{S}|-1}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$ segments of the intermediate values.

Now consider a pair of nodes in $\mathcal{S}$, say Node $j$ and Node $k$. Out of all the data segments in $\mathcal{U}^{\mathcal{S}}_j$, $\binom{|\mathcal{S}|-2}{r-2} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$ of them are also known at Node $k$ (since there are $\binom{|\mathcal{S}|-2}{r-2}$ size-$r$ subsets of $\mathcal{S}$ that contain both Node $j$ and Node $k$), and the rest are needed by Node $k$. Given the above observation that $\mathcal{U}^{\mathcal{S}}_j$ contains $\binom{|\mathcal{S}|-1}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$ segments, Node $j$ sending

$$\left[\binom{|\mathcal{S}|-1}{r-1} - \binom{|\mathcal{S}|-2}{r-2}\right] \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2 = \binom{|\mathcal{S}|-2}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$$

random linear combinations of all the segments in $\mathcal{U}^{\mathcal{S}}_j$ (Step 2) suffices to deliver the $\binom{|\mathcal{S}|-2}{r-1} \binom{r}{|\mathcal{S}|-s} \eta_1 \eta_2$ segments needed by Node $k$. Moreover, these linear combinations sent by Node $j$ simultaneously deliver the needed segments to Node $k$ for all $k \in \mathcal{S} \backslash \{j\}$, and this is true for all $j \in \mathcal{S}$.

To see how the CMR shuffling scheme delivers all required inputs of the Reduce functions, we assume WLOG that the Reduce function $h_1$ is to be computed by Node 1. Then Node 1 will need $\binom{K-1}{r} \eta_1$ intermediate values of the output function $\phi_1$ from the other nodes (it already knows $\frac{rN}{K} = N - \binom{K-1}{r} \eta_1$ intermediate values of $\phi_1$ by mapping the files in $\mathcal{M}_1$).
By the assignment of the Reduce functions, there exists a subset $\mathcal{S}_2$ of size $s$ containing Node 1 such that all nodes in $\mathcal{S}_2$ need to compute $h_1$. Then, during the data shuffling for each subset $\mathcal{S}$ containing $\mathcal{S}_2$ (note that, by the definition of $\mathcal{V}^{\mathcal{S}}_{\mathcal{S} \backslash \mathcal{K}_1}$ in (9), the intermediate values of $\phi_1$ will not be communicated to Node 1 if $\mathcal{S}_2 \not\subseteq \mathcal{S}$, because some node outside $\mathcal{S}$ also wants to compute $h_1$), there are $\binom{s-1}{|\mathcal{S}|-r-1}$ subsets $\mathcal{K}_1$ of $\mathcal{S}$ with size $|\mathcal{K}_1| = r$ such that $1 \notin \mathcal{K}_1$ and $\mathcal{S} \backslash \mathcal{K}_1 \subseteq \mathcal{S}_2$, and thus Node 1 decodes $\binom{s-1}{|\mathcal{S}|-r-1} \eta_1$ distinct intermediate values of $\phi_1$. Therefore, the total number of distinct intermediate values of $\phi_1$ that Node 1 decodes over the entire Shuffle phase is

$$\sum_{\ell=\max\{r+1,s\}}^{\min\{r+s,K\}} \binom{K-s}{\ell-s} \binom{s-1}{\ell-r-1} \eta_1 = \binom{K-1}{r} \eta_1, \qquad (11)$$

which matches the required number of intermediate values.
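The identity in (11) is an instance of Vandermonde's convolution; the quick check below (our illustration) verifies it numerically over a range of parameters.

```python
# A quick numerical check (our illustration) of the counting identity (11):
# summing over subset sizes l, the (K-s choose l-s)(s-1 choose l-r-1) batches
# of eta_1 files add up to the (K-1 choose r) values Node 1 must receive.
from math import comb

for K in range(2, 12):
    for r in range(1, K):                # r < K (r = K needs no shuffling)
        for s in range(1, K + 1):
            lhs = sum(comb(K - s, l - s) * comb(s - 1, l - r - 1)
                      for l in range(max(r + 1, s), min(r + s, K) + 1))
            assert lhs == comb(K - 1, r), (K, r, s)
print("identity (11) holds for all tested (K, r, s)")
```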

Remark 8. The idea of efficiently creating and exploiting coded multicasting was initially proposed in the context of cache networks in [12], [13], and extended in [14], [15], where caches pre-fetch part of the content in a way that enables coding during the content delivery, minimizing the network traffic. On top of such coding opportunities due to the redundancy of the locally known data, when $s > 1$ (each Reduce function is computed by more than one node), CMR also exploits the naive multicasting opportunities where sending an uncoded intermediate value over the shared link can be simultaneously useful for up to $s$ nodes, further reducing the communication load.

V. CONVERSE OF THEOREM 1

In this section, we prove the lower bound on $L^*(r)$ in Theorem 1. As the first step, we present another lower bound on $L^*(r)$ in the following lemma.

Lemma 1. $L^*(r) \geq \max_{t \in \{1, \ldots, K-1\}} \frac{t}{K-t}\left(1 - \frac{tr}{K}\right)$.

Proof. We prove Lemma 1 by applying cut-set bounds on the compound extensions of multiple assignments of the Reduce functions. Specifically, we first partition the indices of the $Q$ Reduce functions evenly into $K$ groups $\mathcal{G}_1, \ldots, \mathcal{G}_K$ such that $\mathcal{G}_i = \{\frac{(i-1)Q}{K} + 1, \ldots, \frac{iQ}{K}\}$, for all $i \in \{1, \ldots, K\}$. We define an assignment of the Reduce functions $\mathcal{A} = (\mathcal{W}_1^{\mathcal{A}}, \ldots, \mathcal{W}_K^{\mathcal{A}})$, where $\mathcal{W}_k^{\mathcal{A}} \subseteq \{1, \ldots, Q\}$ is the set of indices of the Reduce functions computed by Node $k$ as specified by $\mathcal{A}$. We denote the message sent by Node $k$ during the Shuffle phase under the assignment $\mathcal{A}$ by $X_k^{\mathcal{A}}$, with a total of $R_k^{\mathcal{A}} T$ bits, for all $k \in \{1, \ldots, K\}$. For a particular integer $t \in \{1, \ldots, K-1\}$ (for ease of exposition we assume $t$ divides $K$; the general case follows similarly), we consider the following $\frac{K}{t}$ assignments of the Reduce functions, which are circular shifts of $(\mathcal{G}_1, \ldots, \mathcal{G}_K)$ with step size $t$ and which all result in the same communication load (Remark 2):

$$\mathcal{A}_1 = (\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_K),$$
$$\mathcal{A}_2 = (\mathcal{G}_{t+1}, \mathcal{G}_{t+2}, \ldots, \mathcal{G}_t),$$
$$\vdots$$
$$\mathcal{A}_{\frac{K}{t}} = (\mathcal{G}_{(\frac{K}{t}-1)t+1}, \mathcal{G}_{(\frac{K}{t}-1)t+2}, \ldots, \mathcal{G}_{(\frac{K}{t}-1)t}). \qquad (12)$$

Now we consider the compound setting of all these $\frac{K}{t}$ assignments at the nodes $1, \ldots, t$. Since $\bigcup_{j=1,\ldots,\frac{K}{t}} \bigcup_{k=1,\ldots,t} \mathcal{W}_k^{\mathcal{A}_j} = \{1, \ldots, Q\}$, Nodes $1, \ldots, t$ altogether can recover the inputs for all $Q$ Reduce functions using their computed Map functions and the messages received from the other nodes, i.e., $t+1, \ldots, K$. Thus we can consider the cut separating $\left(\{\vec{g}_n : w_n \in \mathcal{M}_k\}_{k=1,\ldots,t}, \{X_k^{\mathcal{A}_1}, \ldots, X_k^{\mathcal{A}_{K/t}} : k \in \{t+1, \ldots, K\}\}\right)$ from $\{v_{q,n} : q \in \{1, \ldots, Q\}, n \in \{1, \ldots, N\}\}$, and we have

$$Q \sum_{k=1}^{t} |\mathcal{M}_k| T + \sum_{j=1}^{K/t} \sum_{k=t+1}^{K} R_k^{\mathcal{A}_j} T \geq QNT. \qquad (13)$$

Similarly, for the subsets of $t$ nodes $\mathcal{N}_i \triangleq \{i, (i+1) \bmod K, \ldots, (i+t-1) \bmod K\}$, $i = 1, \ldots, K$, we have the following cut-set bound for each $i \in \{1, \ldots, K\}$:

$$Q \sum_{k \in \mathcal{N}_i} |\mathcal{M}_k| T + \sum_{j=1}^{K/t} \sum_{k \notin \mathcal{N}_i} R_k^{\mathcal{A}_j} T \geq QNT. \qquad (14)$$

Summing up these $K$ cut-set bounds, we have

$$tQ \sum_{k=1}^{K} |\mathcal{M}_k| T + \sum_{j=1}^{K/t} \sum_{i=1}^{K} \sum_{k \notin \mathcal{N}_i} R_k^{\mathcal{A}_j} T \geq K QNT, \qquad (15)$$

$$tQrNT + (K-t) \sum_{j=1}^{K/t} \sum_{k=1}^{K} R_k^{\mathcal{A}_j} T \overset{(a)}{\geq} K QNT, \qquad (16)$$

$$\frac{tr}{K} + \frac{K-t}{t} L \overset{(b)}{\geq} 1, \qquad (17)$$

where (a) follows from the definition of the computation load and the fact that each $R_k^{\mathcal{A}_j}$ appears $K - t$ times in the sum, and (b) holds because, again by Remark 2, the communication load is independent of the assignment of the Reduce functions, so $\sum_{k=1}^{K} R_k^{\mathcal{A}_j} T = L \cdot QNT$ for every $j$. Since (17) holds for every such $t$, we have

$$L^*(r) \geq \max_{t \in \{1, \ldots, K-1\}} \frac{t}{K-t}\left(1 - \frac{tr}{K}\right), \qquad (18)$$

which establishes Lemma 1.
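As a numerical sanity check (our illustration), the script below evaluates the bound of Lemma 1 together with the achievable load $L_c(r)$ and confirms that their ratio stays below $3 + \sqrt{5} \approx 5.24$.

```python
# A sanity check (our illustration) of Lemma 1 against the achievable load
# L_c(r) = (1/r)(1 - r/K): their ratio should never exceed 3 + sqrt(5).
from math import sqrt

def lower_bound(r, K):
    # max over t in {1,...,K-1} of (t/(K-t)) * (1 - t*r/K), eq. (18)
    return max((t / (K - t)) * (1 - t * r / K) for t in range(1, K))

def L_c(r, K):
    return (1 / r) * (1 - r / K)

worst = 0.0
for K in range(2, 40):
    for r in range(1, K):                # r = K gives zero load on both sides
        worst = max(worst, L_c(r, K) / lower_bound(r, K))
print(round(worst, 3), "<", round(3 + sqrt(5), 3))
```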

Next, we proceed to prove the lower bound on $L^*(r)$ in Theorem 1.

Proof. By Lemma 1 we have

$$\frac{\frac{1}{r}\left(1 - \frac{r}{K}\right)}{L^*(r)} \leq \frac{\frac{1}{r}\left(1 - \frac{r}{K}\right)}{\max_{t} \frac{t}{K-t}\left(1 - \frac{tr}{K}\right)}. \qquad (19)$$

We recall that $r \in \{1, \ldots, K\}$ and continue to bound the RHS of (19) in the following two regions:

1) $r \geq \frac{3-\sqrt{5}}{4} K$: We set $t = 1$, and (19) becomes

$$\frac{\frac{1}{r}\left(1 - \frac{r}{K}\right)}{L^*(r)} \leq \frac{1}{r}\left(1 - \frac{r}{K}\right) \cdot \frac{K-1}{1 - \frac{r}{K}} = \frac{K-1}{r} < \frac{K}{r} \leq \frac{4}{3 - \sqrt{5}} = 3 + \sqrt{5}. \qquad (20)$$

2) $r < \frac{3-\sqrt{5}}{4} K$: We set $t = \lfloor \frac{K}{2r} \rfloor \geq 1$ in (19). Since $t \leq \frac{K}{2r}$, we have $1 - \frac{tr}{K} \geq \frac{1}{2}$, and since $t \geq \frac{K}{2r} - 1 = \frac{K-2r}{2r}$, we have $rt \geq \frac{K-2r}{2}$. Hence

$$\frac{\frac{1}{r}\left(1 - \frac{r}{K}\right)}{L^*(r)} \leq \frac{(K-r)(K-t)}{rt(K-tr)} < \frac{(K-r) \cdot K}{\frac{K-2r}{2} \cdot \frac{K}{2}} = \frac{4(K-r)}{K-2r} \leq 3 + \sqrt{5}, \qquad (21)$$

where the last inequality holds because $\frac{4(K-r)}{K-2r}$ is increasing in $r$ and equals $3 + \sqrt{5}$ at $r = \frac{3-\sqrt{5}}{4} K$.

Combining (20) and (21) completes the proof, i.e., $L^*(r) > \frac{1}{3+\sqrt{5}} \cdot \frac{1}{r}\left(1 - \frac{r}{K}\right)$.

VI. CONCLUSION

We introduce a distributed computing framework that decomposes the computation job into computing a set of Map and Reduce functions, and formulate the tradeoff between computation and communication within this framework. We propose Coded MapReduce (CMR), a coded scheme that achieves the optimal tradeoff to within a constant factor for all system parameters, where the achieved communication load is inverse-linearly proportional to the computation load. Compared with the uncoded scheme, CMR reduces the communication load by a factor that is linearly proportional to the computation load. We also prove a lower bound on the minimum communication load, and demonstrate that CMR achieves the lower bound to within a constant factor.

REFERENCES

[1] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Sixth USENIX Symposium on Operating System Design and Implementation, Dec. 2004.
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: cluster computing with working sets, in Proceedings of the 2nd USENIX HotCloud, vol. 10, June 2010, p. 10.
[3] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, Tarazu: optimizing MapReduce on heterogeneous clusters, in ACM SIGARCH Computer Architecture News, vol. 40, no. 1, Mar. 2012.
[4] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, Managing data transfers in computer clusters with Orchestra, ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, Aug. 2011.
[5] A. C.-C. Yao, Some complexity questions related to distributive computing (preliminary report), in Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, Apr. 1979.
[6] J. Körner and K. Marton, How to encode the modulo-two sum of binary sources, IEEE Transactions on Information Theory, vol. 25, no. 2, Mar. 1979.
[7] A. Orlitsky and A. El Gamal, Average and randomized communication complexity, IEEE Transactions on Information Theory, vol. 36, no. 1, pp. 3-16, Jan. 1990.
[8] E. Kushilevitz and N. Nisan, Communication Complexity. Cambridge University Press, 1997.
[9] A. Orlitsky and J. Roche, Coding for computing, IEEE Transactions on Information Theory, vol. 47, no. 3, Mar. 2001.
[10] B. Nazer and M. Gastpar, Computation over multiple-access channels, IEEE Transactions on Information Theory, vol. 53, no. 10, Oct. 2007.
[11] A. Ramamoorthy and M. Langberg, Communicating the sum of sources over a network, IEEE Journal on Selected Areas in Communications, vol. 31, no. 4, Apr. 2013.
[12] M. A. Maddah-Ali and U. Niesen, Fundamental limits of caching, IEEE Transactions on Information Theory, vol. 60, no. 5, Mar. 2014.
[13] M. A. Maddah-Ali and U. Niesen, Decentralized coded caching attains order-optimal memory-rate tradeoff, IEEE/ACM Transactions on Networking, Apr. 2014.

[14] M. Ji, G. Caire, and A. F. Molisch, Fundamental limits of caching in wireless D2D networks, IEEE Transactions on Information Theory, vol. 62, no. 2, Feb. 2016.
[15] N. Karamchandani, U. Niesen, M. A. Maddah-Ali, and S. Diggavi, Hierarchical coded caching, IEEE International Symposium on Information Theory, June 2014.
[16] A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011.
[17] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, Coded MapReduce, 53rd Annual Allerton Conference on Communication, Control, and Computing, Sept. 2015.


More information

Achieving Shannon Capacity Region as Secrecy Rate Region in a Multiple Access Wiretap Channel

Achieving Shannon Capacity Region as Secrecy Rate Region in a Multiple Access Wiretap Channel Achieving Shannon Capacity Region as Secrecy Rate Region in a Multiple Access Wiretap Channel Shahid Mehraj Shah and Vinod Sharma Department of Electrical Communication Engineering, Indian Institute of

More information

Approximate capacity region of the two-pair bidirectional Gaussian relay network

Approximate capacity region of the two-pair bidirectional Gaussian relay network Approximate capacity region of the two-pair bidirectional Gaussian relay network Aydin Sezgin UC Irvine CA USA asezgin@uci.edu M. Amin Khajehnejad Caltech Pasadena CA USA amin@caltech.edu A. Salman Avestimehr

More information

Sum Capacity of General Deterministic Interference Channel with Channel Output Feedback

Sum Capacity of General Deterministic Interference Channel with Channel Output Feedback Sum Capacity of General Deterministic Interference Channel with Channel Output Feedback Achaleshwar Sahai Department of ECE, Rice University, Houston, TX 775. as27@rice.edu Vaneet Aggarwal Department of

More information

Scheduling Coflows of Multi-stage Jobs to Minimize the Total Weighted Job Completion Time

Scheduling Coflows of Multi-stage Jobs to Minimize the Total Weighted Job Completion Time IEEE INFOCOM 8 - IEEE Conference on Computer Communications Scheduling Coflows of Multi-stage Jobs to Minimize the Total Weighted Job Completion Time Bingchuan Tian, Chen Tian, Haipeng Dai, and Bingquan

More information

Report on PIR with Low Storage Overhead

Report on PIR with Low Storage Overhead Report on PIR with Low Storage Overhead Ehsan Ebrahimi Targhi University of Tartu December 15, 2015 Abstract Private information retrieval (PIR) protocol, introduced in 1995 by Chor, Goldreich, Kushilevitz

More information

Approximate Ergodic Capacity of a Class of Fading Networks

Approximate Ergodic Capacity of a Class of Fading Networks Approximate Ergodic Capacity of a Class of Fading Networks Sang-Woon Jeon, Chien-Yi Wang, and Michael Gastpar School of Computer and Communication Sciences EPFL Lausanne, Switzerland {sangwoon.jeon, chien-yi.wang,

More information

Fundamental Limits on Delivery Time in Cloud- and Cache-Aided Heterogeneous Networks

Fundamental Limits on Delivery Time in Cloud- and Cache-Aided Heterogeneous Networks DRAFT Fundamental Limits on Delivery Time in Cloud- and Cache-Aided Heterogeneous Networks Jaber Kakar Member IEEE Soheil Gherekhloo Member IEEE arxiv:706.0767v [cs.it] 3 Jun 07 and Aydin Sezgin Senior

More information

Integer-Forcing Linear Receiver Design over MIMO Channels

Integer-Forcing Linear Receiver Design over MIMO Channels Globecom 0 - Signal Processing for Communications Symposium Integer-Forcing inear Receiver Design over MIMO Channels ili Wei, Member, IEEE and Wen Chen, Senior Member, IEEE Department of Electronic Engineering,

More information

Distributed storage systems from combinatorial designs

Distributed storage systems from combinatorial designs Distributed storage systems from combinatorial designs Aditya Ramamoorthy November 20, 2014 Department of Electrical and Computer Engineering, Iowa State University, Joint work with Oktay Olmez (Ankara

More information

Correcting Bursty and Localized Deletions Using Guess & Check Codes

Correcting Bursty and Localized Deletions Using Guess & Check Codes Correcting Bursty and Localized Deletions Using Guess & Chec Codes Serge Kas Hanna, Salim El Rouayheb ECE Department, Rutgers University serge..hanna@rutgers.edu, salim.elrouayheb@rutgers.edu Abstract

More information

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel

Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Secure Degrees of Freedom of the MIMO Multiple Access Wiretap Channel Pritam Mukherjee Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College Park, MD 074 pritamm@umd.edu

More information

Iterative Encoding of Low-Density Parity-Check Codes

Iterative Encoding of Low-Density Parity-Check Codes Iterative Encoding of Low-Density Parity-Check Codes David Haley, Alex Grant and John Buetefuer Institute for Telecommunications Research University of South Australia Mawson Lakes Blvd Mawson Lakes SA

More information

Binary MDS Array Codes with Optimal Repair

Binary MDS Array Codes with Optimal Repair IEEE TRANSACTIONS ON INFORMATION THEORY 1 Binary MDS Array Codes with Optimal Repair Hanxu Hou, Member, IEEE, Patric P. C. Lee, Senior Member, IEEE Abstract arxiv:1809.04380v1 [cs.it] 12 Sep 2018 Consider

More information

Weakly Secure Data Exchange with Generalized Reed Solomon Codes

Weakly Secure Data Exchange with Generalized Reed Solomon Codes Weakly Secure Data Exchange with Generalized Reed Solomon Codes Muxi Yan, Alex Sprintson, and Igor Zelenko Department of Electrical and Computer Engineering, Texas A&M University Department of Mathematics,

More information

CS Communication Complexity: Applications and New Directions

CS Communication Complexity: Applications and New Directions CS 2429 - Communication Complexity: Applications and New Directions Lecturer: Toniann Pitassi 1 Introduction In this course we will define the basic two-party model of communication, as introduced in the

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

Power Allocation and Coverage for a Relay-Assisted Downlink with Voice Users

Power Allocation and Coverage for a Relay-Assisted Downlink with Voice Users Power Allocation and Coverage for a Relay-Assisted Downlink with Voice Users Junjik Bae, Randall Berry, and Michael L. Honig Department of Electrical Engineering and Computer Science Northwestern University,

More information

The Capacity of the Semi-Deterministic Cognitive Interference Channel and its Application to Constant Gap Results for the Gaussian Channel

The Capacity of the Semi-Deterministic Cognitive Interference Channel and its Application to Constant Gap Results for the Gaussian Channel The Capacity of the Semi-Deterministic Cognitive Interference Channel and its Application to Constant Gap Results for the Gaussian Channel Stefano Rini, Daniela Tuninetti, and Natasha Devroye Department

More information

Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal

Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal Matthew Johnston, Eytan Modiano Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge,

More information

Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime

Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime Solmaz Torabi Dept. of Electrical and Computer Engineering Drexel University st669@drexel.edu Advisor:

More information

Linear Exact Repair Rate Region of (k + 1, k, k) Distributed Storage Systems: A New Approach

Linear Exact Repair Rate Region of (k + 1, k, k) Distributed Storage Systems: A New Approach Linear Exact Repair Rate Region of (k + 1, k, k) Distributed Storage Systems: A New Approach Mehran Elyasi Department of ECE University of Minnesota melyasi@umn.edu Soheil Mohajer Department of ECE University

More information

An Ins t Ins an t t an Primer

An Ins t Ins an t t an Primer An Instant Primer Links from Course Web Page Network Coding: An Instant Primer Fragouli, Boudec, and Widmer. Network Coding an Introduction Koetter and Medard On Randomized Network Coding Ho, Medard, Shi,

More information

Codes for Partially Stuck-at Memory Cells

Codes for Partially Stuck-at Memory Cells 1 Codes for Partially Stuck-at Memory Cells Antonia Wachter-Zeh and Eitan Yaakobi Department of Computer Science Technion Israel Institute of Technology, Haifa, Israel Email: {antonia, yaakobi@cs.technion.ac.il

More information

Polar Codes for Sources with Finite Reconstruction Alphabets

Polar Codes for Sources with Finite Reconstruction Alphabets Polar Codes for Sources with Finite Reconstruction Alphabets Aria G. Sahebi and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 4809,

More information

Section 3 Error Correcting Codes (ECC): Fundamentals

Section 3 Error Correcting Codes (ECC): Fundamentals Section 3 Error Correcting Codes (ECC): Fundamentals Communication systems and channel models Definition and examples of ECCs Distance For the contents relevant to distance, Lin & Xing s book, Chapter

More information

A Case for Performance-Centric Network Allocation

A Case for Performance-Centric Network Allocation A Case for Performance-Centric Network Allocation Gautam Kumar, Mosharaf Chowdhury, Sylvia Ratnasamy, Ion Stoica UC Berkeley Datacenter Applications 3 Data Parallelism Applications execute in several computation

More information

Feedback Capacity of the Gaussian Interference Channel to Within Bits: the Symmetric Case

Feedback Capacity of the Gaussian Interference Channel to Within Bits: the Symmetric Case 1 arxiv:0901.3580v1 [cs.it] 23 Jan 2009 Feedback Capacity of the Gaussian Interference Channel to Within 1.7075 Bits: the Symmetric Case Changho Suh and David Tse Wireless Foundations in the Department

More information

Bounds on the OBDD-Size of Integer Multiplication via Universal Hashing

Bounds on the OBDD-Size of Integer Multiplication via Universal Hashing Bounds on the OBDD-Size of Integer Multiplication via Universal Hashing Philipp Woelfel Dept. of Computer Science University Dortmund D-44221 Dortmund Germany phone: +49 231 755-2120 fax: +49 231 755-2047

More information