A Fundamental Storage-Communication Tradeoff in Distributed Computing with Straggling Nodes

Qifa Yan, Michèle Wigger
LTCI, Télécom ParisTech, 75013 Paris, France
Email: {qifa.yan, michele.wigger}@telecom-paristech.fr

Sheng Yang
L2S, CentraleSupélec, 91190 Gif-sur-Yvette, France
Email: sheng.yang@centralesupelec.fr

Xiaohu Tang
Information Security and National Computing Grid Laboratory, Southwest Jiaotong University, 611756, Chengdu, Sichuan, China
Email: xhutang@swjtu.edu.cn

arXiv:1901.07793v1 [cs.IT] 23 Jan 2019

Abstract—The optimal storage-communication tradeoff is characterized for a MapReduce-like distributed computing system with straggling nodes, where only a part of the nodes can be utilized to compute the desired output functions. The result holds for arbitrary output functions and thus generalizes previous results that were restricted to linear functions. Specifically, in this work, we propose a new information-theoretic converse and a new matching coded computing scheme, which we call coded computing for straggling systems (CCS).

I. INTRODUCTION

Distributed computing has emerged as one of the most important paradigms to speed up large-scale data analysis tasks such as machine learning. A well-known computing framework, MapReduce, can handle tasks with large data sizes. In such systems, the tasks are typically decomposed into computations of map and reduce functions, where the map functions can be computed by different nodes across the network, and the final outputs are obtained by combining the outputs of the map functions by means of reduce functions. Such architectures can be used, for example, to perform learning tasks in neural networks [1]. Recently, Li et al. proposed a scheme named coded distributed computing (CDC), which through clever coding reduces the communication load for data shuffling between the map and reduce phases [2]. The CDC scheme achieves the optimal storage-communication tradeoff, i.e., the smallest communication load under a total memory constraint.
More recently, this result was extended in [3], [4] to also account for the computation load during the map phase. In this paper, we consider a variant of this problem where each node takes a random time to compute its assigned map functions. In this case, waiting until all nodes have finished their computations can be too time consuming. Instead, the data shuffling and reduce computations are started as soon as any set of $Q$ nodes, for $Q$ a fixed constant in $\{1, \ldots, K\}$, with $K$ the total number of nodes, has terminated the map computations. One of the difficulties in such systems is that, when assigning the map computations to the nodes, it is yet unclear which set of nodes will perform the data shuffling and compute the reduce functions. In this sense, the assignment of map computations needs to be convenient irrespective of the set of nodes that computes the reduce functions. This setup with straggling nodes, hereafter referred to as the straggling system, was introduced in [5]. That work used maximum distance separable (MDS) codes to set up coded computing. Similar codes were also used in [6], [7], and it was shown that they achieve the order-optimal storage-communication tradeoff in such straggling systems when the global task is to compute linear functions. Many important tasks in practice, such as computations in neural networks, are however non-linear. This motivates us to investigate the MapReduce framework with straggling nodes for general map and reduce functions. In this work, we completely characterize the optimal storage-communication tradeoff for systems with straggling nodes. This optimal tradeoff is achieved by our new scheme, which we name coded computing for straggling systems (CCS). We also present an information-theoretic converse matching our new CCS scheme. Related to the present work are also distributed computing schemes with straggling nodes for master-worker network models, where the worker nodes perform the map computations and the master node the reduce computations.
Specifically, [8]–[10] proposed coding schemes for systems with linear map and reduce functions, and [11] proposed coding schemes for calculating the derivative of likelihood functions.

Notation: For positive integers $n, k$ with $k \le n$, we use the notations $C_n^k \triangleq \frac{n!}{k!(n-k)!}$, $[n] \triangleq \{1, 2, \ldots, n\}$, and $[k:n] \triangleq \{k, k+1, \ldots, n\}$. The binary field is denoted by $\mathbb{F}_2$, and the $n$-dimensional vector space over $\mathbb{F}_2$ is denoted by $\mathbb{F}_2^n$. The operator $|\cdot|$ is used in the following way: for a set $A$, we use $|A|$ to denote its cardinality, while for a signal $X$, we use $|X|$ to denote its length measured in bits. The notation $\Pr\{\cdot\}$ denotes the probability of an event.

II. SYSTEM MODEL

Let $N, K, Q, D, U, V, W$ be positive integers. Consider a system that aims to compute $D$ output functions through $K$ distributed computing nodes, denoted by $[K] = \{1, \ldots, K\}$. Each output function $\phi_d$ ($d \in [D]$) takes all $N$ files in the library $\mathcal{W} \triangleq \{w_1, \ldots, w_N\}$ as inputs, where $w_n$ is a file of size $W$ bits, $n \in [N]$, and outputs a bit stream of length $U$, i.e.,
\[
u_d = \phi_d(w_1, \ldots, w_N) \in \mathbb{F}_2^U, \tag{1}
\]
where $\phi_d : \mathbb{F}_2^{NW} \to \mathbb{F}_2^U$. Assume that the computation of each output function $\phi_d$ can be decomposed as
\[
\phi_d(w_1, \ldots, w_N) = h_d\big(f_{d,1}(w_1), \ldots, f_{d,N}(w_N)\big), \tag{2}
\]
where:
- The map function $f_{d,n} : \mathbb{F}_2^{W} \to \mathbb{F}_2^{V}$ maps the file $w_n$ into a binary stream of length $V$, called an intermediate value (IVA), i.e.,
\[
v_{d,n} \triangleq f_{d,n}(w_n) \in \mathbb{F}_2^{V}, \quad n \in [N]. \tag{3}
\]
- The reduce function $h_d : \mathbb{F}_2^{NV} \to \mathbb{F}_2^{U}$ maps the IVAs $\{v_{d,n}\}_{n=1}^{N}$ into the output stream
\[
u_d = h_d(v_{d,1}, \ldots, v_{d,N}). \tag{4}
\]

Hence, the computation is carried out in three phases.

1) Map Phase: Each node $k$ stores a subset of files $\mathcal{M}_k \subseteq \mathcal{W}$, and tries to compute all the IVAs obtainable from the files in $\mathcal{M}_k$, denoted $\mathcal{C}_k$:
\[
\mathcal{C}_k \triangleq \{v_{d,n} : d \in [D],\ w_n \in \mathcal{M}_k\}. \tag{5}
\]
Each node takes a random time to compute its corresponding IVAs. To limit the latency of the system, the distributed computing scheme proceeds with the shuffle and reduce phases as soon as a fixed number $Q$ of nodes has terminated the map computations. This subset of nodes will be called the active set. Notice that $Q$ is a fixed system parameter. Moreover, for simplicity, we assume that each subset of size $Q$ is active with the same probability. Define
\[
\Omega_t \triangleq \{\mathcal{T} \subseteq [K] : |\mathcal{T}| = t\}, \quad t \in [K], \tag{6}
\]
and let $\mathcal{Q}$ denote the active set. Then,
\[
\Pr\{\mathcal{Q} = \mathcal{A}\} = \frac{1}{C_K^Q}, \quad \mathcal{A} \in \Omega_Q. \tag{7}
\]
The output functions $\phi_1, \ldots, \phi_D$ are then uniformly assigned to the nodes in $\mathcal{A}$. Denote the indices of the output functions assigned to a given node $k \in \mathcal{A}$ by $\mathcal{D}_k$.¹ Thus, $\mathcal{D}_{\mathcal{A}} \triangleq \{\mathcal{D}_k\}_{k \in \mathcal{A}}$ forms a partition of $[D]$, and each $\mathcal{D}_k$ is of cardinality $D/Q$. Denote the set of all such uniform partitions of $[D]$ by $\mathcal{P}_{\mathcal{A}}$.

2) Shuffle Phase: The nodes in $\mathcal{A}$ proceed to exchange their computed IVAs. Each node $k \in \mathcal{A}$ multicasts a signal
\[
X_k^{\mathcal{A}} = \varphi_k^{\mathcal{A}}(\mathcal{C}_k, \mathcal{D}_{\mathcal{A}}) \tag{8}
\]
to all other nodes in $\mathcal{A}$, where $\varphi_k^{\mathcal{A}} : \mathbb{F}_2^{|\mathcal{C}_k| V} \times \mathcal{P}_{\mathcal{A}} \to \mathbb{F}_2^{|X_k^{\mathcal{A}}|}$ is the encoding function of node $k$. Thus, each active node $k$ receives the signals $X^{\mathcal{A}} \triangleq \{X_{k'}^{\mathcal{A}} : k' \in \mathcal{A}\}$ error-free.

3) Reduce Phase: With the received signals $X^{\mathcal{A}}$ exchanged during the shuffle phase and the IVAs $\mathcal{C}_k$ computed locally during the map phase, node $k \in \mathcal{A}$ restores all the IVAs
\[
\{v_{d,1}, \ldots, v_{d,N}\}_{d \in \mathcal{D}_k} = \psi_k^{\mathcal{A}}(X^{\mathcal{A}}, \mathcal{C}_k, \mathcal{D}_{\mathcal{A}}), \tag{9}
\]
where $\psi_k^{\mathcal{A}} : \mathbb{F}_2^{\sum_{k' \in \mathcal{A}} |X_{k'}^{\mathcal{A}}|} \times \mathbb{F}_2^{|\mathcal{C}_k| V} \times \mathcal{P}_{\mathcal{A}} \to \mathbb{F}_2^{(D/Q) N V}$ is the decoding function of node $k$. Subsequently, it proceeds to compute
\[
u_d = h_d(v_{d,1}, \ldots, v_{d,N}), \quad d \in \mathcal{D}_k. \tag{10}
\]

¹ Here we assume for simplicity that $Q$ divides $D$. Note that otherwise we can always add empty functions for the assumption to hold.

Remark 1: Notice that a decomposition into map and reduce functions is always possible. In fact, trivially, one can set the map and reduce functions to be the identity and output functions, respectively, i.e., $f_{d,n}(w_n) = w_n$ and $h_d = \phi_d$, $n \in [N]$, $d \in [D]$, in which case $V = W$. However, to mitigate the communication cost in the shuffle phase, one would prefer a decomposition such that the length of the IVAs is small but still suffices to compute the final outputs.

To measure the storage and communication costs, we introduce the following definitions.

Definition 1 (Storage Space): The storage space $r$ is defined as the total number of files stored across the $K$ nodes, normalized by the total number of files $N$:
\[
r \triangleq \frac{\sum_{k=1}^{K} |\mathcal{M}_k|}{N}. \tag{11}
\]

Definition 2 (Communication Load): The communication load $L$ is defined as the average total number of bits sent in the shuffle phase, normalized by the total number of bits of all intermediate values, i.e.,
\[
L \triangleq \mathbb{E}\left[\frac{\sum_{k \in \mathcal{Q}} |X_k^{\mathcal{Q}}|}{D N V}\right], \tag{12}
\]
where the expectation is taken over the active set $\mathcal{Q}$.

Definition 3 (Storage-Communication Tradeoff): A pair of real numbers $(r, L)$ is feasible if for any $\epsilon > 0$, there exist an integer $N$, a storage design $\{\mathcal{M}_k\}_{k=1}^{K}$ of storage space less than $r + \epsilon$, a set of uniform assignments of output functions $\{\mathcal{D}_{\mathcal{A}}\}_{\mathcal{A} \in \Omega_Q}$, and a collection of encoding and decoding functions $\{\{\varphi_k^{\mathcal{A}}, \psi_k^{\mathcal{A}}\}_{k \in \mathcal{A}}\}_{\mathcal{A} \in \Omega_Q}$ with communication load less than $L + \epsilon$, such that all the output functions $\phi_1, \ldots, \phi_D$ can be computed successfully. For a fixed $Q \in [K]$, we define the storage-communication tradeoff as
\[
L_Q^\star(r) \triangleq \inf\{L : (r, L)\ \text{is feasible}\}. \tag{13}
\]

The goal of this paper is to characterize $L_Q^\star(r)$ for $r \in [K-Q+1, K]$, for each $Q \in [K]$. Notice that the interesting interval for $r$ is indeed $[K-Q+1, K]$. In fact, for $r > K$, each node can trivially store all the files. Moreover, a pair $(r, L)$ can only be feasible if each file is stored at least once in any subset of $Q$ nodes. If $r < K-Q+1$, then this is not possible.

III.
MAIN RESULT

We summarize our main result in the following theorem.

Theorem 1: For a distributed computing system with $K$ nodes and any fixed $Q \in [K]$, for an integer storage space $r \in [K-Q+1 : K]$,
\[
L_Q^\star(r) = \sum_{\ell = r+Q-K}^{\min\{r,\, Q-1\}} \frac{Q-\ell}{Q\ell} \cdot \frac{C_Q^{\ell}\, C_{K-Q}^{r-\ell}}{C_K^{r}}. \tag{14}
\]
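Equation (14) can be checked numerically. The sketch below is our own illustration, not from the paper: the helper name `L_star` and the floating-point evaluation are ours. It evaluates the sum in (14) and verifies two special cases discussed next: for $Q = K$ it collapses to the single term $\ell = r$, recovering the CDC tradeoff $\frac{1}{r}(1 - \frac{r}{K})$ of [2], and for $Q = 1$ the sum is empty, giving the single point $(K, 0)$.

```python
from math import comb

def L_star(K: int, Q: int, r: int) -> float:
    """Evaluate the right-hand side of (14) for integer r in [K-Q+1 : K]."""
    lo, hi = r + Q - K, min(r, Q - 1)
    return sum((Q - ell) / (Q * ell) * comb(Q, ell) * comb(K - Q, r - ell)
               / comb(K, r) for ell in range(lo, hi + 1))

K = 10
# Q = K (no stragglers): (14) reduces to the CDC load (1/r)(1 - r/K) of [2].
for r in range(1, K):
    assert abs(L_star(K, K, r) - (1 / r) * (1 - r / K)) < 1e-12
# Q = 1: feasibility forces r = K, and the empty sum gives zero load.
assert L_star(K, 1, K) == 0
```

The empty-sum convention for $Q = 1$ mirrors the fact that a single active node must store every file and hence needs no shuffle communication at all.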
For a general $r \in [K-Q+1, K]$, $L_Q^\star(r)$ is given by the lower convex envelope of the above points.

Theorem 1 characterizes the storage-communication tradeoff for all positive integers $Q \le K$. Fig. 1 shows the curves $\{L_Q^\star(r) : Q \in [K]\}$ for $K = 10$. Notice that when $Q = 1$, the curve reduces to the single point $(K, 0)$, while when $Q = K$, the curve coincides with the optimal tradeoff without straggling nodes obtained in [2]. In fact, in the case $Q = K$, our proposed scheme becomes the CDC scheme in [2].

[Fig. 1. Storage-communication tradeoff $L_Q^\star(r)$ for $Q \in [K]$ when $K = 10$.]

IV. ACHIEVABILITY

In this section, we first present our proposed scheme and then analyze its performance to prove the achievability part of Theorem 1.

A. Coded Computing for Straggling Systems (CCS)

Let $r$ be an integer value in $[K-Q+1 : K]$. Consider $N$ a multiple of $C_K^r$. Partition the $N$ files into $C_K^r$ batches, each containing $\eta_r \triangleq N / C_K^r$ files. Each batch is associated with a subset $\mathcal{T}$ of $[K]$ of cardinality $r$. Let $\mathcal{W}_{\mathcal{T}}$ denote the batch of $\eta_r$ files associated with the set $\mathcal{T}$. Then,
\[
\mathcal{W} = \{w_1, \ldots, w_N\} = \bigcup_{\mathcal{T} \in \Omega_r} \mathcal{W}_{\mathcal{T}}. \tag{15}
\]
We now describe the map, shuffle, and reduce procedures of the CCS scheme.

1) Map Phase: Each node $k$ stores the batches $\mathcal{W}_{\mathcal{T}}$ such that $k \in \mathcal{T}$, i.e.,
\[
\mathcal{M}_k = \bigcup_{\mathcal{T} :\, \mathcal{T} \in \Omega_r,\ k \in \mathcal{T}} \mathcal{W}_{\mathcal{T}}, \tag{16}
\]
and tries to compute all IVAs in (5).

2) Shuffle Phase: Let $\mathcal{A}$ be the active set. Pick any uniform assignment of output functions $\mathcal{D}_{\mathcal{A}} = \{\mathcal{D}_k\}_{k \in \mathcal{A}}$. For any subset $\mathcal{T} \subseteq [K]$ of size $r$ and any $k \in \mathcal{A}$, let
\[
\mathcal{V}_{\mathcal{T},k} \triangleq \{v_{d,n} : d \in \mathcal{D}_k,\ w_n \in \mathcal{W}_{\mathcal{T}}\}. \tag{17}
\]
The shuffle phase is accomplished in $\min\{K-Q+1, K-r\}$ rounds, where each round is indexed by an integer $\ell \in [r+Q-K : \min\{r, Q-1\}]$. Consider round $\ell$, where we define all subsets of $[K]$ of size $r$ with exactly $\ell$ active nodes:
\[
\Omega_{\ell,r}^{\mathcal{A}} \triangleq \{\mathcal{T} \in \Omega_r : |\mathcal{T} \cap \mathcal{A}| = \ell\}. \tag{18}
\]
For any $j \in \mathcal{A}$ and any $\mathcal{T} \in \Omega_{\ell,r}^{\mathcal{A}}$ with $j \notin \mathcal{T}$, evenly split the IVA set $\mathcal{V}_{\mathcal{T},j}$ into $\ell$ disjoint packets that we label as $\{\mathcal{V}_{\mathcal{T},j}^{i} : i \in \mathcal{T} \cap \mathcal{A}\}$. So,
\[
\mathcal{V}_{\mathcal{T},j} = \{\mathcal{V}_{\mathcal{T},j}^{i} : i \in \mathcal{T} \cap \mathcal{A}\}. \tag{19}
\]
Each node $k \in \mathcal{A}$ then multicasts the bit-wise XOR
\[
X_{\mathcal{T}}^{k} \triangleq \bigoplus_{j \in \mathcal{T} \cap \mathcal{A}} \mathcal{V}_{(\mathcal{T} \setminus \{j\}) \cup \{k\},\, j}^{k}, \qquad \mathcal{T} \in \Omega_{\ell,r}^{\mathcal{A}} \ \text{s.t.}\ k \notin \mathcal{T}. \tag{20}
\]
Node $k$ can compute all these signals because it can compute all IVAs $\{\mathcal{V}_{\mathcal{T},j} : \mathcal{T} \in \Omega_r \text{ and } k \in \mathcal{T}\}$, see (16).

3) Reduce Phase: Any given node $k \in \mathcal{A}$ can recover all IVAs
\[
\{\mathcal{V}_{\mathcal{T},k} : \mathcal{T} \in \Omega_r,\ k \in \mathcal{T}\} \tag{21}
\]
locally, see again (16). To obtain the missing IVAs
\[
\{\mathcal{V}_{\mathcal{T},k} : \mathcal{T} \in \Omega_r,\ k \notin \mathcal{T}\}, \tag{22}
\]
it again proceeds in rounds $\ell = r+Q-K, \ldots, \min\{r, Q-1\}$. In round $\ell$, it forms, for each $\mathcal{T} \in \Omega_{\ell,r}^{\mathcal{A}}$ such that $k \notin \mathcal{T}$ and each $j \in \mathcal{T} \cap \mathcal{A}$, the IVA packet
\[
\mathcal{V}_{\mathcal{T},k}^{j} = \Bigg(\bigoplus_{i \in (\mathcal{T} \cap \mathcal{A}) \setminus \{j\}} \mathcal{V}_{(\mathcal{T} \setminus \{i\}) \cup \{k\},\, i}^{j}\Bigg) \oplus X_{(\mathcal{T} \setminus \{j\}) \cup \{k\}}^{j}. \tag{23}
\]
Notice that:
1) Node $k$ can form the packet in (23) since it obtained the signal $X_{(\mathcal{T} \setminus \{j\}) \cup \{k\}}^{j}$ from node $j$, and it can compute the IVA packets $\{\mathcal{V}_{(\mathcal{T} \setminus \{i\}) \cup \{k\},\, i}^{j} : i \in (\mathcal{T} \cap \mathcal{A}) \setminus \{j\}\}$ locally.
2) Node $k$ can decode all the IVAs in (22), since for any $\mathcal{T} \in \Omega_r$ such that $k \notin \mathcal{T}$, we have $r+Q-K \le |\mathcal{T} \cap \mathcal{A}| \le \min\{r, Q-1\}$, and thus $\mathcal{T}$ must be an element of $\Omega_{\ell,r}^{\mathcal{A}}$ for some $\ell \in [r+Q-K : \min\{r, Q-1\}]$.

Finally, it computes the output functions through (10).

Remark 2: By (16), the storage design $\{\mathcal{M}_k\}_{k=1}^{K}$ does not depend on the parameter $Q$ but only on the available storage space $r$. The proposed storage design is thus universally optimal, irrespective of the size of the active set. In practice, the map phase can thus be carried out even without knowing how many nodes will be participating in the reduce phase.

B. Performance Analysis

By (16), each node $k$ stores $C_{K-1}^{r-1}$ batches, each of which contains $\eta_r$ files. Thus, the storage space is
\[
\frac{\sum_{k=1}^{K} |\mathcal{M}_k|}{N} = \frac{K\, C_{K-1}^{r-1}\, \eta_r}{N} = r. \tag{24}
\]
When the active set is $\mathcal{A}$, denote the communication load of the CCS scheme by $L_{\mathcal{A}}$. For each $\ell \in [r+Q-K : \min\{r, Q-1\}]$, by (19) and (20), there are in total $(Q-\ell)\, C_Q^{\ell}\, C_{K-Q}^{r-\ell}$ signals sent in round $\ell$, each of size $\frac{\eta_r D V}{Q \ell}$ bits. Note that $L_{\mathcal{A}}$ does not depend on the realization of the active set
but only on its size $Q$. Hence, the average communication load is
\begin{align}
L &= \mathbb{E}[L_{\mathcal{Q}}] \tag{25} \\
&= \sum_{\ell = r+Q-K}^{\min\{r,\, Q-1\}} (Q-\ell)\, C_Q^{\ell}\, C_{K-Q}^{r-\ell}\, \frac{\eta_r D V}{Q\ell\, D N V} \tag{26} \\
&= \sum_{\ell = r+Q-K}^{\min\{r,\, Q-1\}} \frac{Q-\ell}{Q\ell} \cdot \frac{C_Q^{\ell}\, C_{K-Q}^{r-\ell}}{C_K^{r}}. \tag{27}
\end{align}
We have thus proved the achievability for integer $r \in [K-Q+1 : K]$. For general $r \in [K-Q+1, K]$, the lower convex envelope can be achieved with memory- and time-sharing.

V. CONVERSE

Let $\{\mathcal{M}_k\}_{k=1}^{K}$ be a storage design and $(r, L)$ be a storage-communication pair achieved based on $\{\mathcal{M}_k\}_{k=1}^{K}$. For each $s \in [K-Q+1 : K]$, define
\[
a_s^{\mathcal{M}} \triangleq \frac{1}{N} \sum_{\mathcal{I} \subseteq [K] :\, |\mathcal{I}| = s} \bigg| \bigcap_{k \in \mathcal{I}} \mathcal{M}_k \setminus \bigcup_{k \in [K] \setminus \mathcal{I}} \mathcal{M}_k \bigg|, \tag{28}
\]
i.e., $N a_s^{\mathcal{M}}$ is the number of files stored exactly $s$ times across all the nodes. Then, by definition, $a_s^{\mathcal{M}}$ satisfies
\begin{align}
\sum_{s=K-Q+1}^{K} a_s^{\mathcal{M}} &= 1, \tag{29} \\
a_s^{\mathcal{M}} &\ge 0, \tag{30} \\
\sum_{s=K-Q+1}^{K} s\, a_s^{\mathcal{M}} &\le r, \tag{31}
\end{align}
where (29) holds since feasibility requires each file to be stored at least $K-Q+1$ times. For any $\mathcal{A} \in \Omega_Q$ and any $\ell \in [Q]$, define
\[
b_{\ell}^{\mathcal{M},\mathcal{A}} \triangleq \sum_{\mathcal{I} \subseteq \mathcal{A} :\, |\mathcal{I}| = \ell} \bigg| \bigcap_{k \in \mathcal{I}} \mathcal{M}_k \setminus \bigcup_{k \in \mathcal{A} \setminus \mathcal{I}} \mathcal{M}_k \bigg|, \tag{32}
\]
i.e., $b_{\ell}^{\mathcal{M},\mathcal{A}}$ is the number of files stored exactly $\ell$ times in the nodes of $\mathcal{A}$. Since any file that is stored $s$ times across all the nodes is stored exactly $\ell$ times in exactly $C_s^{\ell}\, C_{K-s}^{Q-\ell}$ subsets $\mathcal{A} \in \Omega_Q$, we have
\[
\sum_{\mathcal{A} \in \Omega_Q} b_{\ell}^{\mathcal{M},\mathcal{A}} = N \sum_{s = \max\{\ell,\, K-Q+1\}}^{K} a_s^{\mathcal{M}}\, C_s^{\ell}\, C_{K-s}^{Q-\ell}. \tag{33}
\]
Consider the case that the active set is $\mathcal{A}$; then all the output functions are computed by the nodes in $\mathcal{A}$. For the sub-system with computing nodes in $\mathcal{A}$, under the storage design $\{\mathcal{M}_k\}_{k \in \mathcal{A}}$, the optimal communication load $L^{\mathcal{M},\mathcal{A}}$ has the following lower bound by [2, Lemma 1]:
\[
L^{\mathcal{M},\mathcal{A}} \ge \sum_{\ell=1}^{Q-1} \frac{b_{\ell}^{\mathcal{M},\mathcal{A}}}{N} \cdot \frac{Q-\ell}{Q\ell}. \tag{34}
\]
Then the communication load satisfies
\begin{align}
L &= \mathbb{E}\left[\frac{\sum_{k \in \mathcal{Q}} |X_k^{\mathcal{Q}}|}{D N V}\right] \tag{35} \\
&= \sum_{\mathcal{A} \in \Omega_Q} \mathbb{E}\left[\frac{\sum_{k \in \mathcal{A}} |X_k^{\mathcal{A}}|}{D N V}\right] \Pr\{\mathcal{Q} = \mathcal{A}\} \tag{36} \\
&\ge \sum_{\mathcal{A} \in \Omega_Q} L^{\mathcal{M},\mathcal{A}}\, \Pr\{\mathcal{Q} = \mathcal{A}\} \tag{37} \\
&\ge \frac{1}{C_K^{Q}} \sum_{\mathcal{A} \in \Omega_Q} \sum_{\ell=1}^{Q-1} \frac{Q-\ell}{Q\ell} \cdot \frac{b_{\ell}^{\mathcal{M},\mathcal{A}}}{N} \tag{38} \\
&= \frac{1}{C_K^{Q}} \sum_{\ell=1}^{Q-1} \frac{Q-\ell}{Q\ell} \sum_{\mathcal{A} \in \Omega_Q} \frac{b_{\ell}^{\mathcal{M},\mathcal{A}}}{N} \tag{39} \\
&\stackrel{(a)}{=} \frac{1}{C_K^{Q}} \sum_{\ell=1}^{Q-1} \frac{Q-\ell}{Q\ell} \sum_{s=\max\{\ell,\, K-Q+1\}}^{K} a_s^{\mathcal{M}}\, C_s^{\ell}\, C_{K-s}^{Q-\ell} \tag{40} \\
&= \frac{1}{C_K^{Q}} \sum_{s=K-Q+1}^{K} a_s^{\mathcal{M}} \sum_{\ell=s+Q-K}^{\min\{s,\, Q-1\}} \frac{Q-\ell}{Q\ell}\, C_s^{\ell}\, C_{K-s}^{Q-\ell}, \tag{41}
\end{align}
where $(a)$ follows from (33). Let $Z(x)$ be the function defined over the interval $[K-Q+1, K]$ by connecting the points $(s, Z(s))$ sequentially, where
\[
Z(s) \triangleq \sum_{\ell=s+Q-K}^{\min\{s,\, Q\}} \frac{Q-\ell}{Q\ell}\, C_s^{\ell}\, C_{K-s}^{Q-\ell} \tag{42}
\]
is defined for any $s \in [K-Q+1 : K]$; since the term $\ell = Q$ vanishes, the inner sum in (41) equals $Z(s)$. In the Appendix, we prove the following claim.

Claim 1: The function $Z(x)$ is a convex, decreasing function over $[K-Q+1, K]$.

By (29) and (30), the $a_s^{\mathcal{M}}$ are coefficients between 0 and 1 whose sum is 1. Hence, by (41) and Claim 1, we have
\begin{align}
L &\ge \frac{1}{C_K^{Q}} \sum_{s=K-Q+1}^{K} a_s^{\mathcal{M}}\, Z(s) \tag{43} \\
&\ge \frac{1}{C_K^{Q}}\, Z\bigg(\sum_{s=K-Q+1}^{K} s\, a_s^{\mathcal{M}}\bigg) \tag{44} \\
&\ge \frac{Z(r)}{C_K^{Q}}, \tag{45}
\end{align}
where (44) follows from Jensen's inequality and the convexity of $Z(x)$, and (45) from (31) and the monotonicity of $Z(x)$. In particular, when $r \in [K-Q+1 : K]$ is an integer,
\begin{align}
L &\ge \sum_{\ell = r+Q-K}^{\min\{r,\, Q\}} \frac{Q-\ell}{Q\ell} \cdot \frac{C_r^{\ell}\, C_{K-r}^{Q-\ell}}{C_K^{Q}} \tag{46} \\
&= \sum_{\ell = r+Q-K}^{\min\{r,\, Q-1\}} \frac{Q-\ell}{Q\ell} \cdot \frac{C_Q^{\ell}\, C_{K-Q}^{r-\ell}}{C_K^{r}}, \tag{47}
\end{align}
where (47) uses the identity $C_r^{\ell} C_{K-r}^{Q-\ell} / C_K^{Q} = C_Q^{\ell} C_{K-Q}^{r-\ell} / C_K^{r}$. As a result, by (45), (47), and Claim 1, we have proved the lower bound for any $r \in [K-Q+1, K]$.
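The achievability scheme and the converse can be cross-checked numerically on a toy instance. The sketch below is our own code, not the authors'; the helper names and the choice $K = 4$, $Q = 3$, $r = 2$ are ours. It simulates the CCS shuffle of (19), (20) and the decoding rule (23), modeling each IVA packet as a random 32-bit integer, then verifies that (i) every active node recovers all its missing IVA packets, and (ii) the normalized load of the simulated exchange matches both the achievable expression (27) and the converse bound $Z(r)/C_K^Q$ from (42) and (45).

```python
import random
from itertools import combinations
from math import comb

K, Q, r = 4, 3, 2                               # toy instance with r = K - Q + 1
A = {0, 1, 2}                                    # one realization of the active set
batches = [set(T) for T in combinations(range(K), r)]

# One packet V^i_{T,j} per (batch T, destination j in A \ T, index i in T ∩ A),
# modeled as a random 32-bit integer.
rng = random.Random(0)
packet = {(frozenset(T), j, i): rng.getrandbits(32)
          for T in batches for j in A - T for i in T & A}

def signal(T, k):
    """X^k_T of (20): XOR of packets node k can compute, for active k not in T."""
    out = 0
    for j in T & A:
        out ^= packet[(frozenset((T - {j}) | {k}), j, k)]
    return out

# Reduce phase, eq. (23): node k recovers every packet of V_{T,k} with k not in T
# from one received signal plus locally computable side terms.
for k in A:
    for T in batches:
        if k in T:
            continue
        for j in T & A:
            val = signal((T - {j}) | {k}, j)     # multicast by node j
            for i in (T & A) - {j}:              # side terms computable at node k
                val ^= packet[(frozenset((T - {i}) | {k}), i, j)]
            assert val == packet[(frozenset(T), k, j)]

# Load check: each signal carries eta_r * D * V / (Q * ell) bits, i.e. a fraction
# 1 / (C_K^r * Q * ell) of D*N*V; compare with (27) and with Z(r) / C_K^Q.
load = sum(1 / (comb(K, r) * Q * len(T & A))
           for T in batches for k in A - T)
achievable = sum((Q - l) / (Q * l) * comb(Q, l) * comb(K - Q, r - l) / comb(K, r)
                 for l in range(r + Q - K, min(r, Q - 1) + 1))
Z_r = sum((Q - l) / (Q * l) * comb(r, l) * comb(K - r, Q - l)
          for l in range(r + Q - K, min(r, Q) + 1))
assert abs(load - achievable) < 1e-9 and abs(load - Z_r / comb(K, Q)) < 1e-9
```

For this instance the three quantities all equal $5/12$, illustrating that the CCS load meets the converse bound with equality at integer $r$.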
ACKNOWLEDGMENT

The work of Q. Yan and M. Wigger has been supported by the ERC under grant agreement 715111.

APPENDIX
PROOF OF CLAIM 1

We prove Claim 1 by verifying that the sequence $\{Z(s)\}_{s=K-Q+1}^{K}$ is a discrete, monotonically decreasing, convex sequence, i.e.,
\begin{align}
Z(s+1) - Z(s) &< 0, \tag{48} \\
Z(s+1) - Z(s) &> Z(s) - Z(s-1). \tag{49}
\end{align}
Notice that by (42),
\begin{align}
Z(s) &= \sum_{\ell=s+Q-K}^{\min\{s,\, Q\}} \frac{1}{\ell}\, C_s^{\ell}\, C_{K-s}^{Q-\ell} - \frac{1}{Q} \sum_{\ell=s+Q-K}^{\min\{s,\, Q\}} C_s^{\ell}\, C_{K-s}^{Q-\ell} \tag{50} \\
&\stackrel{(a)}{=} \sum_{\ell=s+Q-K}^{\min\{s,\, Q\}} \frac{1}{\ell}\, C_s^{\ell}\, C_{K-s}^{Q-\ell} - \frac{C_K^{Q}}{Q}, \tag{51}
\end{align}
where in $(a)$ we used the Vandermonde equality $\sum_{\ell=s+Q-K}^{\min\{s,Q\}} C_s^{\ell}\, C_{K-s}^{Q-\ell} = C_K^{Q}$. Thus the constant $C_K^{Q}/Q$ shows up in both $Z(s+1)$ and $Z(s)$ for all $s \in [K-Q+1 : K-1]$. Hence (here and below, sums over $\ell$ extend over all integers for which the summands are nonzero),
\begin{align}
Z(s+1) - Z(s) &= \sum_{\ell} \frac{1}{\ell} \left( C_{s+1}^{\ell}\, C_{K-s-1}^{Q-\ell} - C_s^{\ell}\, C_{K-s}^{Q-\ell} \right) \\
&\stackrel{(a)}{=} \sum_{\ell} \frac{1}{\ell} \left( C_s^{\ell-1}\, C_{K-s-1}^{Q-\ell} - C_s^{\ell}\, C_{K-s-1}^{Q-\ell-1} \right) \\
&\stackrel{(b)}{=} \sum_{\ell} \Big( \frac{1}{\ell+1} - \frac{1}{\ell} \Big) C_s^{\ell}\, C_{K-s-1}^{Q-\ell-1} \tag{57} \\
&< 0, \tag{58}
\end{align}
where in $(a)$ we applied the equality $C_{n+1}^{m+1} = C_n^{m+1} + C_n^{m}$ to both $C_{s+1}^{\ell}$ and $C_{K-s}^{Q-\ell}$, and in $(b)$ we used the variable change $\ell \to \ell+1$ in the first summation; the inequality holds since $\frac{1}{\ell+1} < \frac{1}{\ell}$. Moreover, following similar steps as above, by (57), for $s \in [K-Q+2 : K-1]$ we can derive
\begin{align}
&\big(Z(s+1) - Z(s)\big) - \big(Z(s) - Z(s-1)\big) \tag{59} \\
&= \sum_{\ell} \Big( \frac{1}{\ell+1} - \frac{1}{\ell} \Big) \left( C_s^{\ell}\, C_{K-s-1}^{Q-\ell-1} - C_{s-1}^{\ell}\, C_{K-s}^{Q-\ell-1} \right) \\
&= \sum_{\ell} \Big( \frac{1}{\ell+1} - \frac{1}{\ell} \Big) \left( C_{s-1}^{\ell-1}\, C_{K-s-1}^{Q-\ell-1} - C_{s-1}^{\ell}\, C_{K-s-1}^{Q-\ell-2} \right) \\
&= \sum_{\ell} \Big( \frac{1}{\ell} - \frac{2}{\ell+1} + \frac{1}{\ell+2} \Big) C_{s-1}^{\ell}\, C_{K-s-1}^{Q-\ell-2} \\
&> 0, \tag{65}
\end{align}
where the last inequality holds by the strict convexity of $x \mapsto 1/x$. We have thus proved (48) and (49). As a result, the piecewise linear function $Z(x)$ connecting the points $\{(s, Z(s))\}_{s=K-Q+1}^{K}$ sequentially is a convex, decreasing function.

REFERENCES

[1] Y. Liu, J. Yang, Y. Huang, L. Xu, S. Li, and M. Qi, "MapReduce based parallel neural networks in enabling large scale machine learning," Comput. Intell. Neurosci., 2015.
[2] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, "A fundamental tradeoff between computation and communication in distributed computing," IEEE Trans. Inf. Theory, vol. 64, no. 1, pp. 109–128, Jan. 2018.
[3] Q. Yan, S. Yang, and M. Wigger, "Storage, computation, and communication: A fundamental tradeoff in distributed computing," arXiv:1806.07565.
[4] Y. H. Ezzeldin, M. Karmoose, and C.
Fragouli, "Communication vs distributed computation: An alternative trade-off curve," in Proc. IEEE Inf. Theory Workshop (ITW), Kaohsiung, Taiwan, pp. 279–283, Nov. 2017.
[5] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1514–1529, Mar. 2018.
[6] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "A unified coding framework for distributed computing with straggling servers," in Proc. IEEE Globecom Workshops (GC Wkshps), Washington, DC, USA, Dec. 2016.
[7] J. Zhang and O. Simeone, "Improved latency-communication trade-off for map-shuffle-reduce systems with stragglers," arXiv:1808.06583.
[8] Q. Yu, M. Maddah-Ali, and S. Avestimehr, "Polynomial codes: An optimal design for high-dimensional coded matrix multiplication," in Proc. 31st Annual Conf. Neural Inf. Processing Systems (NIPS), Long Beach, CA, USA, May 2017.
[9] Q. Yu, M. Maddah-Ali, and S. Avestimehr, "Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Vail, CO, USA, pp. 2022–2026, Jun. 2018.
[10] T. Baharav, K. Lee, O. Ocal, and K. Ramchandran, "Straggler-proofing massive-scale distributed matrix multiplication with d-dimensional product codes," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Vail, CO, USA, pp. 1993–1997, Jun. 2018.
[11] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis, "Gradient coding from cyclic MDS codes and expander graphs," arXiv:1707.03858.