MANY enterprises, including Google, Facebook, Amazon. Capacity of Clustered Distributed Storage

Size: px
Start display at page:

Download "MANY enterprises, including Google, Facebook, Amazon. Capacity of Clustered Distributed Storage"

Transcription

1 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY Capacity of Clustered Distributed Storage Jy-yong Sohn, Student ember, IEEE, Beongjun Choi, Student ember, IEEE, Sung Whan Yoon, ember, IEEE, and Jaekyun oon, Fellow, IEEE Abstract A new system model reflecting the clustered structure of distributed storage is suggested to investigate interplay between storage overhead and repair bandwidth as storage node failures occur. Large data centers with multiple racks/disks or local networks of storage devices (e.g. sensor network are good applications of the suggested clustered model. In realistic scenarios involving clustered storage structures, repairing storage nodes using intact nodes residing in other clusters is more bandwidth-consuming than restoring nodes based on information from intra-cluster nodes. Therefore, it is important to differentiate between intra-cluster repair bandwidth and cross-cluster repair bandwidth in modeling distributed storage. Capacity of the suggested model is obtained as a function of fundamental resources of distributed storage systems, namely, node storage capacity, intra-cluster repair bandwidth and cross-cluster repair bandwidth. The capacity is shown to be asymptotically equivalent to a monotonic decreasing function of number of clusters, as the number of storage nodes increases without bound. Based on the capacity expression, feasible sets of required resources which enable reliable storage are obtained in a closed-form solution. Specifically, it is shown that the cross-cluster traffic can be minimized to zero (i.e., intra-cluster local repair becomes possible by allowing extra resources on storage capacity and intra-cluster repair bandwidth, according to the law specified in the closed-form. The network coding schemes with zero crosscluster traffic are defined as intra-cluster repairable codes, which are shown to be a class of the previously developed locally repairable codes. Index Terms Capacity, Distributed storage, Network coding I. INTRODUCTION ANY enterprises, including Google, Facebook, Amazon and icrosoft, use cloud storage systems in order to support massive amounts of data storage requests from clients. In the emerging Internet-of-Thing (IoT era, the number of devices which generate data and connect to the network increases exponentially, so that efficient management of data center becomes a formidable challenge. However, since cloud storage systems are composed of inexpensive commodity disks, failure events occur frequently, degrading the system reliability []. In order to ensure reliability of cloud storage, distributed storage systems (DSSs with erasure coding have been considered to improve tolerance against storage node failures [3] [8]. In such systems, the original file is encoded and distributed into multiple storage nodes. When a node fails, The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 344, Republic of Korea ( {jysohn08, bbzang0, shyoon8}@kaist.ac.kr, jmoon@kaist.edu. A part of this paper was presented [] at the IEEE Conference on Communications (ICC, Paris, France, ay -5, 07. This work is in part supported by the National Research Foundation of Korea under Grant No. 06RAB4098, and in part supported by the ICT R&D program of SIP/IITP [ , Research on Adaptive achine Learning Technology Development for Intelligent Autonomous Digital Companion]. a newcomer node regenerates the failed node by contacting a number of survived nodes. 
This causes traffic burden across the network, taking up significant repair bandwidth. Earlier distributed storage systems utilized the 3-replication code: the original file was replicated three times, and the replicas were stored in three distinct nodes. The 3-replication coded systems require the minimum repair bandwidth, but incur high storage overhead. Reed-Solomon (RS codes are also used (e.g. HDFS-RAID in Facebook [9], which allow minimum storage overhead; however, RS-coded systems suffer from high repair bandwidth. The pioneering work of [0] on distributed storage systems focused on the relationship between two required resources, the storage capacity α of each node and the repair bandwidth γ, when the system aims to reliably store a file under node failure events. The optimal (α, γ pairs are shown to have a fundamental trade-off relationship, to satisfy the maximumdistance-separable (DS property (i.e., any k out of n storage nodes can be accessed to recover the original file of the system. oreover, the authors of [0] obtained capacity C, the maximum amount of reliably storable data, as a function of α and γ. The authors related the failure-repair process of a DSS with the multi-casting problem in network information theory, and exploited the fact that a cut-set bound is achievable by network coding []. Since the theoretical results of [0], explicit network coding schemes [] [4] which achieve the optimal (α, γ pairs have also been suggested. These results are based on the assumption of homogeneous systems, i.e., each node has the same storage capacity and repair bandwidth. However, in real data centers, storage nodes are dispersed into multiple clusters (in the form of disks or racks [7], [8], [5], allowing high reliability against both node and rack failure events. In this clustered system, repairing a failed node gives rise to both intra-cluster and cross-cluster repair traffic. While the current data centers have abundant intrarack communication bandwidth, cross-rack communication is typically limited. According to [6], nearly a 80TB of crossrack repair bandwidth is required everyday in the Facebook warehouse, limiting cross-rack communication for foreground map-reduce jobs. oreover, surveys [7] [9] on network traffic within data centers show that cross-rack communication is oversubscribed; the available cross-rack communication bandwidth is typically 5 0 times lower than the intra-rack bandwidth in practical systems. Thus, a new system model which reflects the imbalance between intra- and cross-cluster repair bandwidths is required. A. ain Contributions This paper suggests a new system model for clustered DSS to reflect the clustered nature of real distributed storage

2 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY systems wherein an imbalance exists between intra- and crosscluster repair burdens. This model can be applied to not only large data centers, but also local networks of storage devices such as the sensor networks or home clouds which are expected to be prevalent in the IoT era. This model is also more general in the sense that when the intra- and crosscluster repair bandwidths are set to be equal, the resulting structure reduces to the original DSS model of [0]. This paper only considers recovering a single node failure at a time, as in [0]. The main contributions of this paper can be seen as twofold: one is the derivation of a closed-form expression for capacity, and the other is the analysis on feasible sets of system resources which enable reliable storage. Closed-form Expression for Capacity: Under the setting of functional repair, storage capacity C of the clustered DSS is obtained as a function of node storage capacity α, intracluster repair bandwidth β I and cross-cluster repair bandwidth β c. The existence of the cluster structure manifested as the imbalance between intra/cross-cluster traffics makes the capacity analysis challenging; Dimakis proof in [0] cannot be directly extended to handle the problem at hand. We show that symmetric repair (obtaining the same amount of information from each helper node is optimal in the sense of maximizing capacity given the storage node size and total repair bandwidth, as also shown in [0] for the case of varying repair bandwidth across the nodes. However, we stress that in most practical scenarios, the need is greater for reducing cross-cluster communication burden, and we show that this is possible by trading with reduced overall storage capacity and/or increasing intra-repair bandwidth. Based on the derived capacity expression, we analyzed how the storage capacity C changes as a function of L, the number of clusters. It is shown that the capacity is asymptotically equivalent to C, some monotonic decreasing function of L. Analysis on Feasible (α, β I, β c Points: Given the need for reliably storing file, the set of required resource pairs, node storage capacity α, intra-cluster repair bandwidth β I and cross-cluster repair bandwidth β c, which enables C(α, β I, β c is obtained in a closed-form solution. In the analysis, we introduce ɛ β c /β I, a useful parameter which measures the ratio of the cross-cluster repair burden (per node to the intra-cluster burden. This parameter represents how scarce the available cross-cluster bandwidth is, compared to the abundant intra-cluster bandwidth. We here stress that the special case of ɛ 0 corresponds to the scenario where repair is done only locally via intra-cluster communication, i.e., when a node fails the repair process requires intra-cluster traffic only without any cross-cluster traffic. Thus, the analysis on the ɛ 0 case provides a guidance on the network coding for data centers for the scenarios where the available crosscluster (cross-rack bandwidth is very scarce. Similar to the non-clustered case of [0], the required node storage capacity and the required repair bandwidth show a trade-off relationship. In the trade-off curve, two extremal points - the minimum-bandwidth-regenerating (BR point and the minimum-storage-regenerating (SR point - have been further analyzed for various ɛ values. oreover, from the analysis on the trade-off curve, it is shown that the minimum storage overhead α /k is achievable if and only if ɛ n k. 
This implies that in order to reliably store file with minimum storage α /k, sufficiently large cross-cluster repair bandwidth satisfying ɛ n k is required. Finally, for the scenarios with the abundant intracluster repair bandwidth, the minimum required cross-cluster repair bandwidth β c to reliably store file is obtained as a function of node storage capacity α. B. Related Works Several researchers analyzed practical distributed storage systems with a goal in mind to reflect the non-homogeneous nature of storage nodes [0] [3]. A heterogeneous model was considered in [0], [] where the storage capacity and the repair bandwidth for newcomer nodes are generally nonuniform. Upper/lower capacity bounds for the heterogeneous DSS are obtained in [0]. An asymmetric repair process is considered in [], coining the terms, cheap and expensive nodes, based on the amount of data that can be transfered to any newcomer. The authors of [3] considered a flexible distributed storage system where the amount of information from helper nodes may be non-uniform, as long as the total repair bandwidth is bounded from above. The view points taken in these works are different from ours in that we adopt a notion of cluster and introduce imbalance between intra- and cross-cluster repair burdens. Recently, some researchers considered the clustered structure of data centers [], [4] [3]. Some recent works [4] [7] provided new system models for clustered DSS and shed light on fundamental aspects of the suggested system. In [4], the idea of [] is developed to a two-rack system, by setting the communication burden within a rack much lower than the burden across different racks, similar to our analysis. However, the authors of [4] only considered systems with two racks, while the current paper considers a general setting of L racks (clusters, and provides mathematical analysis on how the number of clusters (i.e., the dispersion of nodes affects the capacity of clustered distributed storage. Similar to the present paper, the authors of [5], [6] obtained the capacity of clustered distributed storage, and provided capacity-achieving regenerating coding schemes. However, the coding schemes considered in [5], [6] do not satisfy the DS property, and the capacity expression is obtained for limited scenarios when intra-cluster repair bandwidth β I is set to its maximum value. In contrast, the current paper provides the capacity expression for general values of β I, β c parameters, and analyzes the behavior of capacity as a function of the ratio ɛ β c /β I between intra- and cross-cluster repair bandwidths. oreover, unlike in the previous work, the capacity-achieving coding schemes (whose existence is shown here satisfy the DS property. In [7], the security issue in clustered distributed storage systems is considered, and the maximum amount of securely storable data in the existence of passive eavesdroppers is obtained. There also have been some recent works [8] [3] on network code design appropriate for clustered distributed storage. otivated by the limited available cross-rack repair bandwidth

3 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 3 in real data centers, the work of [8] provides a network coding scheme which minimizes the cross-rack bandwidth in clustered distributed storage systems. However, the suggested coding scheme is applicable for some limited (n, k, L parameters and the minimum storage overhead (α /k setting. On the other hand, the current paper provides the capacity expression for general (n, k, L, α setting, and proves the existence of capacity-achieving coding scheme. The authors of [9] proposed coding schemes tolerant to rack failure events in multi-rack storage systems, but have not addressed the imbalance between intra- and cross-cluster repair burdens in node failure events, which is an important aspect of the current paper. The authors of [30] considers the scenario of having grouped (clustered storage nodes where nodes in the same group are more accessible to each other, compared to the nodes in other groups. However, the focus is different to the present paper: [30] focuses on the code construction which has the minimum amount of accessed data (called the optimal access property, while the scope of the present paper is on finding the optimal trade-off between the node storage capacity and the repair bandwidth, as a function of the ratio ɛ of intraand cross-cluster communication burdens. Finally, a locally repairable code which can repair arbitrary node within each group is suggested in [3]; this code can suppress the intergroup repair bandwidth to zero. However, the coding scheme is suggested for the ɛ 0 case only, while the present paper provides the capacity expression for general 0 ɛ, and proves the existence of an optimal coding scheme. Compared to the conference version [] of the current work, this paper provides the formal proofs for the capacity expression, and obtains the feasible (α, γ region for 0 ɛ setting (only ɛ 0 is considered in []. The present paper also shows the behavior of capacity as a function of L, the number of clusters, and provides the sufficient and necessary conditions on ɛ β c /β I, to achieve the minimum storage overhead α /k. Finally, the asymptotic behaviors of the BR/SR points are investigated in this paper, and the connection between what we call the intra-cluster repairable codes and the existing locally repairable codes [3] is revealed. C. Organization This paper is organized as follows. Section II reviews preliminary materials about distributed storage systems and the information flow graph, an efficient tool for analyzing DSS. Section III proposes a new system model for the clustered DSS, and derives a closed-form expression for the storage capacity of the clustered DSS. The behavior of the capacity curves is also analyzed in this section. Based on the capacity expression, Section IV provides results on the feasible resource pairs which enable reliable storage of a given file. Further research topics on clustered DSS are discussed in Section V, and Section VI draws conclusions. The reason why the present paper considers this regime is provided in Section III-B. S xx iiii xx iiii xx iiii 3 xx iiii 4 αα αα αα αα xx oooooo xx oooooo 3 xx oooooo 4 xx oooooo ββ ββ ββ xx iiii 5 αα 5 xx oooooo Fig. : Information flow graph (n 4, k 3, d 3 A. Distributed Storage System II. BACKGROUND Distributed storage systems can maintain reliability by means of erasure coding [33]. The original data file is spread into n potentially unreliable nodes, each with storage size α. 
When a node fails, it is regenerated by contacting d < n helper nodes and obtaining a particular amount of data, β, from each helper node. The amount of communication burden imposed by one failure event is called the repair bandwidth, denoted as γ dβ. When the client requests a retrieval of the original file, assuming all failed nodes have been repaired, access to any k < n out of n nodes must guarantee a file recovery. The ability to recover the original data using any k < n out of n nodes is called the maximal-distance-separable (DS property. Distributed storage systems can be used in many applications such as large data centers, peer-to-peer storage systems and wireless sensor networks [0]. B. Information Flow Graph Information flow graph is a useful tool to analyze the amount of information flow from source to data collector in a DSS, as utilized in [0]. It is a directed graph consisting of three types of nodes: data source S, data collector DC, and storage nodes x i as shown in Fig.. Storage node x i can be viewed as consisting of input-node x i in and output-node xi out, which are responsible for the incoming and outgoing edges, respectively. x i in and xi out are connected by a directed edge with capacity identical to the storage size α of node x i. Data from source S is stored into n nodes. This process is represented by n edges going from S to {x i } n, where each edge capacity is set to infinity. A failure/repair process in a DSS can be described as follows. When a node x j fails, a new node x n+ joins the graph by connecting edges from d survived nodes, where each edge has capacity β. After all repairs are done, data collector DC chooses arbitrary k nodes to retrieve data, as illustrated by the edges connected from k survived nodes with infinite edge capacity. Fig. gives an example of information flow graph representing a distributed storage system with n 4, k 3, d 3. C. Notation used in the paper This paper requires many notations related to graphs, because it deals with information flow graphs. Here we provide the definition of each notation used in the paper. For the given system parameters, we denote G as the set of all possible DC

4 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 4 S nn nn II nn/ll st cluster nd cluster 3 rd cluster Fig. : Clustered distributed storage system (n 5, L 3 information flow graphs. A graph G G is denoted as G (V, E where V is the set of vertices and E is the set of edges in the graph. For a given graph G G, we call a set c E of edges as cut-set [34] if it satisfies the following: every directed path from S to DC includes at least one edge in c. An arbitrary cut-set c is usually denoted as c (U, U where U V and U V \ U (the complement of U satisfy the following: the set of edges from U to U is the given cut-set c. The set of all cut-sets available in G is denoted as C(G. For a graph G G and a cut-set c C(G, we denote the sum of edge capacities for edges in c as w(g, c, which is called the cut-value of c. A vector is denoted as v using the bold notation. For a vector v, the transpose of the vector is denoted as v T. A set is denoted as X {x, x,, x k }, while a sequence x, x,, x N is denoted as (x n N n, or simply (x n. For given sequences (a n and (b n, we use the term a n is asymptotically equivalent to b n [35] if and only if a n lim. ( n b n We utilize a useful notation: {, if i j ij 0, otherwise. For a positive integer n, we use [n] as a simplified notation for the set {,,, n}. For a non-positive integer n, we define [n]. Each storage node is represented as either x t (x t in, xt out as defined in Section II-B, or N(i, j as defined in (A.. Finally, important parameters used in this paper are summarized in Table I. III. CAPACITY OF CLUSTERED DSS A. Clustered Distributed Storage System A distributed storage system with multiple clusters is shown in Fig.. Data from source S is stored at n nodes which are grouped into L clusters. The number of nodes in each cluster is fixed and denoted as n I n/l. The storage size of each node is denoted as α. When a node fails, a newcomer node is regenerated by contacting d I helper nodes within the same cluster, and d c helper nodes from other clusters. This paper considers functional repair [33] in the regeneration process; the newcomer node may store different content from that of the failed node, while maintaining the DS property of the code. The amount of data a newcomer node receives within the same cluster is γ I d I β I (each node equally contributes to β I, and that from other clusters is γ c d c β c (each node equally contributes to β c. Fig. 3 illustrates an example of kk DC TABLE I: Parameters used in this paper n number of storage nodes k number of DC-contacting nodes L number of clusters n I n/l number of nodes in a cluster d I n I number of intra-cluster helper nodes d c n n I number of cross-cluster helper nodes F q base field which contains each symbol α storage capacity of each node β I intra-cluster repair bandwidth (per node β c cross-cluster repair bandwidth (per node γ I d I β I intra-cluster repair bandwidth γ c d c β c cross-cluster repair bandwidth γ γ I + γ c repair bandwidth ɛ β c /β I ratio of β c to β I (0 ɛ ξ γ c /γ ratio of γ c to γ (0 ξ < R k/n ratio of k to n (0 < R S x in x in x in 3 x in 4 α α α α x out x out 3 x out 4 x out β c β I β c Newcomer x in 5 α 5 x out st cluster nd cluster Fig. 3: Repair process in clustered DSS (n 4, L, d I, d c information flow graph representing the repair process in a clustered DSS. B. Assumptions for the System We assume that d c and d I have the maximum possible values (d c n n I, d I n I, since this is the capacitymaximizing choice, as formally stated in the following proposition. 
The proof of the proposition is in Appendix G-A. Proposition. Consider a clustered distributed storage system with given γ and γ c. Then, setting both d I and d c to their maximum values maximizes storage capacity. Note that the authors of [0] already showed that in the nonclustered scenario with given repair bandwidth γ, maximizing the number of helper nodes d is the capacity-maximizing choice. Here, we are saying that a similar property also holds for clustered scenario considered in the present paper. Under the setting of the maximum number of helper nodes, the overall repair bandwidth for a failure event is denoted as γ γ I + γ c (n I β I + (n n I β c ( Data collector DC contacts any k out of n nodes in the clustered DSS. Given that the typical intra-cluster communication bandwidth is larger than the cross-cluster bandwidth in real systems, we assume β I β c

5 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 5 throughout the present paper; this assumption limits our interest to ɛ. oreover, motivated by the security issue, we assume that a file cannot be retrieved entirely by contacting any single cluster having n I nodes. Thus, the number k of nodes contacted by the data collector satisfies We also assume k > n I. (3 L (4 which holds for most real DSSs. Usually, all storage nodes cannot be squeezed in a single cluster, i.e., L rarely happens in practical systems, to prevent losing everything when the cluster is destroyed. Note that many storage systems [7], [8], [6] including those of Facebook uses L n, i.e., every storage node reside in different racks (clusters, to tolerate the rack failure events. Finally, according to [6], nearly 98% of data recoveries in real systems deal with single node recovery. In other words, the portion of simultaneous multiple nodes failure events is small. Therefore, the present paper focuses single node failure events. C. The closed-form solution for Capacity Consider a clustered DSS with fixed n, k, L values. In this model, we want to find the set of feasible parameters (α, β I, β c which enables storing data of size. In order to find the feasible set, min-cut analysis on the information flow graph is required, similar to [0]. Depending on the failure-repair process and k nodes contacted by DC, various information flow graphs can be obtained. Let G be the set of all possible flow graphs. Consider a graph G G with minimum min-cut, the construction of which is specified in Appendix A. Based on the max-flow min-cut theorem in [], the maximum information flow from source to data collector for arbitrary G G is greater than or equal to C(α, β I, β c : min-cut of G, which is called the capacity of the system. In order to send data from the source to the data collector, C should be satisfied. oreover, if C is satisfied, there exists a linear network coding scheme [] to store a file with size. Therefore, the set of (α, β I, β c points which satisfies C is feasible in the sense of reliably storing the original file of size. Now, we state our main result in the form of a theorem which offers a closed-form solution for the capacity C(α, β I, β c of the clustered DSS. Note that setting β I β c reduces to capacity of the non-clustered DSS obtained in [0]. Theorem. The capacity of the clustered distributed storage system with parameters (n, k, L, α, β I, β c is g i i C(α, β I, β c min{α, ρ i β I +(n ρ i j g m β c }, j m (5 where ρ i n I i, (6 { k n g m I +, m (k mod n I k (7 n I, otherwise. The proof is in Appendix A. Note that the parameters used in the statement of Theorem have the following property, the proof of which is in Appendix G-B. Proposition. For every (i, j with i [n I ], j [g i ], we have oreover, holds. ρ i β I + (n ρ i j m i m g m β c γ. (8 g m k (9 D. Relationship between C and ɛ β c /β I In this subsection, we analyze the capacity of a clustered DSS as a function of an important parameter ɛ : β c /β I, (0 the cross-cluster repair burden per intra-cluster repair burden. In Fig. 4, capacity is plotted as a function of ɛ. From (, the total repair bandwidth can be expressed as γ γ I + γ c (n I β I + (n n I β c ( n I + (n n I ɛ β I. ( Using this expression, the capacity is expressed as g i C(ɛ j min {α, (n ρ i j i m g mɛ + ρ i (n n I ɛ + n I } γ. ( For fair comparison on various ɛ values, capacity is calculated for a fixed (n, k, L, α, γ set. The capacity is an increasing function of ɛ as shown in Fig. 4. 
This implies that for given resources α and γ, allowing a larger β c (until it reaches β I is always beneficial, in terms of storing a larger file. For example, under the setting in Fig. 4, allowing β c β I (i.e., ɛ can store 48, while setting β c 0 (i.e., ɛ 0 cannot achieve the same level of storage. This result is consistent with the previous work on asymmetric repair in [0], which proved that the symmetric repair maximizes capacity. Therefore, when the total communication amount γ is fixed, a loss of storage capacity is the cost we need to pay in order to reduce the communication burden β c across different clusters. E. Relationship between C and L In this subsection, we analyze the capacity of a clustered DSS as a function of L, the number of clusters. For fair comparison, (n, k, α, γ values are fixed for calculating capacity. In Fig. 5, capacity curves for two scenarios are plotted over a range of L values. First, the solid line corresponds to the scenario when the system has abundant cross-rack bandwidth

6 Capacity C(, I, c TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 6 Capacity C(, I, c Ratio of cross-/intra-cluster repair bandwidth ( Fig. 4: Capacity as a function of ɛ, under the setting of n 00, k 85, L 0, α, γ When the available is abundant (No over-subscription problem c When the system suffers from over-subscription problem ( / Number of Clusters (L Fig. 5: Capacity as a function of L, under the setting of n 00, k 80, α, γ 0 resources γ c. In this ideal scenario which does not suffer from the over-subscription problem, the system can store 80 irrespective of the dispersion of nodes. However, consider a practical situation where the available cross-rack bandwidth is scarce compared to the intra-rack bandwidth; for example, ξ γ c /γ /5. The dashed line in Fig. 5 corresponds to this scenario where the system has not enough cross-rack bandwidth resources. In this practical scenario, reducing L (i.e., gathering the storage nodes into a smaller number of clusters increases capacity. However, note that sufficient dispersion of data into a fair number of clusters is typically desired, in order to guarantee the reliability of storage in rack-failure events. Finding the optimal number L of clusters in this trade-off relationship remains as an important topic for future research. In Fig. 5, the capacity is a monotonic decreasing function of L when the system suffers from an over-subscription problem. However, in general (n, k, α, γ parameter settings, capacity is not always a monotonic decreasing function of L. Theorem illustrates the behavior of capacity as L varies, focusing on the special case of γ α. Before formally stating the next main result, we need to define σ : L n I L n, (3 the dispersion factor of a clustered storage system, as illustrated in Fig. 6. In the two-dimensional representation of a clustered distributed storage system, L represents the number of rows (clusters, while n I n/l represents the number of columns (nodes in each cluster. The dispersion factor σ is the L clusters n I nodes / cluster σ L n I 3 5 Fig. 6: An example of DSS with dispersion ratio σ 5/3, when the parameters are set to L 3, n I 5, n n IL 5 ratio of the number of rows to the number of columns. If L increases for a fixed n I, then σ grows and the nodes become more dispersed into multiple clusters. Now we state our second main result, which is about the behavior of C versus L. Theorem. Consider the γ α case when σ, γ, R and ξ are fixed. In the asymptotic regime of large n, capacity C(α, β I, β c is asymptotically equivalent to C k ( γ + n k n( /L γ c, (4 a monotonically decreasing function of L. This can also be stated as C C (5 as n for a fixed σ. Note that under the setting of Theorem, we have n I n L n Θ( n, nσ k nr Θ(n, α γ constant, β I γ I n I Θ(, β c γ c Θ( n n n I n in the asymptotic regime of large n. The proof of Theorem is based on the following two lemmas. Lemma. In the case of γ α, capacity C(α, β I, β c is upper/lower bounded as where and C is defined in (4. C C(α, β I, β c C + δ (6 δ n I(β I β c /8 (7 Lemma. In the case of γ α, C and δ defined in (4, (7 satisfy when γ, R and ξ are fixed. C Θ(n, δ O(n I The proofs of these lemmas are in Appendix H. Here we provide the proof of Theorem by using Lemmas and. proof (of Theorem. From Lemma, δ/c O(/L O( /nσ 0 (8

7 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 7 as n for a fixed σ. oreover, dividing (6 by C results in C/C + δ/c. (9 Putting (8 into (9 completes the proof. IV. DISCUSSION ON FEASIBLE (α, β I, β c In the previous section, we obtained the capacity of the clustered DSS. This section analyzes the feasible (α, β I, β c points which satisfy C(α, β I, β c for a given file size. Using γ γ I + γ c (n I β I + (n n I β c in (, the behavior of the feasible set of (α, γ points can be observed. Essentially, the feasible points demonstrate a tradeoff relationship. Two extreme points minimum storage regenerating (SR point and minimum bandwidth regenerating (BR point of the trade-off has been analyzed. oreover, a special family of network codes which satisfy ɛ 0, which we call the intra-cluster repairable codes, is compared with the locally repairable codes considered in [3]. Finally, the set of feasible (α, β c points is discussed, when the system allows maximum β I. Total Repair BW ( I + c /(n-k /(n-k Storage size ( Fig. 7: Optimal tradeoff between node storage size α and total repair bandwidth γ, under the setting of n 5, k 8, L 3, 8 Cross-cluster Repair BW ( c /(n-k /(n-k Storage size ( A. Set of Feasible (α, γ Points We provide a closed-form solution for the set of feasible (α, γ points which enable reliable storage of data. Based on the range of ɛ defined in (0, the set of feasible points show different behaviors as stated in Corollary. Corollary. Consider a clustered DSS for storing data, when ɛ β c /β I satisfies 0 ɛ. For any γ γ (α, the data can be reliably stored, i.e., C, while it is impossible to reliably store data when γ < γ (α. The threshold function γ (α can be obtained as: if n k ɛ, α (0, k tα γ s (α t, α [ t++s t+y t+, t+s ty t, (t k, k,, s 0, α [ s 0, (0 Otherwise (if 0 ɛ < n k, α (0, τα τ+ k iτ+ zi τ+s τ y τ γ s τ, α [ τ+ k, iτ+ zi (α tα s t, α [ t++s t+y t+, t+s ty t, (t τ, τ,, s 0, α [ s 0, ( Fig. 8: Optimal tradeoff between node storage size α and crossrack repair bandwidth γ c, under the setting of n 5, k 8, L 3, 8 where τ max{t {0,,, k } : z t } ( { k it+ zi s t (n I +ɛ(n n I, t 0,,, k (3 0, t k y t (n I + ɛ(n n I, z t { (4 z t (n n I t + h t ɛ + (n I h t, t [k], t 0 (5 s h t min{s [n I ] : g l t}, (6 l and {g l } n I l is defined in (7. An example for the trade-off results of Corollary is illustrated in Fig. 7, for various 0 ɛ values. Here, the ɛ (i.e., β I β c case corresponds to the symmetric repair in the non-clustered scenario [0]. The plot for ɛ 0 (or β c γ c 0 shows that the cross-cluster repair bandwidth can be reduced to zero with extra resources (α or γ I, where the amount of required resources are specified in Corollary. Note that in the case of ɛ 0, the storage system is completely localized, i.e., nodes can be repaired exclusively from their

8 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 8 γγ γγ γγ mmmmmm SR point αα: Node Capacity γγ: Repair Bandwidth γγ mmmmmm BR point αα mmmmmm αα mmmmmm αα (0 γγ mmbbbb ( γγ mmbbbb ( αα mmssss (0 αα mmssss εε 0 case εε case αα Fig. 9: Set of Feasible (α, γ Points Fig. 0: Optimal trade-off curves for ɛ 0, cluster. From Fig. 7, we can confirm that as ɛ decreases, extra resources (γ or α are required to reliably store the given data. oreover, Corollary suggests a mathematically interesting result, stated in the following Theorem, the proof of which is in Appendix B. Theorem 3. A clustered DSS can reliably store file with the minimum storage overhead α /k if and only if ɛ n k. (7 Note that α /k is the minimum storage overhead which can satisfy the DS property, as stated in [0]. The implication of Theorem 3 is shown in Fig. 7. Under the k 8 setting, data can be reliably stored with minimum storage overhead α /k for n k ɛ, while it is impossible to achieve minimum storage overhead α /k for 0 ɛ < n k. Finally, since reducing the cross-cluster repair burden γ c is regarded as a much harder problem compared to reducing the intra-cluster repair burden γ I, we also plotted feasible (α, γ c pairs for various ɛ values in Fig. 8. The plot for ɛ 0 obviously has zero γ c, while ɛ > 0 cases show a trade-off relationship. As ɛ increases, the minimum γ c value increases gradually. B. inimum-bandwidth-regenerating (BR point and inimum-storage-regenerating (SR point According to Corollary, the set of feasible (α, γ points shows a trade-off curve as in Fig. 9, for arbitrary n, k, L, ɛ settings. Here we focus on two extremal points: the minimum-bandwidth-regenerating (BR point and the minimum-storage-regenerating (SR point. As originally defined in [0], we call the point on the trade-off with minimum bandwidth γ as BR. Similarly, we call the point with minimum storage α as SR. Let (α msr, (ɛ γ msr (ɛ be the (α, γ pair of the SR point for given ɛ. Similarly, define (α (ɛ mbr, γ(ɛ mbr as the parameter pair for BR points. According to Corollary, the explicit (α, γ expression for the SR and BR points are as in the following Corollary, the proof of which is given in Appendix F-B. Among multiple points with minimum γ, the point having the smallest α is called the BR point. Similarly, among points with minimum α, the point with the minimum γ is called the SR point. Corollary. For a given 0 ɛ, we have (α msr, (ɛ γ msr (ɛ τ+ k ( k, k (, iτ+ zi k τ+ iτ+ zi k iτ+ zi s τ, 0 ɛ < n k s k, n k ɛ (8 (α (ɛ mbr, γ(ɛ mbr ( s 0, s 0. (9 Now we compare the SR and BR points for two extreme cases of ɛ 0 and ɛ. Using R : k/n and the dispersion ratio σ defined in (3, the asymptotic behaviors of BR and SR points are illustrated in the following theorem, the proof of which is in Appendix C. Theorem 4. Consider the SR point (α (ɛ msr and the BR point (α (ɛ mbr, γ(ɛ mbr for ɛ 0,. The minimum node storage for ɛ 0 is asymptotically equivalent to the minimum node storage for ɛ, i.e., msr, γ (ɛ α (0 msr α ( msr /k (30 as n for arbitrary fixed σ and R k/n. oreover, the BR point for ɛ 0 approaches the BR point for ɛ, i.e., (α (0 mbr, γ(0 mbr (α( mbr, γ( mbr, (3 as R k/n. The ratio between γ (0 mbr expressed as γ (0 mbr k γ ( n mbr and γ( mbr Note that under the setting of Theorem 4, we have α ( msr constant, k nr Θ(n, Θ(n is (3 in the asymptotic regime of large n. According to Theorem 4, the minimum storage for ɛ 0 can achieve /k as n with fixed R k/n. This result coincides with the result of Theorem 3. 
According to Theorem 3, the sufficient and necessary condition for achieving the minimum storage of α /k is ɛ n k n( R. (33

9 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 9 c i [c i, c i ] : A symbol containing sub-symbols γ c, c, c, c 6, c, c 6, c 3, c 4, c 5 c 7, c 8, c 9 c 0, c, c c 3, c 7, c 4, c 8, c 5, c 9, c 0, c 3, c 4 c, c 3, c 5 c, c 4, c 5 (a ɛ st cluster nd cluster (0 γ mbr ( γ mbr BR (ε BR (ε 0 α γ e i [e i, e i, e i 3, e i 4, e i 5 ] : A symbol containing 5 sub-symbols e, e e, e 3 e, e 3 e 4, e 5 e 4, e 6 e 5, e 6 (b ɛ 0 st cluster nd cluster Fig. : BR coding schemes for n 6, k 5, L, α 0 [sub-symbol], γ 0 [sub-symbol] As n increases with a fixed R, the lower bound on ɛ reduces, so that in the asymptotic regime, ɛ 0 can achieve α /k. oreover, Theorem 4 states that the BR point for ɛ 0 approaches the BR point for ɛ as R k/n goes to. Fig. provides two BR coding schemes with (n, k, L (6, 5,, which has different ɛ values; one coding scheme in Fig. a satisfies ɛ, while the other in Fig. b satisfies ɛ 0. The RSKR coding scheme [] is applied to the six nodes in Fig. a. Each node (illustrated as a rectangular box contains five c i symbols, where each symbol c i consists of two sub-symbols, c ( i and c ( i. Note that any symbol c i is shared by exactly two nodes in Fig. a, which is due to the property of RSKR coding. This system can reliably store fifteen symbols {c i } 5, or 30 sub-symbols {c ( i, c ( i } 5, since it satisfies two properties the exact repair property and the data recovery property as illustrated below. First, when a node fails, five other nodes transmit five symbols (one distinct symbol by each node, which exactly regenerates the failed node. Second, we can retrieve data, 30 sub-symbols {c ( i, c ( i } 5, by contacting any k 5 nodes. In Fig. b, each node contains two e i symbols, where each symbol e i consists of five sub-symbols, {e (j i } 5 j. Note that in Fig. b, any symbol e i is shared by exactly two nodes which reside in the same cluster. This is because we applied RSKR coding at each cluster in the system of Fig. b. This system can reliably store six symbols {e i } 6, or 30 sub-symbols i [6],j [5] {e(j i }, since it satisfies the exact repair property and the data recovery property. Note that both DSSs in Fig. reliably store 30 subsymbols, by using the node capacity of α 0 sub-symbols and the repair bandwidth of γ 0 sub-symbols. However, the former system requires γ c 6 cross-cluster repair bandwidth for each node failure event, while the latter system requires γ c 0 cross-cluster repair bandwidth. For example, if the leftmost node of the st cluster fails in Fig. a, then four sub-symbols {c (i, c(i } are transmitted within that cluster, R ( α mbr (0 α mbr Fig. : Relationship between BR point of ɛ 0 and that of ɛ while six sub-symbols {c (i 3, c(i 4, c(i 5 } are transmitted from the nd cluster. In the case of ɛ 0 in Fig. b, ten subsymbols {e (i, e(i }5 are transmitted within the st cluster and no sub-symbols are transmitted across the clusters, when the leftmost node of the st cluster fails. Thus, transition from the former system (Fig. a to the latter system (Fig. b reduces the cross-cluster repair bandwidth to zero, while maintaining the storage capacity and the required resource pair (α, γ. Likewise, we can reduce the cross-cluster repair bandwidth to zero while maintaining the storage capacity, in the case of R k/n. Note that γ (ɛ mbr α(ɛ mbr for 0 ɛ, from (9. Thus, the result of (3 in Theorem 4 can be expressed as Fig. at the asymptotic regime of large n, k. 
According to Fig., intra-cluster only repair (ɛ 0 or β c 0 is possible by using additional resources (α and γ in the R portion, compared to the symmetric repair (ɛ case. C. Intra-cluster Repairable Codes versus Locally Repairable Codes [3] Here, we define a family of network coding schemes for clustered DSS, which we call the intra-cluster repairable codes. In Corollary, we considered DSSs with ɛ 0, which can repair any failed node by using intra-cluster communication only. The optimal (α, γ trade-off curve which satisfies ɛ 0 is illustrated as the solid line with cross markers in Fig. 7. Each point on the curve is achievable (i.e., there exists a network coding scheme, according to the result of []. We call the network coding schemes for the points on the curve of ɛ 0 the intra-cluster repairable codes, since these coding schemes can repair any failed node by using intra-cluster communication only. The relationship between the intra-cluster repairable codes and the locally repairable codes (LRC of [3] are investigated in Theorem 5, the proof of which is given in Appendix D. Note that according to the definition in [3], an (n, l 0, m 0,, α-lrc encodes a file of size into n coded symbols, where each symbol contains α bits. In addition, any coded symbol of the LRC is regenerated by accessing at most l 0 other symbols (i.e., the code has repair locality of l 0, while the minimum distance of the code is m 0. Theorem 5. The intra-cluster repairable codes with storage α

10 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 0 Cross-cluster repair bandwidth ( c Node Capacity ( Fig. 3: Optimal tradeoff between cross-cluster repair bandwidth and node capacity, when n 00, k 85, L 0, 85 and β I α overhead α are the (n, l 0, m 0,, α-lrc codes of [3] where l 0 n I, m 0 n k +. It is confirmed that Theorem of [3], m 0 n +, (34 α l 0 α holds for every intra-cluster repairable code with storage overhead α. oreover, the equality of (34 holds if D. Required β c for a given α α α (0 msr, (k mod n I 0. Here we focus on the following question: when the available intra-cluster repair bandwidth is abundant, how much crosscluster repair bandwidth is required to reliably store file? We consider scenarios when the intra-cluster repair bandwidth (per node has its maximum value, i.e., β I α. Under this setting, Theorem 6 specifies the minimum required β c which satisfies C(α, β I, β c. The proof of Theorem 6 is given in Appendix E. Theorem 6. Suppose the intra-cluster repair bandwidth is set at the maximum value, i.e., β I α. For a given node capacity α, the clustered DSS can reliably store data if and only if β c βc where (k α n k, if α [ k, f k mα k (n i, if α [ f βc im+ m+, f m (m k, k 3,, s + k 0α k ik (n i, if α [ f, 0 + k0 + k 0 0, if α [ k 0,, (35 f m m + k im+ (n i, (36 n m k 0 k k n I. (37 Fig. 3 provides an example of the optimal trade-off relationship between β c and α, explained in Theorem 6. For α /k 0.04, the cross-cluster burden β c can be reduced to zero. However, as α decreases from /k 0, the system requires a larger β c value. For example, if α β I.05 in Fig. 3, β c 0.03 is required to satisfy C(α, β I, β c 85. Thus, for each node failure event, a cross-cluster repair bandwidth of γ c (n n I β c.7 is required. Theorem 6 provides an explicit equation for the cross-cluster repair bandwidth we need to pay, in order to reduce node capacity α. V. FURTHER COENTS & FUTURE WORKS A. Explicit coding schemes for clustered DSS According to the part I proof of Theorem, there exists an information flow graph G which has the min-cut value of C, the capacity of clustered DSS. Thus, according to [], there exists a linear network coding scheme which achieves capacity C. Although the existence of a coding scheme is verified, explicit network coding schemes which achieve capacity need to be specified for implementing practical systems. Recently, under the setting of clustered DSS modeled in the present paper, BR codes for all n, k, L, ɛ are constructed in [36] and SR codes for limited parameters are designed in [37]. Explicit code construction for general parameters and/or construction of codes that requires small field size are interesting remaining issues. B. Optimal number of clusters According to Theorem, capacity C is asymptotically a monotonically decreasing function of L, the number of clusters. Thus, reducing the number of clusters (i.e., gathering storage nodes into a smaller number of clusters increases storage capacity. However, as mentioned in Section III-E, we typically want to have a sufficiently large L, to tolerate the failure of a cluster. Then, the remaining question is in finding optimal L which not only allows sufficiently large storage capacity, but also a tolerance to cluster failures. We regard this problem as a future research topic, the solution to which will provide a guidance on the strategy for distributing storage nodes into multiple clusters. C. 
Extension to general d I, d c settings The present paper assumed a maximum helper node setting, d I n I and d c n n I, since it maximizes the capacity as stated in Proposition. However, waiting for all helper nodes gives rise to a latency issue. If we reduce the number of helper nodes d I and d c, low latency repair would be possible, while the achievable storage capacity decreases. Thus, we consider obtaining the capacity expression for general d I, d c settings, and discover the trade-off between capacity and latency for various d I, d c values. D. Scenarios of aggregating the helper data within each cluster In recent work [8], [38] on multi-rack (multi-cluster distributed storage, the authors discuss aggregation of repair data leaving a given cluster. The ideas is to allow aggregation

11 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY and compression of all helper data leaving each cluster to aid reconstruction taking place in some other cluster containing a failed node. This type of repair link aggregation has been shown to reduce the cross-cluster repair burden in [8], [38]. We expect that the same method would also change the tradeoff picture for our distributed cluster model. This is certainly an interesting and important topic to investigate, but careful analysis including the effect of security breach on links will provide a more complete assessment of the merits and potential perils of repair link aggregation. We will leave this as a future endeavor. VI. CONCLUSION This paper considered a practical distributed storage system where storage nodes are dispersed into several clusters. Noticing that the traffic burdens of intra- and cross-cluster communications are typically different, a new system model for clustered distributed storage systems is suggested. Based on the cut-set bound analysis of information flow graph, the storage capacity C(α, β I, β c of the suggested model is obtained in a closed-form, as a function of three main resources: node storage capacity α, intra-cluster repair bandwidth β I and cross-cluster repair bandwidth β c. It is shown that the asymmetric repair (β I > β c degrades capacity, which is the cost for lifting the cross-cluster repair burden. oreover, in the asymptotic regime of a large number of storage nodes, capacity is shown to be asymptotically equivalent to a monotonic decreasing function of L, the number of clusters. Thus, reducing L (i.e., gathering nodes into less clusters is beneficial for increasing capacity, although we would typically need to guarantee sufficiently large L to tolerate rack failure events. Using the capacity expression, we obtained the feasible set of (α, β I, β c triplet which satisfies C(α, β I, β c, i.e., it is possible to reliably store file by using the resource value set (α, β I, β c. The closed-form solution on the feasible set shows a different behavior depending on ɛ β c /β I, the ratio of cross- to intra-cluster repair bandwidth. It is shown that the minimum storage of α /k is achievable if and only if ɛ n k. oreover, in the special case of ɛ 0, we can construct a reliable storage system without using cross-cluster repair bandwidth. A family of network codes which enable ɛ 0, called the intra-cluster repairable codes, has been shown to be a class of the locally repairable codes defined in [3]. APPENDIX A PROOF OF THEORE Here, we prove Theorem. First, denote the right-hand-side (RHS of (5 as g i i T : min{α, ρ i β I +(n ρ i j g m β c }. (A. j m For other notations used in this proof, refer to subsection II-C. The proof proceeds in two parts. Part I. Show an information flow graph G G and a cutset c C(G such that w(g, c T : Consider the information flow graph G illustrated in Fig. 4, which is obtained by the following procedure. First, data from source node S is distributed into n nodes labeled from x to x n. As mentioned in Section II-B, the storage node x i (x i in, xi out consists of an input-node x i in and an outputnode x i out. Second, storage node x t fails and is regenerated at the newcomer node x n+t for t [k]. The newcomer node x n+t connects to n survived nodes {x m } n+t mt+ to regenerate xt. Third, data collector node DC contacts {x n+t } k t to retrieve data. This whole process is illustrated in the information flow graph G. 
To specify G, here we determine the -dimensional location of the k newcomer nodes {x n+t } k t. First, consider the -dimensional structure representation of clustered distributed storage, illustrated in Fig. 5. In this figure, each row represents each cluster, and each node is represented as a - dimensional (i, j point for i [L] and j [n I ]. The symbol N(i, j denotes the node at (i, j point. Here we define the set of n nodes, N : {N(i, j : i [L], j [n I ]}. For t [k], consider selecting the newcomer node x n+t as where x n+t N(i t, j t i t min{ν [n I ] : j t t i t m g m, ν g m t}, m (A. (A.3 (A.4 (A.5 and g m used in the method is defined in (7. The location of k newcomer nodes selected by this method are illustrated in Fig. 6. oreover, for the n, L 3, k 9 case, the newcomer nodes {x n+t } k t are depicted in Fig. 7. In these figures, the node with number t inside represents the newcomer node labeled as x n+t. For the given graph G, now we consider a cut-set c C(G defined as below. The cut-set c (U, U can be defined by specifying U and U (complement of U, which partition the set of vertices in G. First, let x i in, xi out U for i [n] and x n+i out U for i [k]. For i [k], the input node x n+i in is included in either U or U, depending on the condition specified in the next paragraph. oreover, let S U and DC U. See Fig. 8. Let U 0 n {xi out}. For t [k], let ωt be the sum of capacities of edges from U 0 to x n+t in include x n+t in the cut-set c has the cut-value of in U. Otherwise, we include x n+t in w(g, c. If α ω t, then we in U. Then, min{α, ωt }. t (A.6 All that remains is to show that (A.6 is equal to the expression in (A.. In other words, we will obtain the expression for ωt. Recall that in the generation process of G, any newcomer node x n+t connects to n helper nodes {x m } mt+ n+t to regenerate x t. Among the n helper nodes, the n I nodes

12 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 𝑛𝑛 𝑥𝑥𝑖𝑖𝑖𝑖 𝛼𝛼 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 𝛼𝛼 𝑛𝑛+ 𝑛𝑛+ 𝑥𝑥𝑖𝑖𝑖𝑖 𝛼𝛼 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 𝑛𝑛 𝑛𝑛+ 𝑥𝑥𝑖𝑖𝑖𝑖 𝑛𝑛 𝑛𝑛 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 𝛼𝛼 𝛽𝛽𝑘𝑘 𝛽𝛽𝑘𝑘 𝑛𝑛+ 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 𝑛𝑛+𝑘𝑘 𝑥𝑥𝑖𝑖𝑖𝑖 𝑛𝑛 𝑘𝑘 𝛼𝛼 DC 𝑛𝑛+𝑘𝑘 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 st cluster 𝑥𝑖𝑛 nd cluster 𝛼 𝑥𝑜𝑢𝑡 𝛼 𝑥𝑜𝑢𝑡 𝛼 𝑛+ 𝑥𝑖𝑛 𝛼 𝑛+ 𝑥𝑜𝑢𝑡 𝑛 𝑥𝑜𝑢𝑡 𝑛+ 𝑥𝑜𝑢𝑡 𝑛 𝑥𝑖𝑛 𝛼 𝑛+ 𝑥𝑖𝑛 ഥ 𝑈 Lth cluster 𝑥𝑖𝑛 S 𝑖 th cluster 𝑁(𝑖, 𝑗 ഥ 𝑈 or 𝑈 𝑈 𝑛𝐼 𝛽𝛽 Fig. 4: The information flow graph G for the part I proof of Theorem 𝑗 𝑥𝑥𝑜𝑜𝑜𝑜𝑜𝑜 𝑥𝑥𝑖𝑖𝑖𝑖 𝛼𝛼 S 𝑥𝑥𝑖𝑖𝑖𝑖 DC 𝑛+𝑘 𝑥𝑖𝑛 𝛼 𝑛+𝑘 𝑥𝑜𝑢𝑡 Fig. 8: The cut-set c (U, U for the graph G Fig. 5: -dim. structure representation 8 𝑥 𝑡0 𝑁(𝑖0, 𝑗0 𝑖 𝐿 clusters 𝑔 + 𝑡0 𝑖 𝑖0 𝑖 𝑡0 𝑘 + 𝑛𝐼 𝑔 𝑘 𝑔 +𝑔 𝑖𝐿 𝑗 𝑗 𝑗 𝑗0 𝑗 𝑛𝐼 𝑗 𝑚𝑜𝑑(𝑘, 𝑛𝐼 Fig. 6: The location of k newcomer nodes: xn+,, xn+k st cluster nd cluster 3 6 3rd cluster Fig. 7: The location of k newcomer nodes for n, L 3, k 9 case as in (. n The newcomer node xn+ connects to {xm out }m, all of in which are included in U0. Therefore, ω γ (ni βi + (n ni βc holds. Next, xn+ connects to n in n+ n } from U and one node x nodes {xm 0 out from U. Define out m3 n+m variable βlm as the repair bandwidth from xn+l. out to xin n+ Then, ω γ β. From (A.3, we have x N (, and xn+ N (,. Therefore, xn+ and xn+ are in different clusters, which result in β βc. Therefore, ω γ β γ βc. m n In general, xn+t in connects to n t nodes {xout }mt+ from n+m t U0, and t nodes {xout }m U. Thus, ωt for t [k] Pfrom t can be expressed as ωt γ l βlt where ( βi, if xn+l and xn+t are in the same cluster βlt βc, otherwise. Recall Fig. 6. For arbitrary newcomer node xn+t N (it, jt, the set {xn+m }t m contains it nodes which reside in the same cluster with xn+t, and t it nodes in other clusters. Therefore, ωt can be expressed as ωt γ (it βi (t it βc reside in the same cluster with xt, while the n ni nodes are in other clusters. From our system setting in Section III-A, the helper nodes in the same cluster as the failed node help by βi, while the helper nodes in other clusters help by βc. Therefore, the total repair bandwidth to regenerate any failed node is γ (ni βi + (n ni βc (A.7 where it is defined in (A.4. Combined with (A.7 and (A.5, we get ωt (ni it βi + (n ni t + it βc (ni it βi + (n ni jt ix t m gm + it βc.

13 TO APPEAR AT IEEE TRANSACTIONS ON INFORATION THEORY 3 Then, (A.6 can be expressed as w(g, c t T i min{α, (n I iβ I + (n n I j t i m g m + iβ c } (A.8 where T i {t [k] : i t i}. From the definition of i t in (A.4, we have i T i { m g m +, i m g m +,, i g m }. m Thus, j t,,, g i for t T i. Therefore, (A.8 can be expressed as w(g, c g i min{α, (n I iβ I + (n n I j j i m g m + iβ c }, which is identical to T in (A., where ρ i used in this equation is defined in (6. Therefore, the specified information flow graph G and the specified cut-set c satisfy ω(g, c k t min{α, ω t } T. Part II. Show that for every information flow graph G G and for every cut-set c C(G, the cut-value w(g, c is greater than or equal to T in (A.. In other words, G G, c C(G, we have w(g, c T. The proof is divided into sub-parts: Part II- and Part II-. Part II-. Show that G G, c C(G, we have w(g, c B(G, c where B(G, c is in (A.4: Consider an arbitrary information flow graph G G and an arbitrary cut-set c C(G of the graph G. Denote the cutset as c (U, U. Consider an output node x i out connected to DC. If x i out U, then the cut-value w(g, c is infinity, which is a trivial case for proving w(g, c B(G, c. Therefore, the k output nodes connected to DC are assumed to be in U. In other words, at least k output nodes exist in U. Note that every directed acyclic graph can be topologically sorted [34], where vertex u is followed by vertex v if there exists a directed edge from u to v. Consider labeling the topologically first k output nodes in U as vout,, vout. k Similar to the notation for a storage node x i (x i in, xi out in Section II-B, we denote the storage node which contains vout i as v i (vin i, vi out. Then, the set of ordered k tuples (v i k can be represented as V k {(v, v,, v k : v t N for t [k] v t v t for t t }. (A.9 We also define u i, the sum of capacities of edges from U to vin i. See Fig. 9. If vin U, then the cut-set c (U, U should include the edge from vin to v out, which has the edge capacity α. Otherwise (i.e., vin U, the cut-set c (U, U should include the edges from U to vin. If v in node is directly connected to the source node S, the cut-value w(g, c is infinity (trivial case for proving w(g, c B(G, c. Therefore, vin node is S U u i i v in Fig. 9: Arbitrary information flow graph G G and arbitrary cut-set c C(G assumed to be a newcomer node helped by n helper nodes. Note that all helper nodes of vin are in U, since v out is the topologically first output node in U. Thus, the cut-set c should include the edges from U to vin, where the sum of capacities of these edges are α v out ഥU i v out k v out u γ (n I β I + (n n I β c. If vin U, then the cut-set c (U, U should include the edge from vin to v out, which has the edge capacity α. Otherwise (i.e., vin U, the cut-set c (U, U should include the edges from U to vin. As we discussed in the case of vin, we can assume v in is a newcomer node helped by n helper nodes. Since vin is the topologically second node in U, it may have one helper node vout U; at least n helper nodes in U help to generate vin. Note that the total amount of data coming into vin is γ (n I β I +(n n I β c, while the amount of information coming from vout to vin, denoted β, is as follows: if v and v are in the same cluster, β β I, otherwise β β c. Recall that the cut-set should include the edges from U to vin. The sum of capacities of these edges are DC u γ β, while the equality holds if and only if v out helps v in. 
In a similar way, for i ∈ [k], u_i can be bounded as

u_i ≥ ω_i   (A.10)

where

ω_i := γ − Σ_{j=1}^{i−1} β̃_j^i,   (A.11)

β̃_j^i = β_I if v^j and v^i are in the same cluster, and β̃_j^i = β_c otherwise.   (A.12)

The equality in (A.10) holds if and only if v_out^j helps v_in^i for every j ∈ [i − 1]. Thus, v_out^i contributes at least min{α, ω_i} to the cut value, for i ∈ [k]. In summary, for an arbitrary graph G ∈ 𝒢, an arbitrary cut-set c ∈ C(G) has cut-value w(G, c) of at least Σ_{i=1}^{k} min{α, ω_i}:

w(G, c) ≥ Σ_{i=1}^{k} min{α, ω_i}, for all G ∈ 𝒢 and c ∈ C(G).   (A.13)

Note that {ω_i} depends on the relative positions of {v_out^i}_{i=1}^{k}, which are determined once an arbitrary information flow graph G ∈ 𝒢 and an arbitrary cut-set c ∈ C(G) are specified.

This relationship is illustrated in Fig. 10.

Fig. 10: Dependency graph for the variables in the proof of Part II.

Therefore, we define

B(G, c) = Σ_{i=1}^{k} min{α, ω_i}   (A.14)

for arbitrary G ∈ 𝒢 and arbitrary c ∈ C(G). Combining (A.13) and (A.14) completes the proof of Part II-1.

Part II-2. Show that min_{G ∈ 𝒢} min_{c ∈ C(G)} B(G, c) ≥ T: Assume that α and k are fixed. See Fig. 10. Note that for a given graph G ∈ 𝒢 and a cut-set c = (U, Ū) with c ∈ C(G), the sequence of topologically first k output nodes (v_out^i)_{i=1}^{k} in Ū is determined. Moreover, for a given sequence (v_out^i)_{i=1}^{k}, we have a fixed (ω_i)_{i=1}^{k}, which determines B(G, c) in (A.14). Thus, min_{G ∈ 𝒢} min_{c ∈ C(G)} B(G, c) can be obtained by finding the optimal sequence (v_out^i)_{i=1}^{k} which minimizes B(G, c). This is identical to finding the optimal (v^i)_{i=1}^{k}, a sequence of k different nodes out of the n existing nodes in the system. Therefore, based on (A.14) and (A.11), we have

min_{G ∈ 𝒢} min_{c ∈ C(G)} B(G, c) = min_{(v^i)_{i=1}^{k} ∈ V_k} Σ_{i=1}^{k} min{α, γ − Σ_{j=1}^{i−1} β̃_j^i}   (A.15)

where

β̃_j^i = β_I if v^j and v^i are in the same cluster, and β̃_j^i = β_c otherwise,   (A.16)

as defined in (A.12), and V_k is defined in (A.9). In order to obtain the solution of the right-hand side (RHS) of (A.15), all we need to do is to find the optimal sequence (v^i)_{i=1}^{k} of k different nodes, which can be divided into two sub-problems: i) finding the optimal way of selecting k nodes {v^i}_{i=1}^{k} out of the n nodes, and ii) finding the optimal order of the selected k nodes. Note that there are (n choose k) selection methods and k! ordering methods. Each selection method can be assigned to a selection vector s defined in Definition 1, and each ordering method can be assigned to an ordering vector π defined in Definition 2. First, we define a selection vector s for a given {v^i}_{i=1}^{k}.

Definition 1. Assume that arbitrary k nodes are selected as {v^i}_{i=1}^{k}. Label each cluster by the number of selected nodes in descending order; in other words, the 1st cluster contains a maximum number of selected nodes, and the Lth cluster contains a minimum number of selected nodes. Under this setting, define the selection vector s = [s_1, s_2, …, s_L], where s_i is the number of selected nodes in the ith cluster.

Fig. 11: Obtaining the selection vector for given k output nodes {v_out^i}_{i=1}^{k}; for the n = 15, k = 8, L = 3 example shown, s = [4, 3, 1].

Fig. 11 shows an example of the selection vector s corresponding to the selected k nodes {v^i}_{i=1}^{k}. From the definition of the selection vector, the set of possible selection vectors can be specified as follows:

S = {s = [s_1, …, s_L] : 0 ≤ s_i ≤ n_I, s_{i+1} ≤ s_i, Σ_{i=1}^{L} s_i = k}.

Note that even though (n choose k) different selections exist, the {ω_i} values in (A.11) are determined only by the corresponding selection vector s. This is because {ω_i} depends only on the relative positions of {v^i}_{i=1}^{k}, i.e., whether they are in the same cluster or in different clusters. Therefore, comparing the {ω_i} values of the |S| possible selection vectors s is enough; it is not necessary to compare the {ω_i} values of the (n choose k) selection methods. Now, we define the ordering vector π for a given selection vector s.

Fig. 12: Obtaining the ordering vector for a given arbitrary order of the k output nodes (n = 15, k = 8, L = 3).

Definition 2. Let the locations of the k nodes {v^i}_{i=1}^{k} be fixed, with a corresponding selection vector s = [s_1, …, s_L].
Then, for an arbitrary ordering of the selected k nodes, define the ordering vector π = [π_1, …, π_k], where π_i is the index of the cluster which contains v^i.

For a given s, the ordering vector π corresponding to an arbitrary ordering of the k nodes is illustrated in Fig. 12. In this figure (and the following figures in this paper), the number i written inside each node means that the node is v^i. From the definition, an ordering vector π has s_l components with value l, for all l ∈ [L]. The set of possible ordering vectors can be specified as

Π(s) = {π = [π_1, …, π_k] : Σ_{i=1}^{k} 1{π_i = l} = s_l, for all l ∈ [L]}.   (A.17)

Note that for given k selected nodes, there exist k! different ordering methods. However, the {ω_i} values in (A.11) are determined only by the corresponding ordering vector π ∈ Π(s), by reasoning similar to that used for compressing the (n choose k) selection methods into |S| selection vectors. Therefore, comparing the {ω_i} values of all possible ordering vectors π is enough; it is not necessary to compare the {ω_i} values of all k! ordering methods. Thus, finding the optimal sequence (v^i)_{i=1}^{k} is identical to specifying the optimal (s, π) pair, which is obtained as follows.

Recall that, from the definition of π = [π_1, π_2, …, π_k] in Definition 2, π_i = π_j holds if and only if v^i and v^j are in the same cluster. Therefore, β̃_j^i in (A.16) can be expressed using the π notation:

β̃_j^i(π) = β_I if π_i = π_j, and β̃_j^i(π) = β_c otherwise.   (A.18)

Thus, the Σ_{j=1}^{i−1} β̃_j^i term in (A.15) can be expressed as

Σ_{j=1}^{i−1} β̃_j^i(π) = (Σ_{j=1}^{i−1} 1{π_j = π_i}) β_I + (i − 1 − Σ_{j=1}^{i−1} 1{π_j = π_i}) β_c.   (A.19)

Combining (A.15), (A.18) and (A.19), we have

min_{G ∈ 𝒢} min_{c ∈ C(G)} B(G, c) = min_{s ∈ S} min_{π ∈ Π(s)} L(s, π)   (A.20)

where

L(s, π) = Σ_{i=1}^{k} min{α, ω_i(π)},

ω_i(π) = γ − Σ_{j=1}^{i−1} β̃_j^i(π) = a_i(π)β_I + (n − i − a_i(π))β_c,   (A.21)

a_i(π) = n_I − 1 − Σ_{j=1}^{i−1} 1{π_j = π_i}.   (A.22)

Therefore, the rest of the proof of Part II-2 shows that

min_{s ∈ S} min_{π ∈ Π(s)} L(s, π) ≥ T   (A.23)

holds. We begin by stating a property of ω_i(π) seen in (A.21).

Proposition 3. Consider a fixed selection vector s. Then, Σ_{i=1}^{k} ω_i(π) is constant irrespective of the ordering vector π ∈ Π(s).

Proof. Let s = [s_1, …, s_L]. For an arbitrary ordering vector π ∈ Π(s), let b_i(π) = n − i − a_i(π), where a_i(π) is as given in (A.22). For simplicity, we denote a_i(π), b_i(π) and ω_i(π) as a_i, b_i and ω_i, respectively. Then,

Σ_{i=1}^{k} (a_i + b_i) = Σ_{i=1}^{k} (n − i) = const.   (A.24)

for fixed n, k. Note that

Σ_{i=1}^{k} a_i = k(n_I − 1) − Σ_{i=1}^{k} Σ_{j=1}^{i−1} 1{π_j = π_i}   (A.25)

from (A.22). Also, from the definition of Π(s) in (A.17), an arbitrary ordering vector π ∈ Π(s) has s_l components with value l, for all l ∈ [L]. If we define I_l(π) = {i ∈ [k] : π_i = l}, then |I_l(π)| = s_l holds for l ∈ [L]. Then,

Σ_{i ∈ I_l(π)} Σ_{j=1}^{i−1} 1{π_j = π_i} = Σ_{t=0}^{s_l − 1} t.   (A.26)

Therefore,

Σ_{i=1}^{k} Σ_{j=1}^{i−1} 1{π_j = π_i} = Σ_{l=1}^{L} Σ_{i ∈ I_l(π)} Σ_{j=1}^{i−1} 1{π_j = π_i} = Σ_{l=1}^{L} Σ_{t=0}^{s_l − 1} t = const.

for fixed L and s. Combining with (A.25),

Σ_{i=1}^{k} a_i = k(n_I − 1) − Σ_{l=1}^{L} Σ_{t=0}^{s_l − 1} t = const.   (A.27)

for fixed n, k, L, s. From (A.24) and (A.27), we get Σ_{i=1}^{k} b_i = const. Therefore, from (A.21), Σ_{i=1}^{k} ω_i = (Σ_{i=1}^{k} a_i)β_I + (Σ_{i=1}^{k} b_i)β_c = const. for every ordering vector π ∈ Π(s), if n, k, L, β_I, β_c and s are fixed. ∎

Algorithm 1: Generate the vertical ordering π_v
Input: s = [s_1, …, s_L]
Output: π_v = [π_1, …, π_k]
Initialization: l ← 1
for i = 1 to k do
  while s_l = 0 do
    l ← (l mod L) + 1   (skip clusters with no remaining node)
  end while
  π_i ← l   (store the index of the cluster)
  s_{π_i} ← s_{π_i} − 1   (update the remaining node info.)
  l ← (l mod L) + 1   (go to the next cluster)
end for

Now, we define a special ordering vector π_v, called the vertical ordering vector, which is shown to be the optimal ordering vector π minimizing L(s, π) for an arbitrary selection vector s.

Definition 3. For a given selection vector s ∈ S, the corresponding vertical ordering vector π_v(s), or simply π_v, is defined as the output of Algorithm 1.

The vertical ordering vector π_v is illustrated in Fig. 13 for a given selection vector s as an example. For s = [4, 3, 1], Algorithm 1 produces the corresponding vertical ordering vector π_v = [1, 2, 3, 1, 2, 1, 2, 1]. The order of the k = 8 output nodes is illustrated in Fig. 13 as the numbers inside each node. Although the vertical ordering vector π_v depends on the selection vector s, we use the simplified notation π_v instead of π_v(s). From Fig. 13, obtaining π_v using Algorithm 1 can be analyzed as follows: moving from the leftmost column to the rightmost column, the algorithm selects one node per cluster. After selecting all k nodes, π_i stores the index of the cluster which contains the ith selected node.
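The following Python rendering of Algorithm 1 is a minimal sketch; the while-loop that skips exhausted clusters is our reading of the original pseudocode (with a plain if, a cluster whose count has reached zero could be picked again). It reproduces π_v = [1, 2, 3, 1, 2, 1, 2, 1] for s = [4, 3, 1].

# Python port of Algorithm 1 (generate the vertical ordering vector pi_v).
def vertical_ordering(s):
    s = list(s)              # remaining nodes per cluster, s = [s_1, ..., s_L]
    L, k = len(s), sum(s)
    pi, l = [], 0            # l is the current cluster (0-indexed here)
    for _ in range(k):
        while s[l] == 0:     # skip clusters with no remaining selected nodes
            l = (l + 1) % L
        pi.append(l + 1)     # store the (1-indexed) cluster index
        s[l] -= 1            # update the remaining node info.
        l = (l + 1) % L      # go to the next cluster
    return pi

assert vertical_ordering([4, 3, 1]) == [1, 2, 3, 1, 2, 1, 2, 1]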

Fig. 13: The vertical ordering vector π_v for the given selection vector s = [4, 3, 1] (n = 15, k = 8, L = 3).

Now, the following lemma shows that the vertical ordering vector π_v is optimal in the sense of minimizing L(s, π) for an arbitrary selection vector s.

Lemma 3. Let s ∈ S be an arbitrary selection vector. Then, a vertical ordering vector π_v minimizes L(s, π); in other words, L(s, π_v) ≤ L(s, π) holds for arbitrary π ∈ Π(s).

Proof. In the case of β_c = 0, we show that L(s, π) is constant for every π ∈ Π(s). From (A.20) and (A.21),

L(s, π) = Σ_{i=1}^{k} min{α, a_i(π)β_I}   (A.28)

holds for β_c = 0. Using I_l(π) in (A.26), note that [k] can be partitioned into L disjoint subsets as [k] = ∪_{l=1}^{L} I_l(π). Therefore, (A.28) can be written as

L(s, π) = Σ_{l=1}^{L} Σ_{i ∈ I_l(π)} min{α, a_i(π)β_I}.

Recall that π ∈ Π(s) contains s_l components with value l for l ∈ [L]. Thus, from (A.22),

{a_i(π)}_{i ∈ I_l(π)} = {n_I − 1, n_I − 2, …, n_I − s_l}

for l ∈ [L]. Therefore, Σ_{i ∈ I_l(π)} min{α, a_i(π)β_I} is constant over π ∈ Π(s) for arbitrary l ∈ [L]. In conclusion, L(s, π) in (A.28) is constant irrespective of π ∈ Π(s) for β_c = 0.

The rest of the proof deals with the β_c ≠ 0 case. For a given arbitrary s ∈ S, define two subsets of Π(s) as

Π_r = {π ∈ Π(s) : S_t(π) ≥ S_t(π') for all t ∈ [k] and all π' ∈ Π(s)},   (A.29)

Π_m = {π ∈ Π(s) : L(s, π) ≤ L(s, π') for all π' ∈ Π(s)},   (A.30)

where S_t(π) = Σ_{i=1}^{t} ω_i(π) is the running sum of ω_i(π). Here, we call Π_r the set of running sum maximizers and Π_m the set of min-cut minimizers. Now the proof proceeds in two steps. The first step proves that a running sum maximizer minimizes the min-cut, i.e., Π_r ⊆ Π_m. The second step proves that the vertical ordering vector is a running sum maximizer, i.e., π_v ∈ Π_r.

Step 1. Prove Π_r ⊆ Π_m: Define two index sets for a given ordering vector π:

Ω_L(π) = {i ∈ [k] : ω_i(π) ≥ α},  Ω_s(π) = {i ∈ [k] : ω_i(π) < α}.   (A.31)

Now define a set of ordering vectors as

Π_p = {π ∈ Π(s) : i ≤ j for all i ∈ Ω_L(π) and all j ∈ Ω_s(π)}.   (A.32)

The rest of the proof is divided into sub-steps.

Step 1-1. Prove Π_m ⊆ Π_p and Π_r ⊆ Π_p by transposition: Consider an arbitrary π = [π_1, …, π_k] ∈ Π_p^c. Use the shorthand notation ω_i to represent ω_i(π) for i ∈ [k]. From (A.32), there exist i > j such that i ∈ Ω_L(π) and j ∈ Ω_s(π). Therefore, there exists t ∈ [k − 1] such that t + 1 ∈ Ω_L(π) and t ∈ Ω_s(π) hold. By (A.31),

ω_{t+1} ≥ α > ω_t   (A.33)

for some t ∈ [k − 1]. Note that from (A.21), π_t = π_{t+1} implies ω_{t+1} = ω_t − β_I < ω_t. Therefore,

π_t ≠ π_{t+1}   (A.34)

should hold to satisfy (A.33). Define an ordering vector π' = [π'_1, …, π'_k] as

π'_i = π_i for i ≠ t, t + 1,  π'_t = π_{t+1},  π'_{t+1} = π_t.   (A.35)

Use the shorthand notation ω'_i to represent ω_i(π') for i ∈ [k]. Note that {ω'_i} satisfies

ω'_i = ω_i for i ≠ t, t + 1,  ω'_t = ω_{t+1} + β_c,  ω'_{t+1} = ω_t − β_c   (A.36)

for the following reason. First, use the simplified notations a'_i and a_i to mean a_i(π') and a_i(π), respectively. Then, using (A.22), (A.34) and (A.35), we have

a'_t = n_I − 1 − Σ_{j=1}^{t−1} 1{π'_j = π'_t} = n_I − 1 − Σ_{j=1}^{t} 1{π_j = π_{t+1}} = a_{t+1}.   (A.37)

Similarly,

a'_{t+1} = n_I − 1 − Σ_{j=1}^{t} 1{π'_j = π'_{t+1}} = n_I − 1 − Σ_{j=1}^{t−1} 1{π_j = π_t} = a_t.   (A.38)

Therefore, from (A.21), (A.37) and (A.38), we have

ω'_t = a'_t β_I + (n − t − a'_t)β_c = a_{t+1} β_I + (n − t − a_{t+1})β_c = ω_{t+1} + β_c.

Similarly, ω'_{t+1} = ω_t − β_c holds. This proves (A.36). Thus, from (A.33) and (A.36),

L(s, π) = Σ_{i=1}^{k} min{α, ω_i} = Σ_{i ∉ {t, t+1}} min{α, ω_i} + ω_t + α,

L(s, π') = Σ_{i=1}^{k} min{α, ω'_i} = Σ_{i ∉ {t, t+1}} min{α, ω'_i} + α + (ω_t − β_c) = Σ_{i ∉ {t, t+1}} min{α, ω_i} + α + (ω_t − β_c).

Therefore, L(s, π) > L(s, π') holds for β_c > 0. In other words, if π ∈ Π_p^c, then π ∈ Π_m^c. This proves that Π_m ⊆ Π_p holds for β_c ≠ 0. Similarly, Π_r ⊆ Π_p can be proved as follows. For the pre-defined ordering vectors π and π', we have S_t(π) = Σ_{i=1}^{t−1} ω_i + ω_t and S_t(π') = Σ_{i=1}^{t−1} ω_i + ω_{t+1} + β_c. Using (A.33), we have S_t(π) < S_t(π'), so that π cannot be a running sum maximizer. Therefore, Π_r ⊆ Π_p holds.

Step 1-2. Prove that L(s, π) ≤ L(s, π') for all π ∈ Π_r and all π' ∈ Π_p ∩ Π_r^c: Consider arbitrary π ∈ Π_r and π' ∈ Π_p ∩ Π_r^c. For i ∈ [k], let ω_i* and ω'_i be shorthand notations for ω_i(π) and ω_i(π'), respectively. Note that from Proposition 3,

Σ_{i=1}^{k} ω'_i = Σ_{i=1}^{k} ω_i*.   (A.39)

Let

t* = max{i ∈ [k] : ω_i* ≥ α},  t' = max{i ∈ [k] : ω'_i ≥ α}.   (A.40)

Then, from (A.29),

Σ_{i=1}^{t} ω'_i ≤ Σ_{i=1}^{t} ω_i* for every t ∈ [k].   (A.41)

Combining with (A.39), we obtain

Σ_{i=t+1}^{k} (ω'_i − ω_i*) ≥ 0 for every t ∈ [k].   (A.42)

Note that from the result of Step 1-1, both π and π' are in Π_p. Therefore, ω_i* ≥ α for i ∈ [t*] and ω_i* < α for i > t*; similarly, ω'_i ≥ α for i ∈ [t'] and ω'_i < α for i > t'. Therefore, (A.20) can be expressed as

L(s, π) = Σ_{i=1}^{k} min{α, ω_i*} = t*α + Σ_{i=t*+1}^{k} ω_i*,   (A.43)

L(s, π') = Σ_{i=1}^{k} min{α, ω'_i} = t'α + Σ_{i=t'+1}^{k} ω'_i.   (A.44)

If t' = t*, then we have L(s, π) ≤ L(s, π') from (A.42) with t = t*. If t' < t*, we get

L(s, π') − L(s, π) = Σ_{i=t'+1}^{t*} (ω_i* − α) + Σ_{i=t'+1}^{k} (ω'_i − ω_i*) ≥ 0,

since ω_i* ≥ α for i ≤ t* by (A.40), and since the second sum is non-negative by (A.42) with t = t'. In the case of t' > t*, write

L(s, π') − L(s, π) = Σ_{i=t*+1}^{k} (ω'_i − ω_i*) − Σ_{i=t*+1}^{t'} (ω'_i − α).

Applying (A.42) with t = t' gives Σ_{i=t*+1}^{k} (ω'_i − ω_i*) ≥ Σ_{i=t*+1}^{t'} (ω'_i − ω_i*), and since ω_i* < α for i > t* by (A.40), we have Σ_{i=t*+1}^{t'} (ω'_i − ω_i*) ≥ Σ_{i=t*+1}^{t'} (ω'_i − α). Combining the two bounds yields L(s, π') − L(s, π) ≥ 0. In summary, L(s, π) ≤ L(s, π') irrespective of the values of t* and t', which completes the proof of Step 1-2.

From the results of Step 1-1 and Step 1-2, the relationship between the sets can be depicted as in Fig. 14. Consider π_0 ∈ Π_r and π_1 ∈ Π_m ∩ Π_r^c. Then, L(s, π_0) ≤ L(s, π_1) from the result of Step 1-2. Based on the definition of Π_m in (A.30), we can write L(s, π_0) ≤ L(s, π) for every π ∈ Π(s). In other words, π_0 ∈ Π_m holds for arbitrary π_0 ∈ Π_r. Therefore, Π_r ⊆ Π_m holds.

Fig. 14: Relationship between the sets Π(s), Π_p, Π_r and Π_m.
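Step 1 (and Proposition 3) can be checked by brute force on a small instance. The sketch below uses hypothetical parameters (n = 9, L = 3, and example bandwidth values; none are from the paper): it enumerates every π ∈ Π(s), evaluates ω_i(π) via (A.21) and (A.22) and L(s, π) via (A.20), and confirms that Σ_i ω_i(π) is ordering-invariant and that the vertical ordering attains the minimum of L(s, π), as Lemma 3 asserts.

# Brute-force check of Proposition 3 and Lemma 3 on a small example (assumed values).
from itertools import permutations

n, L = 9, 3
n_I = n // L
beta_I, beta_c, alpha = 3.0, 1.0, 6.0
s = [3, 2, 1]                      # a selection vector with k = 6

def omegas(pi):                    # omega_i(pi) from (A.21)-(A.22)
    out = []
    for i, c in enumerate(pi):     # i = 0, ..., k-1
        a = n_I - 1 - pi[:i].count(c)
        out.append(a * beta_I + (n - (i + 1) - a) * beta_c)
    return out

def L_val(pi):                     # L(s, pi) from (A.20)
    return sum(min(alpha, w) for w in omegas(pi))

base = [l + 1 for l, cnt in enumerate(s) for _ in range(cnt)]
Pi_s = set(permutations(base))     # all ordering vectors in Pi(s)

sums = {round(sum(omegas(p)), 9) for p in Pi_s}
assert len(sums) == 1              # Proposition 3: the total is ordering-invariant

pi_v = (1, 2, 3, 1, 2, 1)          # vertical ordering for s = [3, 2, 1]
assert abs(L_val(pi_v) - min(L_val(p) for p in Pi_s)) < 1e-9   # Lemma 3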

Step 2. Prove π_v ∈ Π_r: For a given selection vector s = [s_1, …, s_L], consider an arbitrary ordering vector π = [π_1, …, π_k] ∈ Π(s). The corresponding ω_i(π) defined in (A.21) can be written as

ω_i = (n_I − j)β_I + (n − i − n_I + j)β_c   (A.45)

where j = 1 + Σ_{t=1}^{i−1} 1{π_t = π_i}. Consider a set of lines {l_j}_{j=1}^{n_I}, where line l_j represents the equation ω_i = (n_I − j)β_I + (n − i − n_I + j)β_c. Since we assume β_I ≥ β_c, these lines can be illustrated as in Fig. 15.

Fig. 15: The set of n_I lines l_j : ω_i = (n_I − j)β_I + (n − i − n_I + j)β_c on which the (i, ω_i) points can be positioned.

For a given π, consider marking the point (i, ω_i) for each i ∈ [k]. Note that the point (i, ω_i) is on line l_j if and only if

j = 1 + Σ_{t=1}^{i−1} 1{π_t = π_i},   (A.46)

where the summation term in (A.46) represents the number of occurrences of the value π_i in {π_t}_{t=1}^{i−1}. For the example in Fig. 13, when s = [4, 3, 1] and π = [1, 2, 3, 1, 2, 1, 2, 1] ∈ Π(s), line l_3 contains the point (i, ω_i) = (6, ω_6), since 3 = 1 + Σ_{t=1}^{5} 1{π_t = π_6}. Recall that I_l(π) = {i ∈ [k] : π_i = l}, as defined in (A.26), where |I_l(π)| = s_l holds for all l ∈ [L]. For j ∈ [n_I], consider l ∈ [L] with s_l ≥ j, and let i_0 be the jth smallest element in I_l(π). Then, j = 1 + Σ_{t=1}^{i_0 − 1} 1{π_t = l} and π_{i_0} = l hold. Thus, the point (i_0, ω_{i_0}) is on line l_j. Proceeding in this manner, we can find

p_j = Σ_{l=1}^{L} 1{s_l ≥ j}   (A.47)

points on line l_j, irrespective of the ordering vector π ∈ Π(s). Note that

Σ_{j=1}^{n_I} p_j = Σ_{j=1}^{n_I} Σ_{l=1}^{L} 1{s_l ≥ j} = Σ_{l=1}^{L} s_l = k,   (A.48)

which confirms that Fig. 15 contains k points. Moreover, for all j ∈ [n_I − 1],

p_j ≥ p_{j+1}   (A.49)

holds from the definition in (A.47). In order to maximize the running sum S_t(π) = Σ_{i=1}^{t} ω_i(π) for every t, the optimal ordering vector packs the p_1 points in the leftmost area (i = 1, …, p_1), packs the p_2 points in the leftmost remaining area (i = p_1 + 1, …, p_1 + p_2), and so on. This packing method corresponds to Fig. 16.

Fig. 16: Optimal packing of the k points (i, ω_i).

Note that, from the definition of p_j in (A.47) and Fig. 13, the vertical ordering π_v in Definition 3 first chooses the p_1 points on line l_1, then chooses the p_2 points on line l_2, and so on. Thus, π_v achieves the optimal packing in Fig. 16, which maximizes the running sum S_t(π) for every t. Therefore, the vertical ordering is a running sum maximizer, i.e., π_v ∈ Π_r.

Combining Steps 1 and 2, we conclude that π_v minimizes L(s, π) among π ∈ Π(s) for arbitrary s ∈ S. ∎

Now, we define a special selection vector called the horizontal selection vector, which is shown to be the optimal selection vector minimizing L(s, π_v).

Definition 4. The horizontal selection vector s_h = [s_1, …, s_L] ∈ S is defined as: s_i = n_I for i ≤ ⌊k/n_I⌋; s_i = (k mod n_I) for i = ⌊k/n_I⌋ + 1; and s_i = 0 for i > ⌊k/n_I⌋ + 1.

The graphical illustration of the horizontal selection vector is on the left side of Fig. 17, for the case of n = 15, k = 8, L = 3.

Fig. 17: The optimal selection vector s_h = [5, 3, 0] and the optimal ordering vector π_v(s_h) = [1, 2, 1, 2, 1, 2, 1, 1] (for the n = 15, k = 8, L = 3 case).

The following lemma states that the horizontal selection vector minimizes L(s, π_v).

Lemma 4. Consider applying the vertical ordering vector π_v. Then, the horizontal selection vector s_h minimizes the lower bound L(s, π_v) on the min-cut. In other words, L(s_h, π_v) ≤ L(s, π_v) for all s ∈ S.

Proof. From the proof of Lemma 3, the optimal ordering vector turns out to be the vertical ordering vector, where the corresponding ω_i(π_v) sequence is illustrated in Fig. 16. Depending on the selection vector s = [s_1, …, s_L], the number p_j of points on each line l_j changes. Consider an arbitrary selection vector s. Define a point vector p(s) = [p_1, …, p_{n_I}], where p_j is the number of points on l_j, as defined in (A.47). Similarly, define p(s_h) = [p'_1, …, p'_{n_I}]. Using Definition 4 and (A.47), we have

p'_j = ⌊k/n_I⌋ + 1 if j ≤ (k mod n_I), and p'_j = ⌊k/n_I⌋ otherwise.   (A.50)

Now, we prove that, for all t ∈ [n_I],

Σ_{j=1}^{t} p_j ≥ Σ_{j=1}^{t} p'_j.   (A.51)

The proof is divided into two steps: the base case and the inductive step.

Base case: We wish to prove that p_1 ≥ p'_1.
Suppose p'_1 > p_1 (i.e., p_1 ≤ p'_1 − 1). Then,

Σ_{l=1}^{n_I} p_l ≤ n_I p_1 ≤ n_I (p'_1 − 1)   (A.52)

where the first inequality is from (A.49). Note that if (k mod n_I) = 0, then

Σ_{l=1}^{n_I} p'_l = n_I p'_1 > n_I (p'_1 − 1).   (A.53)

Otherwise,

Σ_{l=1}^{n_I} p'_l = (k mod n_I)(⌊k/n_I⌋ + 1) + (n_I − (k mod n_I))⌊k/n_I⌋ = n_I ⌊k/n_I⌋ + (k mod n_I) > n_I (p'_1 − 1).   (A.54)

Therefore, combining (A.52), (A.53) and (A.54) results in Σ_{l=1}^{n_I} p_l < Σ_{l=1}^{n_I} p'_l = k, which contradicts (A.48). Therefore, p_1 ≥ p'_1 holds.

Inductive step: Assume that Σ_{l=1}^{l_0} p_l ≥ Σ_{l=1}^{l_0} p'_l for an arbitrary l_0 ∈ [n_I − 1]. Now we prove that Σ_{l=1}^{l_0+1} p_l ≥ Σ_{l=1}^{l_0+1} p'_l holds. Suppose not. Then,

p_{l_0+1} ≤ p'_{l_0+1} − Θ − 1   (A.55)

holds, where

Θ = Σ_{l=1}^{l_0} (p_l − p'_l) ≥ 0.   (A.56)

Using (A.49) and (A.55), we have

Σ_{l=l_0+1}^{n_I} p_l ≤ (n_I − l_0) p_{l_0+1} ≤ (n_I − l_0)(p'_{l_0+1} − Θ − 1) ≤ (n_I − l_0)(p'_{l_0+1} − 1) − Θ   (A.57)

where the last inequality holds with equality if and only if l_0 = n_I − 1. Using analysis similar to (A.53) and (A.54) for the base case, we can find that (n_I − l_0)(p'_{l_0+1} − 1) < Σ_{l=l_0+1}^{n_I} p'_l. Combining with (A.57), we get

Σ_{l=l_0+1}^{n_I} p_l < Σ_{l=l_0+1}^{n_I} p'_l − Θ.   (A.58)

Equations (A.56) and (A.58) imply Σ_{l=1}^{n_I} p_l < Σ_{l=1}^{n_I} p'_l = k, which contradicts (A.48). Therefore, (A.51) holds.

Now define

f_i = min{s ∈ [n_I] : Σ_{l=1}^{s} p_l ≥ i},   (A.59)

h_i = min{s ∈ [n_I] : Σ_{l=1}^{s} p'_l ≥ i}   (A.60)

for i ∈ [k]. Then, for all i ∈ [k],

h_i ≥ f_i   (A.61)

holds directly from (A.51). Note that since p'_i in (A.50) is identical to g_i in (7), h_i can be written as

h_i = min{s ∈ [n_I] : Σ_{l=1}^{s} g_l ≥ i}.   (A.62)

Consider π_v(s), the vertical ordering vector for a given selection vector s. Recall that, as in Fig. 16, the vertical ordering packs the leftmost p_1 points on line l_1, packs the next p_2 points on line l_2, and so on. Using (A.45), we can write

ω_i = (n_I − 1)β_I + (n − i − n_I + 1)β_c, if i ∈ [p_1];
ω_i = (n_I − 2)β_I + (n − i − n_I + 2)β_c, if i − p_1 ∈ [p_2];
⋮
ω_i = 0 · β_I + (n − i)β_c, if i − Σ_{t=1}^{n_I − 1} p_t ∈ [p_{n_I}].

Therefore, using (A.59), we further write

ω_i(π_v(s)) = (n_I − f_i)β_I + (n − i − n_I + f_i)β_c.   (A.63)

Similarly,

ω_i(π_v(s_h)) = (n_I − h_i)β_I + (n − i − n_I + h_i)β_c.   (A.64)

Combining (A.61), (A.63) and (A.64), we have

ω_i(π_v(s_h)) ≤ ω_i(π_v(s)) for i = 1, …, k and all s ∈ S,

since we assume β_I ≥ β_c. Therefore, combining with (A.20), we conclude that L(s_h, π_v) ≤ L(s, π_v) for arbitrary s ∈ S, which completes the proof of Lemma 4. ∎

From Lemmas 3 and 4, we have L(s_h, π_v) ≤ L(s, π) for all s ∈ S and all π ∈ Π(s). All that remains is to compute L(s_h, π_v) and check that it is identical to T in (A.1). From (A.20), L(s_h, π_v) can be written as

L(s_h, π_v) = Σ_{i=1}^{k} min{α, ω_i(π_v(s_h))}   (A.65)

where ω_i(π_v(s_h)) is defined in (A.64). From h_i in (A.62), we have

h_i = 1 if i ∈ [g_1]; h_i = 2 if i − g_1 ∈ [g_2]; …; h_i = n_I if i − Σ_{t=1}^{n_I − 1} g_t ∈ [g_{n_I}].   (A.66)

If we define I_m = {i ∈ [k] : h_i = m}, then L(s_h, π_v) in (A.65) can be expressed as

L(s_h, π_v) = Σ_{m=1}^{n_I} Σ_{i ∈ I_m} min{α, (n_I − h_i)β_I + (n − n_I − i + h_i)β_c}
 = Σ_{m=1}^{n_I} Σ_{i ∈ I_m} min{α, (n_I − m)β_I + (n − n_I − i + m)β_c}
 = Σ_{m=1}^{n_I} Σ_{i ∈ I_m} min{α, ρ_m β_I + (n − ρ_m − i)β_c}   (A.67)

where ρ_m is defined in (6). Using (A.66), we have

I_m = {Σ_{t=1}^{m−1} g_t + 1, …, Σ_{t=1}^{m−1} g_t + g_m}

for m ∈ [n_I]. Therefore, i ∈ I_m can be represented as i = Σ_{t=1}^{m−1} g_t + l for l ∈ [g_m]. Using this notation, (A.67) can be written as

L(s_h, π_v) = Σ_{m=1}^{n_I} Σ_{l=1}^{g_m} min{α, ρ_m β_I + (n − ρ_m − s_m^{(l)})β_c}   (A.68)

where s_m^{(l)} = Σ_{t=1}^{m−1} g_t + l. This expression reduces to (A.1), which completes the proof of Part II-2. Therefore, the storage capacity of the clustered DSS is as stated in Theorem 1. ∎

APPENDIX B
PROOF OF THEOREM 3

We begin by introducing properties of the parameters z_t and h_t defined in (5) and (6).

Proposition 4.

h_k = n_I,   (B.1)
z_k = (n − k)ε.   (B.2)

Proof. Since we consider the k > n_I case, as stated in (3), we have g_i ≥ 1 for all i ∈ [n_I] for {g_i}_{i=1}^{n_I} defined in (7). Combining this with (9) and (6), we can conclude that h_k = n_I. Finally, z_k = (n − k)ε follows from (5) and (B.1). ∎

First, consider the ε ≥ 1/(n − k) case. In this case, data of size M can be reliably stored with node storage α = M/k if the repair bandwidth satisfies γ ≥ γ*, where

γ* = (M − (k − 1)(M/k))/s_{k−1} = (M/k)/s_{k−1} = M(n_I − 1 + (n − n_I)ε)/(k(n − k)ε),

where the last equality is from (3) and (B.2). Thus, α = M/k is achievable with finite γ when ε ≥ 1/(n − k).

Second, we prove that it is impossible to achieve α = M/k for 0 ≤ ε < 1/(n − k) while reliably storing the file of size M. Recall that the minimum storage for 0 ≤ ε < 1/(n − k) is

α* = M/(τ + Σ_{i=τ+1}^{k} z_i)   (B.3)

from (8). From the definition of τ and (B.2), we have z_i < 1 for i = τ + 1, τ + 2, …, k. Therefore,

τ + Σ_{i=τ+1}^{k} z_i < τ + (k − (τ + 1) + 1) = k

holds, which results in

α* = M/(τ + Σ_{i=τ+1}^{k} z_i) > M/k.   (B.4)

Thus, the 0 ≤ ε < 1/(n − k) case has minimum storage α* greater than M/k, which completes the proof of Theorem 3.

APPENDIX C
PROOF OF THEOREM 4

First, we prove (30). To begin with, we obtain the expressions of α_msr^(ε) for ε = 0, 1. From (8), we obtain

α_msr^(0) = M/(τ + Σ_{i=τ+1}^{k} z_i),   (C.1)
α_msr^(1) = M/k.

We further simplify the expression for α_msr^(0) as follows. Recall that

z_t = n_I − h_t   (C.2)

for t ∈ [k] from (5), when ε = 0 holds. Note that we have

z_t = 0 for t ≥ k − ⌊k/n_I⌋ + 1, and z_t ≥ 1 for t ≤ k − ⌊k/n_I⌋,   (C.3)

for the following reason. First, from (6) and (C.2), z_t = 0 holds for t ≥ Σ_{l=1}^{n_I−1} g_l + 1, and Σ_{l=1}^{n_I−1} g_l = k − g_{n_I} = k − ⌊k/n_I⌋, where the last equality is from (9) and (7). Similarly, we can prove that z_t ≥ 1 holds for t ≤ k − ⌊k/n_I⌋. From (C.3) and the definition of τ, we obtain

τ = k − ⌊k/n_I⌋   (C.4)

when ε = 0. Combining (C.1), (C.3) and (C.4), we have

α_msr^(0) = M/(k − ⌊k/n_I⌋).   (C.5)

Then, using R = k/n and σ = L/n,

α_msr^(0)/α_msr^(1) = k/(k − ⌊k/n_I⌋) = nR/(nR − ⌊RL⌋) = nR/(nR − ⌊Rnσ⌋) → 1/(1 − σ)

as n → ∞. Thus, for arbitrary fixed R and σ, α_msr^(0) is asymptotically equivalent to α_msr^(1)/(1 − σ), which proves (30).
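A quick numerical sketch of this step (with hypothetical values of M, R and σ, chosen only for illustration): it computes α_msr^(0) = M/(k − ⌊k/n_I⌋) and α_msr^(1) = M/k from (C.5) and checks the ratio against 1/(1 − σ) as n grows with R = k/n and σ = L/n fixed.

# Numerical sketch for the proof of (30) (assumed example values).
from math import floor

M, R, sigma = 1.0, 0.5, 0.1         # file size, rate k/n, cluster ratio L/n
for n in [100, 1000, 10000]:
    k, L = int(R * n), int(sigma * n)
    n_I = n // L
    a0 = M / (k - floor(k / n_I))   # alpha_msr at eps = 0, from (C.5)
    a1 = M / k                      # alpha_msr at eps = 1
    print(n, a0 / a1, 1 / (1 - sigma))   # ratio approaches 1/(1 - sigma)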

Second, we prove (31). Note that from (9), α_mbr^(ε) = γ_mbr^(ε) holds for arbitrary 0 ≤ ε ≤ 1. Therefore, all we need to prove is the corresponding statement for γ_mbr^(0) and γ_mbr^(1). To begin with, we obtain the expressions of γ_mbr^(ε) for ε = 0, 1. For ε = 1, z_t in (5) is

z_t = n − t   (C.6)

for t ∈ [k]. Moreover, from (3),

s_0 = Σ_{i=1}^{k} z_i/(n − 1)   (C.7)

for ε = 1. Therefore, from (9), (C.6) and (C.7),

γ_mbr^(1) = M/s_0 = (n − 1)M/Σ_{i=1}^{k} (n − i) = 2(n − 1)M/(k(2n − k − 1)).   (C.8)

Now we focus on the case of ε = 0. First, let q and r be

q := ⌊k/n_I⌋,   (C.9)
r := (k mod n_I),   (C.10)

which represent the quotient and the remainder of k/n_I. Note that

q n_I + r = k.   (C.11)

Then, we have

Σ_{t=1}^{n_I} t g_t = Σ_{t=1}^{r} t(q + 1) + Σ_{t=r+1}^{n_I} t q = q n_I(n_I + 1)/2 + r(r + 1)/2   (C.12)

by direct computation from (7). From (C.2) and (A.66), we have

Σ_{i=1}^{k} z_i = Σ_{t=1}^{n_I} (n_I − t) g_t = n_I k − Σ_{t=1}^{n_I} t g_t = ((n_I − 1)k + r(n_I − r))/2   (C.13)

where the last equality uses (C.11) and (C.12). Moreover, since r(n_I − r) ≥ 0,

Σ_{i=1}^{k} z_i ≥ k(n_I − 1)/2   (C.14)

where the equality holds if and only if r = 0. Furthermore,

s_0 = Σ_{i=1}^{k} z_i/(n_I − 1)   (C.15)

for ε = 0 from (3). Combining (9), (C.13), (C.14) and (C.15) results in

γ_mbr^(0) = M/s_0 ≤ 2M/k.   (C.16)

From (C.8) and (C.16), we have

γ_mbr^(0)/γ_mbr^(1) ≤ (2M/k) · k(2n − k − 1)/(2(n − 1)M) = (2n − k − 1)/(n − 1) = (n(2 − R) − 1)/(n − 1),

where R = k/n. Thus, for arbitrary n, γ_mbr^(0)/γ_mbr^(1) → 1 as R → 1. This completes the proof of (31). Finally, (C.8) and (C.16) provide

γ_mbr^(0)/γ_mbr^(1) ≤ 2 − (k − 1)/(n − 1),

which completes the proof of (32).

APPENDIX D
PROOF OF THEOREM 5

Recall the definition of an (n, l_0, m_0, M, α)-LRC, which appears right before Theorem 5. Moreover, recall that the repair locality of a code is defined as the number of nodes to be contacted in the node repair process [3]. Since each cluster contains n_I nodes, every node in a DSS with ε = 0 has the repair locality of

l_0 = n_I − 1.   (D.1)

Moreover, note that for any code with minimum distance m, the original file can be retrieved by contacting n − m + 1 coded symbols [38]. Since the present paper considers DSSs such that contacting any k nodes can retrieve the original file, we have the minimum distance of

m_0 = n − k + 1.   (D.2)

Thus, the intra-cluster repairable code defined in Section IV-C is an (n, l_0, m_0, M, α)-LRC. Now we show that (34) holds. Note that α = α_msr^(0) holds for ε = 0, where

α_msr^(0) = M/(k − ⌊k/n_I⌋)   (D.3)

holds according to (C.5). Thus, (34) is proven by showing

m_0 ≤ n − M/α_msr^(0) − ⌈M/(l_0 α_msr^(0))⌉ + 2.   (D.4)

By plugging (D.1), (D.2) and (D.3) into (D.4), we have

n − k + 1 ≤ n − (k − ⌊k/n_I⌋) − ⌈(k − ⌊k/n_I⌋)/(n_I − 1)⌉ + 2.

Therefore, all we need to prove is

0 ≤ ⌊k/n_I⌋ − ⌈(k − ⌊k/n_I⌋)/(n_I − 1)⌉ + 1,   (D.5)

which is proved as follows. Using q and r defined in (C.9) and (C.10), we have k − ⌊k/n_I⌋ = q(n_I − 1) + r, so that

⌈(k − ⌊k/n_I⌋)/(n_I − 1)⌉ = q + ⌈r/(n_I − 1)⌉ = q if r = 0, and q + 1 if r ≠ 0.

Thus, the RHS of (D.5) is 1 for r = 0 and 0 for r ≠ 0, so (D.5) holds, where the equality condition is r ≠ 0, or equivalently (k mod n_I) ≠ 0. Therefore, (34) holds, where the equality condition is α = α_msr^(0) and (k mod n_I) ≠ 0. This completes the proof of Theorem 5.
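The combinatorial step (D.5) can be sanity-checked numerically. The sketch below (with hypothetical n, n_I and k ranges, not taken from the paper) evaluates both sides of the minimum-distance bound at α = α_msr^(0) and confirms that equality occurs exactly when k mod n_I ≠ 0.

# Sanity check of (D.4)-(D.5) (assumed example values).
from math import ceil, floor

n = 60
for n_I in [4, 5, 6]:
    for k in range(n_I + 1, 31):           # the paper assumes k > n_I
        q, r = divmod(k, n_I)
        M_over_alpha = k - floor(k / n_I)  # M / alpha_msr^(0), an integer
        l0, m0 = n_I - 1, n - k + 1
        bound = n - M_over_alpha - ceil(M_over_alpha / l0) + 2
        assert m0 <= bound                       # (34) holds
        assert (m0 == bound) == (r != 0)         # equality iff k mod n_I != 0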

APPENDIX E
PROOF OF THEOREM 6

According to (A.20) and (A.68) in Appendix A, the capacity can be expressed as

C = Σ_{i=1}^{k} min{α, ω_i(π_v(s_h))}.   (E.1)

Using (A.64) and (A.66), ω_i(π_v(s_h)) in (E.1), or simply ω_i, has the following property for i ∈ [k − 1]:

ω_{i+1} = ω_i − β_I if i ∈ I_G, and ω_{i+1} = ω_i − β_c otherwise,   (E.2)

where I_G = {Σ_{m=1}^{l} g_m : l ∈ [n_I]}. Note that g_{n_I} = ⌊k/n_I⌋ from (7). Therefore, k_0 in (37) can be expressed as

k_0 = k − ⌊k/n_I⌋ = k − g_{n_I} ≤ k − 1   (E.3)

where the last inequality is from (3). Combining (6) and (E.3) results in

h_{k_0} = n_I − 1.   (E.4)

From (A.64), (E.3) and (E.4), we have

ω_{k_0} = (n_I − h_{k_0})β_I + (n − k_0 − n_I + h_{k_0})β_c = β_I + (n − k_0 − 1)β_c ≥ β_I + (n − k)β_c ≥ β_I = α,   (E.5)

where the last equality holds due to the assumption of β_I = α in the setting of Theorem 6. Since (ω_i)_{i=1}^{k} is a decreasing sequence from (E.2), the result of (E.5) implies that

ω_i ≥ α for all i ≤ k_0.   (E.6)

Thus, the capacity expression in (E.1) can be expressed as

C = k_0 α + Σ_{i=k_0+1}^{k} ω_i, if ω_{k_0+1} ≤ α;
C = mα + Σ_{i=m+1}^{k} ω_i, if ω_{m+1} ≤ α < ω_m (k_0 + 1 ≤ m ≤ k − 1);
C = kα, if α < ω_k.   (E.7)

Note that from (A.64) and (A.66), we have ω_i = (n − i)β_c for i = k_0 + 1, k_0 + 2, …, k. Therefore, C in (E.7) is

C = k_0 α + β_c Σ_{i=k_0+1}^{k} (n − i), if 0 ≤ β_c ≤ α/(n − k_0 − 1);
C = mα + β_c Σ_{i=m+1}^{k} (n − i), if α/(n − m) < β_c ≤ α/(n − m − 1) (k_0 + 1 ≤ m ≤ k − 1);
C = kα, if α/(n − k) < β_c,   (E.8)

which is illustrated as a piecewise linear function of β_c in Fig. 18.

Fig. 18: Capacity as a function of β_c, a piecewise linear curve through the points T_{k_0}, T_{k_0+1}, …, T_k.

Based on (E.8), the sequence (T_m)_{m=k_0}^{k} in this figure has the following expression:

T_m = k_0 α, for m = k_0;
T_m = (m + Σ_{i=m+1}^{k} (n − i)/(n − m))α, for k_0 + 1 ≤ m ≤ k − 1;
T_m = kα, for m = k.   (E.9)

From Fig. 18, we can conclude that C ≥ M holds if and only if β_c ≥ β_c*, where

β_c* = 0, for M ∈ [0, T_{k_0}];
β_c* = (M − mα)/Σ_{i=m+1}^{k} (n − i), for M ∈ (T_m, T_{m+1}] (m = k_0, k_0 + 1, …, k − 1);
and no feasible β_c* exists for M ∈ (T_k, ∞).   (E.10)

Using T_m in (E.9) and f_m in (36), (E.10) reduces to (35), which completes the proof.
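The piecewise-linear structure (E.8)-(E.10) can be sketched numerically. The example below uses hypothetical values (n = 15, L = 3, k = 8, M = 7.2, and β_I = α as assumed in Theorem 6): it computes C(β_c) = Σ_i min{α, ω_i} with ω_i from (A.64) and h_i from (A.62), and locates the smallest β_c with C ≥ M by bisection, which can then be compared against the closed form (E.10).

# Numerical sketch of the threshold beta_c* in the setting of Theorem 6 (assumed values).
n, L, k = 15, 3, 8
n_I = n // L
alpha = 1.0
beta_I = alpha                       # Theorem 6 setting: beta_I = alpha
q, r = divmod(k, n_I)
g = [q + 1 if i < r else q for i in range(n_I)]     # g_i from (7)

def h(i):                            # h_i from (A.62), i = 1, ..., k
    acc = 0
    for s_idx, gs in enumerate(g, start=1):
        acc += gs
        if acc >= i:
            return s_idx

def capacity(beta_c):                # C = sum_i min{alpha, omega_i}, omega_i from (A.64)
    return sum(min(alpha, (n_I - h(i)) * beta_I + (n - n_I - i + h(i)) * beta_c)
               for i in range(1, k + 1))

M = 7.2                              # target file size (assumed), M <= k * alpha
lo, hi = 0.0, alpha                  # bisection for the smallest beta_c with C >= M
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if capacity(mid) >= M else (mid, hi)
print("beta_c* ~", hi)               # here ~ (M - k_0*alpha)/sum_{i>k_0}(n-i) = 0.2/7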

APPENDIX F
PROOFS OF COROLLARIES

A. Proof of Corollary 1

From the proof of Theorem 1, the capacity expression is equal to (A.67), which is

C = Σ_{i=1}^{k} min{α, (n_I − h_i)β_I + (n − n_I − i + h_i)β_c},

where h_i is defined in (A.60). Using ε = β_c/β_I and (2), this can be rewritten as

C = Σ_{i=1}^{k} min{α, ((n − n_I − i + h_i)ε + (n_I − h_i)) γ/((n − n_I)ε + (n_I − 1))}.   (F.1)

Using {z_t} and {y_t} defined in (5) and (4), the capacity expression reduces to

C(α, γ) = Σ_{t=1}^{k} min{α, γ/y_t},   (F.2)

which is a continuous function of γ.

Remark. {z_t} in (5) is a decreasing sequence in t. Moreover, {y_t} in (4) is an increasing sequence.

Proof. Note that from (A.66),

h_{t+1} = h_t + 1 if t ∈ T, and h_{t+1} = h_t if t ∈ [k − 1] \ T,

where T = {g_1, g_1 + g_2, …, Σ_{m=1}^{n_I} g_m}. Therefore, {z_t} in (5) is a decreasing function of t, which implies that {y_t} is an increasing sequence. ∎

Fig. 29: Capacity as a function of γ for the 1/(n − k) ≤ ε ≤ 1 case; the region γ > γ̄ is invalid.

Fig. 30: Capacity as a function of γ for the 0 ≤ ε < 1/(n − k) case; the region γ > γ̄ is invalid.

Moreover, note that β_I ≤ α holds from the definitions of β_I and α in Table I. Thus, combined with ε = β_c/β_I, γ in (2) is bounded as

γ = (n_I − 1)β_I + (n − n_I)β_c = {(n_I − 1) + (n − n_I)ε}β_I ≤ {(n_I − 1) + (n − n_I)ε}α.

Here, we define γ̄ := {(n_I − 1) + (n − n_I)ε}α. Then, the valid region of γ is expressed as γ ≤ γ̄, as illustrated in Figs. 29 and 30. The rest of the proof depends on the range of ε values; we first consider the 1/(n − k) ≤ ε ≤ 1 case, and then consider the 0 ≤ ε < 1/(n − k) case.

If 1/(n − k) ≤ ε ≤ 1: Using (B.2), z_k = (n − k)ε ≥ 1 holds. Combining with (4), we have y_k ≤ (n_I − 1) + ε(n − n_I), or equivalently, y_k α ≤ γ̄. If y_t α < γ ≤ y_{t+1} α for some t ∈ [k − 1], then (F.2) can be expressed as

C(α, γ) = tα + γ Σ_{m=t+1}^{k} (1/y_m) = tα + γ (Σ_{m=t+1}^{k} z_m)/((n_I − 1) + (n − n_I)ε) = tα + s_t γ,

where {s_t} is defined in (3). If 0 ≤ γ ≤ y_1 α, then

C(α, γ) = γ Σ_{m=1}^{k} (1/y_m) = γ (Σ_{m=1}^{k} z_m)/((n_I − 1) + (n − n_I)ε) = s_0 γ.

If y_k α < γ ≤ γ̄, then C(α, γ) = Σ_{m=1}^{k} α = kα. In summary, the capacity is

C(α, γ) = s_0 γ, for 0 ≤ γ ≤ y_1 α;
C(α, γ) = tα + s_t γ, for y_t α < γ ≤ y_{t+1} α (t = 1, 2, …, k − 1);
C(α, γ) = kα, for y_k α < γ ≤ γ̄,   (F.3)

which is illustrated in Fig. 29. Since {z_t} is a decreasing sequence from the Remark above, we have z_t ≥ z_k = (n − k)ε > 0 for t ∈ [k]. Thus, {s_t}_{t=1}^{k} defined in (3) is a monotonically decreasing, non-negative sequence. This implies that the curve in Fig. 29 is a monotonically increasing function of γ. From Fig. 29, it is shown that C(α, γ) ≥ M holds if and only if γ ≥ γ*(α). From (F.3), the threshold value γ*(α) can be expressed as

γ*(α) = M/s_0, for M ∈ [0, E_1];
γ*(α) = (M − tα)/s_t, for M ∈ (E_t, E_{t+1}] (t = 1, 2, …, k − 1);
and no feasible γ*(α) exists for M ∈ (E_k, ∞),   (F.4)

where

E_t = C(α, y_t α) = (t − 1 + s_{t−1} y_t)α   (F.5)

for t ∈ [k]. The threshold value γ*(α) in (F.4) coincides with the closed-form threshold stated in the corollary for this case, which completes the proof.

Otherwise (if 0 ≤ ε < 1/(n − k)): Using (B.2),

z_k = (n − k)ε < 1   (F.6)

holds. Since {z_t} is a decreasing sequence from the Remark above, there exists τ ∈ {0, 1, …, k − 1} such that z_{τ+1} < 1 ≤ z_τ holds, or equivalently, y_τ α ≤ γ̄ < y_{τ+1} α. Using analysis similar to the 1/(n − k) ≤ ε ≤ 1 case, we obtain

C(α, γ) = s_0 γ, for 0 ≤ γ ≤ y_1 α;
C(α, γ) = tα + s_t γ, for y_t α < γ ≤ y_{t+1} α (t = 1, 2, …, τ − 1);
C(α, γ) = τα + s_τ γ, for y_τ α < γ ≤ γ̄,   (F.7)

which is illustrated in Fig. 30. From Fig. 30, it is shown that C(α, γ) ≥ M holds if and only if γ ≥ γ*(α). From (F.7), the threshold value γ*(α) can be expressed as

γ*(α) = M/s_0, for M ∈ [0, E_1];
γ*(α) = (M − tα)/s_t, for M ∈ (E_t, E_{t+1}] (t = 1, 2, …, τ − 1);
γ*(α) = (M − τα)/s_τ, for M ∈ (E_τ, E];
and no feasible γ*(α) exists for M ∈ (E, ∞),   (F.8)

where {E_t} is defined in (F.5), and

E = C(α, γ̄) = (τ + s_τ {(n_I − 1) + (n − n_I)ε})α.

The threshold value γ*(α) in (F.8) coincides with the closed-form threshold stated in the corollary for this case, which completes the proof.
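The threshold γ*(α) can be cross-checked numerically. The sketch below (with hypothetical values of n, L, k, ε and M, chosen only for illustration) builds z_t and y_t from h_t as in (5), (4) and (A.66), evaluates C(α, γ) from (F.2), and inverts it by bisection for the smallest γ with C ≥ M inside the valid region γ ≤ γ̄.

# Numerical sketch of C(alpha, gamma) in (F.2) and the threshold gamma*(alpha).
n, L, k = 15, 3, 8
n_I = n // L
alpha, eps = 1.0, 0.5               # eps = beta_c / beta_I (here eps >= 1/(n - k))
q, r = divmod(k, n_I)
g = [q + 1 if i < r else q for i in range(n_I)]     # g_i from (7)

def h(t):                            # h_t from (A.62)
    acc = 0
    for s_idx, gs in enumerate(g, start=1):
        acc += gs
        if acc >= t:
            return s_idx

D = (n_I - 1) + (n - n_I) * eps      # gamma = D * beta_I
z = [(n_I - h(t)) + (n - n_I - t + h(t)) * eps for t in range(1, k + 1)]
y = [D / zt for zt in z]             # y_t from (4); increasing in t

def C(gamma):                        # capacity from (F.2)
    return sum(min(alpha, gamma / yt) for yt in y)

gamma_bar = D * alpha                # valid region: gamma <= gamma_bar
M = 6.0                              # target file size (assumed)
assert C(gamma_bar) >= M             # otherwise M is not storable at this alpha
lo, hi = 0.0, gamma_bar
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (lo, mid) if C(mid) >= M else (mid, hi)
print("gamma*(alpha) ~", hi)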


More information

Permutation Codes: Optimal Exact-Repair of a Single Failed Node in MDS Code Based Distributed Storage Systems

Permutation Codes: Optimal Exact-Repair of a Single Failed Node in MDS Code Based Distributed Storage Systems Permutation Codes: Optimal Exact-Repair of a Single Failed Node in MDS Code Based Distributed Storage Systems Viveck R Cadambe Electrical Engineering and Computer Science University of California Irvine,

More information

Decentralized Detection In Wireless Sensor Networks

Decentralized Detection In Wireless Sensor Networks Decentralized Detection In Wireless Sensor Networks Milad Kharratzadeh Department of Electrical & Computer Engineering McGill University Montreal, Canada April 2011 Statistical Detection and Estimation

More information

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Mohammad H. Yarmand and Douglas G. Down Department of Computing and Software, McMaster University, Hamilton, ON, L8S

More information

Constructions of Optimal Cyclic (r, δ) Locally Repairable Codes

Constructions of Optimal Cyclic (r, δ) Locally Repairable Codes Constructions of Optimal Cyclic (r, δ) Locally Repairable Codes Bin Chen, Shu-Tao Xia, Jie Hao, and Fang-Wei Fu Member, IEEE 1 arxiv:160901136v1 [csit] 5 Sep 016 Abstract A code is said to be a r-local

More information

Design of MMSE Multiuser Detectors using Random Matrix Techniques

Design of MMSE Multiuser Detectors using Random Matrix Techniques Design of MMSE Multiuser Detectors using Random Matrix Techniques Linbo Li and Antonia M Tulino and Sergio Verdú Department of Electrical Engineering Princeton University Princeton, New Jersey 08544 Email:

More information

LIMITATION OF LEARNING RANKINGS FROM PARTIAL INFORMATION. By Srikanth Jagabathula Devavrat Shah

LIMITATION OF LEARNING RANKINGS FROM PARTIAL INFORMATION. By Srikanth Jagabathula Devavrat Shah 00 AIM Workshop on Ranking LIMITATION OF LEARNING RANKINGS FROM PARTIAL INFORMATION By Srikanth Jagabathula Devavrat Shah Interest is in recovering distribution over the space of permutations over n elements

More information

Codes for Partially Stuck-at Memory Cells

Codes for Partially Stuck-at Memory Cells 1 Codes for Partially Stuck-at Memory Cells Antonia Wachter-Zeh and Eitan Yaakobi Department of Computer Science Technion Israel Institute of Technology, Haifa, Israel Email: {antonia, yaakobi@cs.technion.ac.il

More information

Coordination. Failures and Consensus. Consensus. Consensus. Overview. Properties for Correct Consensus. Variant I: Consensus (C) P 1. v 1.

Coordination. Failures and Consensus. Consensus. Consensus. Overview. Properties for Correct Consensus. Variant I: Consensus (C) P 1. v 1. Coordination Failures and Consensus If the solution to availability and scalability is to decentralize and replicate functions and data, how do we coordinate the nodes? data consistency update propagation

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 6

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 6 CS 70 Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 6 Error Correcting Codes We will consider two situations in which we wish to transmit information on an unreliable channel. The

More information

Cooperative Communication with Feedback via Stochastic Approximation

Cooperative Communication with Feedback via Stochastic Approximation Cooperative Communication with Feedback via Stochastic Approximation Utsaw Kumar J Nicholas Laneman and Vijay Gupta Department of Electrical Engineering University of Notre Dame Email: {ukumar jnl vgupta}@ndedu

More information

Error Exponent Region for Gaussian Broadcast Channels

Error Exponent Region for Gaussian Broadcast Channels Error Exponent Region for Gaussian Broadcast Channels Lihua Weng, S. Sandeep Pradhan, and Achilleas Anastasopoulos Electrical Engineering and Computer Science Dept. University of Michigan, Ann Arbor, MI

More information

The k-neighbors Approach to Interference Bounded and Symmetric Topology Control in Ad Hoc Networks

The k-neighbors Approach to Interference Bounded and Symmetric Topology Control in Ad Hoc Networks The k-neighbors Approach to Interference Bounded and Symmetric Topology Control in Ad Hoc Networks Douglas M. Blough Mauro Leoncini Giovanni Resta Paolo Santi Abstract Topology control, wherein nodes adjust

More information

Binary Compressive Sensing via Analog. Fountain Coding

Binary Compressive Sensing via Analog. Fountain Coding Binary Compressive Sensing via Analog 1 Fountain Coding Mahyar Shirvanimoghaddam, Member, IEEE, Yonghui Li, Senior Member, IEEE, Branka Vucetic, Fellow, IEEE, and Jinhong Yuan, Senior Member, IEEE, arxiv:1508.03401v1

More information

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS C. BESSENRODT AND S. VAN WILLIGENBURG Abstract. Confirming a conjecture made by Bessenrodt and Kleshchev in 1999, we classify

More information