SPREAD: An Adaptive Scheme for Redundant and Fair Storage in Dynamic Heterogeneous Storage Systems

Mario Mense, Christian Scheideler

Mario Mense: Heinz Nixdorf Institut, University of Paderborn, Germany. Email: vodisek@upb.de
Christian Scheideler: Institut für Informatik, Technische Universität München, Germany. Email: scheideler@in.tum.de

Abstract

In this paper we study the problem of designing an adaptive hash table for redundant data storage in a system of storage devices with arbitrary capacities. Ideally, such a hash table should make sure that (a) a storage device with x% of the available capacity gets x% of the data, (b) the copies of each data item are distributed among the storage devices so that no two copies are stored at the same device, and (c) only a near-minimum amount of data replacement is necessary to preserve (a) and (b) under any change in the system. Hash tables satisfying (a) and (c) are already known, and it is not difficult to construct hash tables satisfying (a) and (b). However, no hash table is known so far that can satisfy all three properties as long as this is in principle possible. We present a strategy called SPREAD that solves this problem for the first time. As long as (a) and (b) can in principle be satisfied, SPREAD preserves (a) for every storage device within a $(1 \pm \epsilon)$ factor, with high probability, where $\epsilon > 0$ can be made arbitrarily small, guarantees (b) for every data item, and only needs a constant factor more data replacements than the minimum possible in order to preserve (a) and (b).

1 Introduction

In this paper, we consider the problem of designing a hash table for a system of heterogeneous storage devices so that the following conditions are met:

1. Fairness: Every storage device is assigned a fair share of the total data load, or more precisely, a storage device with x% of the available capacity should get x% of the data. Besides the data, the fairness condition should also hold for the read/write requests.

2. Efficiency: The total space for storing control information of the hash table should only depend on the number of storage devices and not on their differences in storage capacity, and the time for computing the position of a copy of any data item should be at most logarithmic in the system size.

3. Redundancy: The copies of each data item are distributed among the storage devices so that no two copies are stored on the same device.

4. Adaptivity: The system should be able to preserve the conditions above (as long as this is in principle possible) under any change in the system (including changes in the number or sizes of the storage devices or in the number of data items) with a near-minimum amount of data movement.

Adaptivity is measured by applying competitive analysis. That is, for an operation $\omega$ representing any change in the system, we compare the number of (re-)placements of copies performed by the given placement scheme with the number of (re-)placements of copies performed by an optimal strategy that ensures that, after every operation, the fairness and redundancy conditions are satisfied. A storage strategy is called c-adaptive concerning operation $\omega$ if, in expectation, it requires the (re-)placement of at most c times the number of copies an optimal strategy would need for $\omega$.

Hash tables with these properties have many interesting applications. Consider, for example, a storage system that consists of a large collection of disks. Over time, old disks have to be replaced by new disks, or new disks have to be added in order to provide sufficient capacity to the users. In this case, it would be desirable to just buy any disk available on the market and plug it into the system instead of making sure that it has the same capacity or performance as some other disk in the system (so that standard striping strategies such as those used in the RAID standard can be applied for redundant data storage). This calls for storage strategies that are fair, so that capacity or performance differences among the disks can be taken into account. Redundancy is also important for disk systems since disks can fail, and adaptivity ensures that the work for adding or removing disks can be kept small so that the system has a high availability.
The adaptivity property can also be used to take full control of the overhead of integrating a new disk into the system by controlling the speed at which the disk capacity declared to the system grows from 0 to its actual capacity, without the system ever being in an intermediate or unsafe state. Another application would be distributed web caches (like the Akamai system) or peer-to-peer systems. Even though, in theory, many designs for peer-to-peer hash tables have focused on distributing the data load evenly among the peers, in practice this will not be the best solution simply because the peers differ in reliability, bandwidth and the storage capacity they can or are willing to offer to the peer-to-peer system.

1.1 Previous work.

Before we present a hash table that can satisfy all of the conditions above, we present previous work in this area and discuss why it is so difficult to adapt these approaches so that all conditions are met.

Uniform Capacities. If all storage devices have the same capacity, it is not difficult to satisfy all conditions above if the only changes allowed in the system are to add or remove storage devices or data items. A prominent strategy here is consistent hashing [8] which (in its basic form) uses two hash functions, $g : U \to [0, 1)$ and $h : V \to [0, 1)$, where U is the address space of the data items and V is the address space of the storage devices, or nodes. Every data item $d \in U$ is stored at the node v with h(v) being closest from above to g(d) (where [0, 1) is seen as a modulo 1 ring). Other adaptive hash table methods for uniform storage devices have been presented in [3, 15], but like consistent hashing, these are hard to extend to the non-uniform case.

Non-Uniform Capacities. The first adaptive hash tables that can handle storage devices of arbitrary capacities, called SHARE and SIEVE, were introduced in [4]. SHARE, for example, uses two stages to reduce the problem of managing non-uniform nodes to that of managing uniform nodes, so that consistent hashing can be applied. Let n be the number of nodes in the system and let $c_1, \ldots, c_n \in [0, 1]$ denote their relative capacities, i.e., $\sum_i c_i = 1$.
For a hash table to be adaptive, it has to be able to handle any capacity change from $(c_1, \ldots, c_n)$ to $(c'_1, \ldots, c'_n)$. Certainly, any storage strategy that wants to preserve fairness has to replace at least a

$$\sum_{i:\, c_i > c'_i} (c_i - c'_i) = \frac{1}{2} \sum_i |c_i - c'_i|$$

fraction of the data in the system. Hence, the following fact is true, which has been used to bound the adaptivity of SHARE and SIEVE.

Fact 1.1. If, for any change from $(c_1, \ldots, c_n)$ to $(c'_1, \ldots, c'_n)$, a storage strategy only needs to replace a $d \sum_i |c_i - c'_i|$-fraction of the data in the system, then it is 2d-adaptive w.r.t. capacity changes.

For any capacity distribution, SHARE and SIEVE are efficient, 2-adaptive and fair within a $(1 \pm \epsilon)$ factor, where the constant $\epsilon > 0$ can be made arbitrarily small [4]. However, neither of the two strategies can also be made redundant under all capacity distributions that, in principle, allow a fair distribution of the data. Even the case of 2 copies for each data item is hard. Consider the following simple example. We have three nodes with capacities $c_1 = 1/2$, $c_2 = 1/4$ and $c_3 = 1/4$, and each data item is supposed to have 2 copies stored at different nodes. For a fair distribution of the copies, we must restrict ourselves to the combinations (node 1, node 2) and (node 1, node 3) for the 2 copies of a data item. If we also allow the combination (node 2, node 3), fairness cannot be achieved any more. Hence, in order to achieve fairness and redundancy at the same time, we have to deal with the problem of forbidding certain combinations (or more generally, of carefully selecting node combinations and giving them appropriate weights). SHARE and SIEVE do not offer an easy way of doing this. In [16], a related strategy called DHHT was introduced, but this strategy cannot handle redundancy together with fairness either.

Other Approaches. In [11], Litwin et al. describe LH* schemes, a class of scalable distributed data structures (SDDS) which are derived from linear hashing [5]. Several LH* variants incorporate fault-tolerance features, such as mirroring, checksums, or Reed-Solomon codes (see e.g. [1, 9, 10, 12]). Depending on the overflow policy, space utilization is more or less fine, but the LH* variants provide no mechanism to cope with non-uniform disk capacities. In [6, 7], Honicky and Miller introduce a family of pseudo-random algorithms that provide a decentralized mapping of replicated objects to a collection of disks that can be grouped into clusters of homogeneous disks. These clusters are assumed to be of sufficient size so that the copies of each data item assigned to a cluster can be stored on different disks. In this case, redundancy and fairness can be achieved, but there is no obvious way of extending their scheme to disks of arbitrary capacities so that these properties still hold. Recently, Brinkmann et al. [2] proposed the first scheme for systems of disks with arbitrary capacities that is fair and redundant, but it is only $\Theta(r^2)$-adaptive in the worst case for r copies per data item, whereas we are aiming at a constant adaptivity that is independent of r.
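The lower bound on data replacement and the three-node example from Section 1.1 can be checked with a few lines of exact arithmetic. The following is only an illustrative sketch: the function names are hypothetical, nodes are indexed from 0, and `node_shares` assumes r = 2 copies per data item.

```python
from fractions import Fraction


def min_replaced_fraction(old, new):
    """Fraction of the data that any fair strategy must move:
    the total capacity lost by the nodes whose share shrinks."""
    return sum(o - n for o, n in zip(old, new) if o > n)


def half_l1_distance(old, new):
    """The equivalent form (1/2) * sum_i |c_i - c'_i|."""
    return sum(abs(o - n) for o, n in zip(old, new)) / 2


def node_shares(pair_weights, n):
    """Share of all stored copies that each node receives when every
    data item picks the node pair (a, b) with probability pair_weights[(a, b)].
    Assumes r = 2, so each chosen pair contributes half a share per node."""
    shares = [Fraction(0)] * n
    for (a, b), weight in pair_weights.items():
        shares[a] += weight / 2
        shares[b] += weight / 2
    return shares


capacities = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

# Only the pairs containing node 0 are used: the shares match the capacities.
fair = node_shares({(0, 1): Fraction(1, 2), (0, 2): Fraction(1, 2)}, 3)

# All three pairs equally likely: every node gets 1/3, which is unfair here.
uniform = node_shares(
    {(0, 1): Fraction(1, 3), (0, 2): Fraction(1, 3), (1, 2): Fraction(1, 3)}, 3
)
```

Any positive weight on the pair of the two small nodes takes load away from the large node, which is why that combination has to be forbidden in this example.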

1.2 Our contributions.

In this paper, we introduce a storage strategy named SPREAD with the following performance.

Theorem 1.1. Consider the problem of storing r copies for each data item in a fair and redundant way in a system of heterogeneous storage devices. SPREAD solves this problem in an efficient way while being O(1)-adaptive for any capacity change in the system as long as at any time, $c_i \le 1/r$ for every node i.

Note that the condition that $c_i \le 1/r$ for every node i is necessary for being redundant and fair, and therefore SPREAD works in any case in which redundancy and fairness can, in principle, be achieved. The rest of the paper is devoted to the proof of the theorem.

2 The SPREAD Strategy

In this section, we describe and analyze the SPREAD strategy. We first describe the basic framework of SPREAD, leaving out various details about how to achieve fairness, redundancy and adaptivity. Afterwards, we bound the time- and space-efficiency of SPREAD (Section 2.2), describe how to achieve fairness and redundancy (Section 2.3), and show how to make SPREAD O(1)-adaptive (Sections 2.4 and 2.5). At the end, we discuss how to extend SPREAD to the case that the data items have different levels of redundancy.

2.1 The basic scheme.

In the following, V represents the set of all storage system (or node) identifiers and U represents the set of all data identifiers. The parameter r represents the redundancy required for the data items in the system. For any history of changes in the system up to some given time point, let n be the maximum number of nodes that have been in the system at the same time. When using the strategy that every node newly entering the system obtains the lowest available identifier $\ge 1$, we can make sure that the nodes will be numbered in a range from 1 to n. Let $c_1, \ldots, c_n \in [0, 1]$ represent the relative capacities of the nodes at some given time point (i.e., $\sum_i c_i = 1$). (Note that the capacity of any currently unoccupied identifier is equal to 0, but it will nevertheless be part of the SPREAD data structure.)
SPREAD needs three (pseudo-)random hash functions: a hash function $h : V \to [0, 1)$ for the nodes and two hash functions $g_1, g_2 : U \to [0, 1)$ for the data items. The interval [0, 1) will be considered as a modulo 1 ring. As in SHARE [4], SPREAD makes use of a stretch factor $s = \sigma \log N$, where $N = |V|$ (or, respectively, the maximum number of nodes the system can expect) and $\sigma$ is a sufficiently large constant that is chosen so that $s \in \mathbb{N}$.

For each node v we identify an interval $I(v) = [h(v),\ h(v) + s \cdot r \cdot c_v \bmod 1)$ of length $s \cdot r \cdot c_v$ in the [0, 1)-ring. h(v) is called its starting point and $h(v) + s \cdot r \cdot c_v \bmod 1$ its endpoint. If $s \cdot r \cdot c_v > 1$, we consider I(v) to be wrapped around [0, 1) $s \cdot r \cdot c_v$ many times.

The SPREAD data structure maintains a partition of the [0, 1) interval into n frames $F_1, \ldots, F_n$ where $F_v$ starts at h(v) and ends at the closest successor of h(v) among the points $\{h(1), \ldots, h(n)\} \setminus \{h(v)\}$ (where [0, 1) is treated as a ring). Each frame $F_v$ is further partitioned into subframes. We will explain later how these subframes are chosen. For now, we just mention that the subframe decomposition only depends on n and $h(1), \ldots, h(n)$ and not on the capacities of the nodes, and as n grows, some of these subframes may be partitioned into smaller subframes.

For each subframe f, SPREAD aims at maintaining one (and sometimes two, as will be explained later) $(s/\alpha) \times r$-table $T_f$ of $r \cdot s/\alpha$ slots which are organized into $s/\alpha$ groups (or columns) of size r each, where $\alpha > 0$ is a small constant that is selected so that a certain degree of fairness is maintained. Every slot is assigned to (i.e., owned by) exactly one node, and the assignment has to be chosen so that within each group of r slots, each slot is assigned to a different node. Then we can use the following strategy in order to ensure redundancy. Whenever we want to read or overwrite the copies of some data item d, we perform the following steps. First, we identify the unique subframe f with $g_1(d) \in f$. Then we pick group $\lceil (s/\alpha) \cdot g_2(d) \rceil$ in $T_f$ and either read or store the copies of d in the r nodes owning the slots in this group.
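The two-step lookup just described (find the subframe via $g_1$, then the group via $g_2$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `unit_hash` uses SHA-256 as a stand-in for the hash functions $g_1$ and $g_2$, the subframes are given by their sorted starting points with the first one at 0, groups are indexed from 0, and all names are hypothetical.

```python
import hashlib
from bisect import bisect_right


def unit_hash(key, salt):
    """Map a key to [0, 1); a stand-in for the hash functions g1 and g2."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    # Keep 53 bits so the quotient is an exact double strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def locate_copies(item, starts, tables):
    """Return the r nodes that hold the copies of `item`.

    starts: sorted starting points of the subframes, with starts[0] == 0.0
    tables: tables[k] is the table of subframe k, given as a list of groups,
            each group being a list of r distinct node identifiers
    """
    # Subframe containing g1(item): the last start that is <= g1(item).
    k = bisect_right(starts, unit_hash(item, "g1")) - 1
    groups = tables[k]
    # Group selected by g2(item), as a 0-based index.
    j = min(int(len(groups) * unit_hash(item, "g2")), len(groups) - 1)
    return groups[j]


# Two subframes, each with a table of two groups of r = 2 slots.
starts = [0.0, 0.5]
tables = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
owners = locate_copies("block-17", starts, tables)
```

With a binary search over the subframe starting points and an array per table, the lookup costs O(log m) for the search plus O(r) for reading the group, in line with the bound stated in Lemma 2.1.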
Given that the hash functions can be evaluated in constant time and there are m subframes in the system, standard data structures such as search trees and arrays can be used to obtain the following result.

Lemma 2.1. The lookup and insert operations of SPREAD can be implemented with runtime $O(r + \log m)$ and space $O(m (r \cdot s/\alpha) \log n)$.

Hence, as long as m is close to linear in n, SPREAD is time- and space-efficient. In the following subsections, we show that there is a way of maintaining the tables so that the fairness, redundancy and adaptivity conditions are also satisfied.

2.2 Subframe management.

Ideally, we would like to have the following subframe decomposition for each frame $F_v$. For the subframes $f_0, f_1, f_2, \ldots$ following $F_v$ in the [0, 1)-ring (in ascending direction) it holds that $|f_i| = \epsilon (1 + \epsilon)^i |F_v|$ for some fixed $\epsilon > 0$. If this were true, we could approximate any interval I(v) with $|I(v)| \ge |F_v|$ by an interval $I'(v)$ ending at a starting point of some subframe $f_i$ so that $|I'(v)|$ is within $(1 \pm \epsilon) |I(v)|$. In fact, the following lemma holds.

Lemma 2.2. Suppose that k is the maximum value so that $|F_v| + \sum_{i=0}^{k-1} |f_i| \le |I(v)|$. Then $|F_v| + \sum_{i=0}^{k-1} |f_i| \ge (1 - \epsilon) |I(v)|$ and $|F_v| + \sum_{i=0}^{k} |f_i| \le (1 + \epsilon) |I(v)|$.

Proof. It holds that

$$|F_v| + \sum_{i=0}^{k-1} |f_i| = \Big(1 + \epsilon \sum_{i=0}^{k-1} (1 + \epsilon)^i\Big) |F_v| = (1 + \epsilon)^k |F_v|$$

for any $k \ge 0$. Hence, for $|F_v| + \sum_{i=0}^{k-1} |f_i| \le |I(v)| \le |F_v| + \sum_{i=0}^{k} |f_i|$ we get that

$$|F_v| + \sum_{i=0}^{k-1} |f_i| \ge \frac{1}{1 + \epsilon} |I(v)| = \Big(1 - \frac{\epsilon}{1 + \epsilon}\Big) |I(v)| \ge (1 - \epsilon) |I(v)|$$

and $|F_v| + \sum_{i=0}^{k} |f_i| \le (1 + \epsilon) |I(v)|$.

Thus, we can either decide to round I(v) down to the starting point of the subframe containing its endpoint or up to the starting point of the next subframe in order to obtain an interval $I'(v)$ whose size is within $(1 \pm \epsilon) |I(v)|$. If $|I(v)|$ is at most $|F_v|$, we identify $I'(v)$ with I(v), i.e., we do not round I(v).

Of course, choosing a subframe decomposition so that there are subframes $f_0, f_1, f_2, \ldots$ after each $F_v$ of size $|f_i| = \epsilon (1 + \epsilon)^i |F_v|$ is not possible, but it is sufficient to cut each $F_w$ into subframes so that the following condition is true for every $F_v$:

Subframe condition: For any two frames or subframes f and f', let $\Delta(f, f')$ be the distance (in ascending direction along the [0, 1)-ring) from the endpoint of f to the starting point of f'. For every $w \in \{1, \ldots, n\} \setminus \{v\}$ and every subframe f in $F_v$ we require that $|f| \le \epsilon (1 + \epsilon)^k |F_w|$ where $k \in \mathbb{N}_0$ is maximum possible so that $\epsilon \sum_{i=1}^{k-1} (1 + \epsilon)^i |F_w| \le \Delta(F_w, f)$.

If this condition is true, it is easy to see that we can still round I(v) down to the starting point of the subframe containing its endpoint or up to the starting point of the next subframe in order to obtain an interval $I'(v)$ whose size is within $(1 \pm \epsilon) |I(v)|$.

In order to satisfy the subframe condition, we use the following decomposition strategy. Given $h(1), \ldots, h(n)$, we start with a single subframe representing the entire [0, 1) interval, and we keep cutting subframes in half until the subframe condition is true everywhere (when considering each subframe crossing $b \ge 1$ frame borders as being cut into b + 1 subframes at these borders).
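The rounding guarantee of Lemma 2.2 can be checked numerically for the ideal decomposition, in which the subframe borders after $F_v$ lie at distances $(1 + \epsilon)^k |F_v|$ from h(v). The sketch below is illustrative only; the function name is hypothetical and it does not model the halving-based decomposition used in practice.

```python
import math


def round_interval(length, frame, eps):
    """Round an interval length to the neighbouring ideal subframe borders.

    Returns (down, up) with down <= length < up, where both values have the
    form (1 + eps)**k * frame. Lengths of at most `frame` are not rounded.
    """
    if length <= frame:
        return length, length
    k = math.floor(math.log(length / frame, 1 + eps))
    # Guard against floating-point error in the logarithm.
    while (1 + eps) ** k * frame > length:
        k -= 1
    while (1 + eps) ** (k + 1) * frame <= length:
        k += 1
    down = (1 + eps) ** k * frame
    up = (1 + eps) ** (k + 1) * frame
    return down, up


down, up = round_interval(length=0.37, frame=0.01, eps=0.1)
```

Because consecutive borders differ by exactly a factor of $1 + \epsilon$, whichever of the two values is chosen deviates from the true length by at most a factor of $1 \pm \epsilon$.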
The halving strategy leads to a unique decomposition into subframes (for any given $h(1), \ldots, h(n)$) with the property that for every subframe f that is not the first or last subframe in some $F_v$, $|f| = 1/2^k$ for some $k \in \mathbb{N}_0$. Moreover, whenever n is increased, the decomposition strategy will only cause some subframes to be cut into smaller subframes, which turns out to be useful for the adaptivity of SPREAD. The following two lemmas show that, w.h.p., our decomposition rule does not create too many subframes.

Lemma 2.3. The number of frames $F_v$ in the system with $|F_v| \le \delta/n$ is bounded by $2\delta n + O(\sqrt{n})$, w.h.p., and the smallest size of a frame $F_v$ is at least $1/n^k$ for some constant k, w.h.p. (see footnote 1).

Proof. For every node v let the random variable $X_v$ be equal to h(v). Given a fixed $\delta$, let the function $f(X_1, \ldots, X_n)$ represent the number of frames of size at most $\delta/n$. It certainly holds that $|f(x_1, \ldots, x_n) - f(x'_1, \ldots, x'_n)| \le 2$ whenever $(x_1, \ldots, x_n)$ and $(x'_1, \ldots, x'_n)$ differ in at most one coordinate. Moreover, it holds for any v and $\delta \le 1$ that

$$\Pr[|F_v| \le \delta/n] = 1 - (1 - \delta/n)^{n-1} \le (n - 1)\delta/n \le \delta,$$

where the first inequality follows from Bernoulli's inequality $(1 - x)^{n-1} \ge 1 - (n - 1)x$. This implies that $E[f] \le \delta n$. Hence, it follows from the method of bounded differences (e.g., [13]) that for any $d \ge 0$,

$$\Pr[f \ge \delta n + d] \le e^{-d^2/(2n)}.$$

Choosing $d = \delta n + \Theta(\sqrt{n})$ results in the first part of the lemma. The second part holds because for any v, $\Pr[|F_v| \le 1/n^k] \le 1/n^{k-1}$, and therefore the probability that there exists a v with $|F_v| \le 1/n^k$ is polynomially small in n.

Lemma 2.4. The number of subframes in the system is bounded by $O(n/\epsilon)$, w.h.p.

Proof. For any $i \in \mathbb{N}_0$, let $S_i$ be the set of all frames with a size between $2^{-(i+1)}/n$ and $2^{-i}/n$. For each frame $F \in S_i$, we need at most $k = \log_{1+\epsilon}(1/(n |F|)) \le \log_{1+\epsilon} 2^{i+1} = O(i/\epsilon)$ subframes $f_0, f_1, \ldots, f_k$ in the ideal setting until $|f_k| = \epsilon (1 + \epsilon)^k |F| \ge \epsilon/n$. Hence, it follows from Lemma 2.3 that the total number of subframes needed is bounded by

$$\sum_{i=0}^{\log n} (2n/2^i) \cdot O(i/\epsilon) + O(\sqrt{n}) \cdot O(\log n/\epsilon) + O(n/\epsilon) = O(n/\epsilon),$$

w.h.p.

Next, we explain how to manage the tables.

2.3 Table management.
For each subframe f, C(f) represents a multiset of nodes w with $f \subseteq I(w)$. The number of times a node w occurs in C(f) is called its multiplicity in C(f) and denoted as $\mu_f(w)$. Initially, $\mu_f(w)$ is set to the number of times I(w) crosses the starting point of f, which is in $\{\lfloor s r c_w \rfloor, \lceil s r c_w \rceil\}$. Hence, given that $c_i \le 1/r$ for every i (which is necessary to maintain fairness), it holds that $\mu_f(w) \in \{0, \ldots, s\}$. Afterwards, we use the following rules for every subframe $f \in F_v$ and $w \ne v$:

Rounding conditions:

1. Whenever the endpoint of I(w) moves beyond the endpoint of f for the first time after crossing the starting point of f, $\mu_f(w)$ is increased by 1.

2. Whenever the endpoint of I(w) moves below the starting point of f for the first time after crossing the endpoint of f (or after w has been introduced to the system), $\mu_f(w)$ is decreased by 1.

Footnote 1: In the following, "w.h.p." or "with high probability" means a probability of at least $1 - 1/n^c$ for any constant $c > 0$.

The same rules are also applied if w = v and $|I(v)| > |F_v|$. For the special case that w = v and $|I(v)| \le |F_v|$, $\mu_f(w)$ is defined as the number of times I(w) crosses the starting point of f.

With these rules it holds that whenever the endpoint of I(w) is outside of f, then $\mu_f(w)$ is equal to the number of times I(w) crosses f, and otherwise $\mu_f(w) \in \{\lfloor s r c_w \rfloor, \lceil s r c_w \rceil\}$. Hence, together with the fact that no interval can start inside of f, the numbers of times intervals cross f at its left and right endpoints are an upper and a lower bound on $|C(f)|$. This allows us to prove the following property.

Lemma 2.5. For every subframe f, $|C(f)|$ is within $(1 \pm \beta) r s$, w.h.p., where the constant $\beta > 0$ can be made arbitrarily small depending on the constant $\sigma$ in s.

Proof. The arguments above imply that, in order to bound $|C(f)|$ for any f, it suffices to consider any point $x \in [0, 1)$ representing a starting point or endpoint of some interval I(v). We first consider the starting point of an interval I(v). Consider some fixed node v with starting point h(v). For any node w in the system (including v), let the random variable $X_w$ be defined as $X_w = \lfloor s r c_w \rfloor + Y_w$ where the binary random variable $Y_w$ is 1 if and only if I(w) contains h(v) $\lceil s r c_w \rceil$ many times. Furthermore, let $X = \sum_w X_w$ and $Y = \sum_w Y_w$. It certainly holds that $E[X_w] = s r c_w$ for all $w \ne v$ and that $X = \lceil s r c_v \rceil + \sum_{w \ne v} X_w$. Hence,

$$E[X] = \lceil s r c_v \rceil + s r (1 - c_v) \in [r s,\ r s + 1].$$

Moreover, $E[Y] \le E[X]$, and since the starting points of the intervals are chosen independently at random and the $Y_w$ are binary random variables, it follows from the Chernoff bounds (e.g., [14]) that

$$\Pr\big[|Y - E[Y]| \ge \beta E[Y]\big] \le e^{-\beta^2 E[Y]/(2(1 + \beta/3))}$$

for any $\beta > 0$. Hence, X is within $(1 \pm \beta) r s$, w.h.p., where the constant $\beta > 0$ can be made arbitrarily small depending on the constant $\sigma$ in s. For the endpoint of an interval I(v), it is easy to see that $E[X] \in [r s - 1,\ r s]$.
Hence, the same deviation bounds can also be shown for the endpoints.

For $W = \{1, \ldots, n\}$ let $W_l$ be the set of nodes $v \in W$ with $c_v \ge 1/((s - 1) r)$ and $W_s = W \setminus W_l$. Nodes $v \in W_l$ are guaranteed to have $\mu_f(v) \ge 1$ in every subframe f. For any subframe f, let $C_l(f) = C(f)|_{W_l}$ (i.e., the multiset containing only those nodes in C(f) that are also in $W_l$) and $C_s(f) = C(f)|_{W_s}$. Let $a_l = \sum_{v \in W_l} c_v$ and $a_s = 1 - a_l$. A more refined version of Lemma 2.5 is as follows.

Lemma 2.6. For any subframe f and any subset $W'_l$ of $W_l$, $|C'_l(f)|$ is within $r s a'_l \pm \beta |W'_l| + O(\log n)$, w.h.p., where $C'_l(f) = C_l(f)|_{W'_l}$ and $a'_l = \sum_{v \in W'_l} c_v$. Furthermore, $|C_s(f)|$ is within $(1 \pm \beta) r s a_s + O(\log n)$, w.h.p. In both cases, the constant $\beta > 0$ can be made arbitrarily small depending on the constant $\sigma$ in s.

Proof. Recall the definition of $X_w$ and $Y_w$ in the proof of the previous lemma. Applied to nodes $w \in W'_l$ it holds that $E[X] = \sum_{w \in W'_l} \lfloor s r c_w \rfloor + E[Y] \in [r s a'_l - 1,\ r s a'_l + 1]$. Since $E[Y] \le |W'_l|$, the first bound follows from the fact that $\Pr[|Y - E[Y]| \ge \beta E[Y]] \le e^{-\beta^2 E[Y]/(2(1 + \beta/3))}$ for any $\beta > 0$. The second bound follows along the same lines as in the proof of Lemma 2.5.

Furthermore, the following result holds, which is crucial for SPREAD to be redundant.

Lemma 2.7. For every subframe f there are at least r different nodes in C(f), w.h.p., where the probability depends on the constant $\sigma$ in the stretch factor.

Proof. Let $q = |W_l|$. If $q \ge r$, then the lemma is trivially true. So suppose that q < r. Since $c_v \le 1/r$ for all v, a relative capacity of at least $1 - q/r$ must be covered by $W_s$. Hence, $|W_s| \ge (1 - q/r)/(1/((s - 1) r)) = (s - 1)(r - q)$ and $\sum_{v \in W_s} |I(v)| \ge (r s)(1 - q/r) = (r - q) s$. Suppose that the nodes in $W_s$ are numbered from 1 to t. Consider any fixed node $v \in W_s$ and focus on the point y implied by v's interval I(v) = [x, y). For any node $w \in W_s$ let the binary random variable $X_w$ be 1 if and only if $y \in I'(w)$ and let $X = \sum_w X_w$.
Since $E[X_w] = \Pr[X_w = 1] = \min\{|I'(w)|, 1\}$ for all $w \in W_s \setminus \{v\}$ it holds that

$$E[X] = q + \sum_{w \in W_s \setminus \{v\}} E[X_w] = q + \sum_{w \in W_s \setminus \{v\}} \min\{|I'(w)|, 1\} \ge q + (1 - \epsilon)(r - q) s - 1 \ge r + (1 - \epsilon) s - 2.$$

Since $s = \Theta(\log N)$ and the starting points of the intervals are chosen uniformly and independently at random, it follows from the Chernoff bounds that $X \ge r$ for point y, w.h.p. Since we only need to consider those points in [0, 1) that are starting points or endpoints of intervals I(v) in order to cover all subframes in [0, 1), and the lowest values for E[X] are reached when focusing on endpoints of intervals I(v) of nodes $v \in W_s$, the lemma follows.

Recall that for each subframe f, we maintain one (and sometimes two) $(s/\alpha) \times r$-table $T_f$ of $r \cdot s/\alpha$ slots which are organized into $s/\alpha$ groups (or columns) of size r each. We assume that $\alpha > 0$ is a sufficiently small constant with $1/\alpha \in \mathbb{N}$. Our goal is to assign the slots of $T_f$ to nodes in f so that the following conditions are met:

Table conditions:

1. Every slot is assigned to (i.e., owned by) exactly one node in C(f).

2. Every node v in C(f) owns within $(1 \pm \gamma) \mu_f(v)/\alpha$ many slots but at most $s/\alpha$ many slots in $T_f$, for some constant $0 < \gamma < 1$ to be specified below.

3. Every group consists of slots belonging to different nodes.

The $\gamma$ that is sufficient to maintain the table conditions is given in the following lemma. In this lemma, $\alpha$ is the parameter used in the table size and $\beta$ is the parameter used in Lemmas 2.5 and 2.6.

Lemma 2.8. If the bounds in Lemma 2.6 are true and $\alpha$ and $\beta$ are sufficiently small, then conditions 1 and 2 can be met with any $\gamma \ge \beta/(1 - \beta) + \alpha$.

Proof. Recall the bounds in Lemma 2.6 and ignore the O(log n) terms for a moment. We consider the nodes in $W_l$ and $W_s$ separately, starting with $W_l$.

Let $W_h \subseteq W_l$ be the set of all nodes v with $c_v$ large enough so that it is guaranteed that $\mu_f(v) \ge 1/\gamma$ for $\gamma$ as chosen in the lemma. In this case, every $v \in W_h$ satisfies $(1 - \gamma) \mu_f(v)/\alpha \le (\mu_f(v) - 1)/\alpha \le \lfloor r s c_v/\alpha \rfloor$ and $(1 + \gamma) \mu_f(v)/\alpha \ge (\mu_f(v) + 1)/\alpha \ge \lceil r s c_v/\alpha \rceil$. Moreover, $\lceil r s c_v/\alpha \rceil \le s/\alpha$. Hence, each $v \in W_h$ can be given either $\lfloor r s c_v/\alpha \rfloor$ or $\lceil r s c_v/\alpha \rceil$ many slots without violating condition 2. This implies that the nodes in $W_h$ can be given slots so that the total number of slots used by the nodes in $W_h$ is in $[\lfloor r s a_h/\alpha \rfloor, \lceil r s a_h/\alpha \rceil]$ where $a_h = \sum_{v \in W_h} c_v$. This is perfect up to an additive 1.

Next, consider the nodes in $W'_l = W_l \setminus W_h$. Let $C'_l(f) = C_l(f)|_{W'_l}$ be the multiset of nodes $v \in C_l(f)$ that are not in $W_h$. In this case, $(1 + \gamma) \mu_f(v)/\alpha < s/\alpha$ (given that s is sufficiently large), so we do not have to worry about limiting the number of slots of v by $s/\alpha$. Let $a'_l = \sum_{v \in W'_l} c_v$. Suppose for an upper bound on the number of slots per node that $|C'_l(f)| = r s a'_l - \beta |W'_l|$ (see Lemma 2.6).
If $\phi > 0$ is chosen so that $\sum_{v \in W'_l} [\mu_f(v)/\alpha + \phi/\alpha] \ge r s a'_l/\alpha$, then it suffices to assign at most $(\mu_f(v) + \phi)/\alpha + 1$ slots to every node $v \in C'_l(f)$ to cover at least an $a'_l$-fraction of the slots in f. Since $\sum_{v \in W'_l} \mu_f(v) = |C'_l(f)|$, it holds that $\sum_{v \in W'_l} [\mu_f(v)/\alpha + \phi/\alpha] \ge r s a'_l/\alpha$ if and only if $|C'_l(f)| \ge r s a'_l - \phi |W'_l|$, so we have to choose $\phi \ge \beta$ for this. For a lower bound, suppose that $|C'_l(f)| = r s a'_l + \beta |W'_l|$. If $\phi > 0$ is chosen so that $\sum_{v \in W'_l} [\mu_f(v)/\alpha - \phi/\alpha] \le r s a'_l/\alpha$, then at least $(\mu_f(v) - \phi)/\alpha - 1$ slots can be assigned to every node $v \in C'_l(f)$ to cover at most an $a'_l$-fraction of the slots in f. For this we have to choose $\phi$ so that $\phi \ge \beta$. Together with the fact that $\mu_f(v) \ge 1$ for all $v \in C'_l(f)$ it follows that $\gamma = \beta + \alpha$ is sufficient so that $(\mu_f(v) + \beta)/\alpha + 1 \le (1 + \gamma) \mu_f(v)/\alpha$ and $(\mu_f(v) - \beta)/\alpha - 1 \ge (1 - \gamma) \mu_f(v)/\alpha$.

Finally, consider the nodes in $W_s$. Suppose for an upper bound on the number of slots per node that $|C_s(f)| = (1 - \beta) r s a_s$ (see Lemma 2.6). If $\phi > 0$ is chosen so that $\sum_{v \in W_s} (1 + \phi) \mu_f(v)/\alpha \ge r s a_s/\alpha$, then it suffices to assign at most $(1 + \phi) \mu_f(v)/\alpha + 1$ slots to every node $v \in C_s(f)$ to cover an $a_s$-fraction of the slots in f. Since $\sum_{v \in W_s} \mu_f(v) = |C_s(f)|$, we have to choose $\phi$ so that $(1 + \phi)(1 - \beta) r s a_s/\alpha \ge r s a_s/\alpha$, which works for $\phi \ge 1/(1 - \beta) - 1 = \beta/(1 - \beta)$. For a lower bound, suppose that $|C_s(f)| = (1 + \beta) r s a_s$. If $\phi > 0$ is chosen so that $\sum_{v \in W_s} (1 - \phi) \mu_f(v)/\alpha \le r s a_s/\alpha$, then at least $(1 - \phi) \mu_f(v)/\alpha - 1$ slots can be assigned to every node $v \in C_s(f)$ without exceeding an $a_s$-fraction of the slots in f. For this we have to choose $\phi$ so that $(1 - \phi)(1 + \beta) r s a_s/\alpha \le r s a_s/\alpha$, which works for $\phi \ge 1 - 1/(1 + \beta) = \beta/(1 + \beta)$. Hence, $\gamma \ge \beta/(1 - \beta) + \alpha$ is sufficient for both cases.

Finally, notice that the bounds in Lemma 2.6 contain an O(log n) term that we have ignored so far. However, whenever this term is dominant for a set of nodes in some subframe, the sum of their capacities is at most $\delta$ for some constant $\delta > 0$ that can be made arbitrarily small depending on $\sigma$.
Hence, since we are considering only three sets of nodes, those sets of nodes where the O(log n) term is negligible (and can therefore be covered by $\beta$) represent a total capacity of at least $1 - 2\delta$, which is sufficient to fill all slots just with these nodes, or to leave enough space for the other nodes, as long as $\delta$ is sufficiently small compared to $\alpha$ and $\beta$.

If conditions 1 and 2 are true, condition 3 can also be met. To show this, consider the slots to be numbered row-wise from 1 to $r \cdot s/\alpha$ by giving slot (i, j) in group i the number $(j - 1) s/\alpha + i$. Assign the slots to the nodes so that each node $v \in C(f)$ owns a consecutive sequence of slots. Since every node owns at most $s/\alpha$ slots, no group can have two slots owned by the same node, which proves our claim.

The challenge, of course, will be to maintain these conditions as the system changes without rearranging too many slot assignments. Before we show how to do this, we prove that the table conditions allow us to maintain fairness.

Lemma 2.9. If the table conditions are met, then for every node v in the system and every data item d, Pr[v stores a copy of d] is in $[(1 - \gamma)(1 - \epsilon) r c_v,\ (1 + \gamma)(1 + \epsilon) r c_v]$ for the $\gamma$ in condition 2 and $\epsilon$ as chosen for the interval rounding.

Proof. Consider any node v and data item d. Let $p = |I'(v)| \bmod 1$, where $I'(v)$ is the rounded form of I(v). For an area of size p in [0, 1), v has a multiplicity of $\lceil |I'(v)| \rceil$, and for the remaining area of size (1 - p) in [0, 1), v has a multiplicity of $\lfloor |I'(v)| \rfloor$. Moreover, it follows from the table conditions that for any node v in some subframe f,

$$\Pr[v \text{ selected by a data item}] \in \big[(1 - \gamma) \mu_f(v)/s,\ \min\{(1 + \gamma) \mu_f(v), s\}/s\big].$$

Hence, it holds for data item d that

$$\Pr[v \text{ stores a copy of } d] \ge p \cdot \frac{(1 - \gamma) \lceil |I'(v)| \rceil}{s} + (1 - p) \cdot \frac{(1 - \gamma) \lfloor |I'(v)| \rfloor}{s} = \frac{(1 - \gamma) |I'(v)|}{s} \ge (1 - \gamma)(1 - \epsilon) r c_v.$$

Similarly, it holds that $\Pr[v \text{ stores a copy of } d] \le (1 + \gamma)(1 + \epsilon) r c_v$, which implies the lemma.

Since $\epsilon > 0$ and $\gamma > 0$ can be made arbitrarily small, the lemma implies that SPREAD can be made fair. Next we show that SPREAD can also be made adaptive.

2.4 Amortized adaptivity.

We start with amortized adaptivity, i.e., we show how to perform updates so that movements of data copies can always be charged to capacity changes in the past. SPREAD does not need to perform any adaptation of the slots as new data items are added or old data items are removed from the system, since Lemma 2.9 implies that the data items in the system will remain fairly distributed among the nodes. Hence, it remains to describe how to react to changes in the capacities of the nodes.

Suppose that the capacities of the nodes change from $c_1, \ldots, c_n$ to $c'_1, \ldots, c'_n$. For each node identifier v that has not been used before (i.e., n increases), some subframes may have to be cut into smaller subframes that take over the table of the previous subframe. Hence, if the previous subframe satisfied the table conditions, the new ones will also do so. Our rounding conditions may then require updates to these tables, which can be charged to past capacity changes as for the other cases below. Afterwards, we start with adapting the intervals I(v) to $c'_1, \ldots, c'_n$.
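The row-wise slot numbering used in Section 2.3 to establish table condition 3 can be illustrated with a small sketch. It assumes that the per-node slot counts already satisfy conditions 1 and 2 (they sum to the table size and no node exceeds the number of groups) and that each node appears only once in the input; the function name and the 0-based indexing are choices made for this example only.

```python
def assign_consecutive(counts, r, num_groups):
    """Fill a table of r rows and num_groups columns with consecutive runs.

    counts: list of (node, number_of_slots) pairs, one pair per node
    Returns a list `groups` where groups[i] holds the r owners of group i.
    """
    if sum(c for _, c in counts) != r * num_groups:
        raise ValueError("slot counts must add up to the table size")
    if any(c > num_groups for _, c in counts):
        raise ValueError("no node may own more slots than there are groups")
    # owners[t] is the owner of slot number t in the row-wise numbering.
    owners = [node for node, c in counts for _ in range(c)]
    # Slot t lies in row t // num_groups and in group (column) t % num_groups.
    return [
        [owners[row * num_groups + col] for row in range(r)]
        for col in range(num_groups)
    ]


groups = assign_consecutive(
    [("a", 3), ("b", 2), ("c", 3), ("d", 1), ("e", 3)], r=3, num_groups=4
)
```

A run of at most `num_groups` consecutive slot numbers visits every column at most once, so no group can contain the same node twice.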
Adapting the intervals may cause changes in C(f) of a subframe f, which may require updates in its table (or tables). Let us call a subframe f dirty if it is in $F_v$, contains the endpoint of I(v) and $|I(v)| \le |F_v|$. Otherwise, it is called clean. First, we describe how to update the table of a clean subframe, and then we consider dirty subframes.

Updating a clean subframe f. Let C(f) be the multiset of nodes in f before the change and $C'(f)$ be the multiset of nodes in f after the change in capacities. If $C'(f) \ne C(f)$, we go through the following stages.

1. Pairing stage: Suppose that the total decrease in the multiplicities of nodes in C(f) is $\delta_d$ and the total increase in the multiplicities of nodes in C(f) is $\delta_i$. Then we can identify $\min\{\delta_i, \delta_d\}$ pairs of nodes (v, w) where v wants to decrease its multiplicity whereas w wants to increase its multiplicity. For each such pair, we set $\mu_f(v) := \mu_f(v) - 1$ and $\mu_f(w) := \mu_f(w) + 1$ and then change slot assignments until table condition 2 is satisfied for v and w. For each such slot reassignment, we distinguish between three cases. If both v and w violate condition 2, then a slot of v is given to w. If only v violates condition 2, we give a slot of v to any node w' that can still take a slot without violating condition 2 (we will see below that such a node w' can always be found, w.h.p.). If only w violates condition 2, we give a slot from any node v' that can lose a slot without violating condition 2 to w. For each slot x given from some node u to some node u', we use the following slot switching strategy to preserve table condition 3.

Switching strategy: If x belongs to some group g in which no other slot is assigned to u', we are done. Otherwise, there must be a group g' with no slot assigned to u', since otherwise u' would have more than $s/\alpha$ slots at the end, violating condition 2. Since condition 3 was true before the movement, there must be a slot x' in g' that is assigned to a node u'' that has no slot in g. Then switch slots x and x' among u' and u'', which repairs condition 3.

2. Movement stage: After the pairing stage, we only have nodes left that all want to decrease or all want to increase their multiplicities. We consider these node by node. For each node v among these, we update v's multiplicity and then either move slots to v or away from v, using the slot switching strategy of the pairing stage if necessary, until v satisfies condition 2.

Of course, it is not obvious that suitable slots can always be found for the reassignments (apart from the case in the pairing stage in which both v and w still violate condition 2), but the following lemma implies that this is possible. In it, $C''(f)$ represents the current multiset during the process of moving from C(f) to $C'(f)$.
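The switching strategy can be sketched in a few lines. The code below is an illustration under assumptions rather than the paper's data structure: a table is modelled as a list of groups, each group as a list of owner identifiers, and the function name is hypothetical. It only preserves table condition 3; checking condition 2 is left to the caller.

```python
def give_slot(groups, g, x, new_owner):
    """Give slot x of group g to new_owner while keeping every group
    free of duplicate owners (table condition 3).

    groups: list of lists, groups[g][x] is the node owning slot x of group g
    """
    others_in_g = {owner for k, owner in enumerate(groups[g]) if k != x}
    if new_owner not in others_in_g:
        groups[g][x] = new_owner        # no conflict in group g: done
        return
    # new_owner already owns another slot of group g. Find a group without
    # new_owner and, in it, a slot whose owner has no slot in group g.
    for group in groups:
        if new_owner in group:
            continue
        for x2, third in enumerate(group):
            if third not in others_in_g:
                groups[g][x] = third    # the third node takes slot x
                group[x2] = new_owner   # new_owner takes the third node's slot
                return
    raise ValueError("no conflict-free switch exists")


table = [["a", "b"], ["c", "d"]]
give_slot(table, 0, 0, "b")   # "a" loses a slot, "b" gains one
```

In the example, the slot cannot simply be handed to "b" because "b" already sits in the first group, so "c" takes that slot and "b" takes the slot "c" held in the second group. The slot counts of "a" and "b" change by one each while the count of "c" stays the same.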

Lemma 2.10. In any situation in which $C''(f)$ is within $C(f)$ and $C'(f)$, conditions 1 and 3 are true, and at most one node violates condition 2, condition 2 can be repaired for it so that all table conditions are met, w.h.p.

Proof. For any node $w \in C''(f)$, let $s_w$ be the number of slots $w$ has in the table $T_f$. Let $v$ be the node that is violating condition 2. Then $v$ either needs additional slots or has to give up slots.

Suppose first that $v$ needs additional slots. In this case, $s_v < (1-\gamma)\mu_f(v)/\alpha$. As long as there is a node $w$ with $s_w - 1 \ge (1-\gamma)\mu_f(w)/\alpha$, we can move a slot from $w$ to $v$ until $s_v \ge (1-\gamma)\mu_f(v)/\alpha$. Then we have repaired condition 2 for $v$ without violating condition 2 for any of the other nodes. Suppose, however, that we reach a point in which $s_v < (1-\gamma)\mu_f(v)/\alpha$ but there is no node $w$ any more with $s_w - 1 \ge (1-\gamma)\mu_f(w)/\alpha$. In this case, the total number of slots occupied by the nodes in $C''(f)$ is less than

$$\frac{(1-\gamma)\mu_f(v)}{\alpha} + \sum_{w \neq v}\left[\frac{(1-\gamma)\mu_f(w)}{\alpha} + 1\right] < \frac{1-\gamma}{\alpha}\sum_{w \in C''(f)} \mu_f(w) + |C''(f)| = \left(\frac{1-\gamma}{\alpha} + 1\right)|C''(f)|.$$

If $\gamma \ge \beta/(1-\beta) + \alpha$, it follows from Lemma 2.5 that this is at most

$$\frac{1-2\beta}{\alpha(1-\beta)}\,|C''(f)| \le \frac{(1-2\beta)(1+\beta)\,rs}{\alpha(1-\beta)}$$

w.h.p. It holds that $(1-2\beta)(1+\beta) = 1 - \beta - 2\beta^2 < 1 - \beta$, so $\sum_w s_w < rs/\alpha$, which is a contradiction since all slots must be owned by a node at any time. Hence, it will always be possible to reassign slots so as to repair condition 2 for $v$ in this case.

Next, we consider the case that $v$ needs to give up slots. In this case, $v$ gets stuck if $s_v > (1+\gamma)\mu_f(v)/\alpha$ and for all other nodes $w$, $s_w + 1 > (1+\gamma)\mu_f(w)/\alpha$. Then the total number of slots occupied by the nodes in $C''(f)$ is more than

$$\frac{(1+\gamma)\mu_f(v)}{\alpha} + \sum_{w \neq v}\left[\frac{(1+\gamma)\mu_f(w)}{\alpha} - 1\right] > \frac{1+\gamma}{\alpha}\sum_{w \in C''(f)} \mu_f(w) - |C''(f)| \ge \left(\frac{1}{\alpha(1-\beta)} + 1\right)\sum_{w \in C''(f)} \mu_f(w) - |C''(f)| \ge \frac{1}{\alpha(1-\beta)}\,(1-\beta)\,rs = rs/\alpha$$

w.h.p. This, however, is a contradiction. Hence, slots from $v$ can be reassigned to other nodes until condition 2 holds for $v$.

Next, we bound the number of slot reassignments. A step in the movement or pairing stage is defined as the process of fixing the table conditions after the multiplicity of a node or pair of nodes has changed by 1.

Lemma 2.11. In each step of the pairing or movement stage, at most $2(1+\gamma)/\alpha$ slots have to be reassigned in order to repair the table conditions.

Proof. First, consider the movement stage. Suppose that the multiplicity of some node $v$ increases by 1. We know that for the old multiplicity $\mu_f(v)$ of $v$ it holds that $s_v \ge (1-\gamma)\mu_f(v)/\alpha$. Hence, at most $1/\alpha$ slots have to be moved to $v$ to satisfy $s_v \ge (1-\gamma)(\mu_f(v)+1)/\alpha$. Since each slot movement may require a flip with another slot to repair condition 3, the total number of slot reassignments is at most $2/\alpha$ in this case. For the case that the multiplicity of some node $v$ decreases by 1, at most $2(1+\gamma)/\alpha$ slots have to be reassigned. The worst case happens if $\mu_f(v)$ was previously 1 and $v$ had $(1+\gamma)/\alpha$ slots.

Next, consider a step of the pairing stage. Suppose that the multiplicity of node $v$ decreases by 1 while the multiplicity of $w$ increases by 1. Then $v$ has to give up at most $(1+\gamma)/\alpha$ slots while $w$ has to get at most $1/\alpha$ slots in order to repair condition 2. In any step of repairing condition 2 for $v$ and/or $w$ ($v$ gives a slot to $w$, or $v$ gives up a slot, or $w$ gains a slot), at most 2 slot reassignments are necessary, so altogether at most $2(1+\gamma)/\alpha$ slot reassignments are needed to repair condition 2 for $v$ and $w$.

Hence, given a total change in the multiplicities of the nodes by $\mu$, at most $2\mu(1+\gamma)/\alpha$ slot reassignments are necessary to get from $C(f)$ to $C'(f)$. Given $k$ slot reassignments, the probability that a specific copy of a data item $d$ with $g_1(d) \in f$ needs to be replaced is equal to $k/(rs/\alpha)$. Thus, the expected number of copy movements is at most

$$\frac{2\mu(1+\gamma)/\alpha}{rs/\alpha}\cdot |f|\cdot r\cdot |D| = \frac{2(1+\gamma)\,\mu\,|f|\,|D|}{s}$$

where $D \subseteq U$ is the set of data items that are stored in the system. A change in multiplicities by $\mu$ can be charged to a capacity change of $c(f) \ge |f|\mu/(rs)$ with respect to $f$ in the past, and the way we perform interval rounding makes it possible that every capacity change is charged at most once. Thus, with respect to $c(f)$, the expected number of copy movements is at most

$$\frac{2(1+\gamma)}{s}\cdot rs\cdot c(f)\cdot |D| = 2(1+\gamma)\,c(f)\cdot r\,|D|.$$

According to Fact 1.1, a capacity change of $c(f)$ requires the replacement of at least $c(f)\cdot r\,|D|/2$ copies for the copy distribution to remain fair with respect to $f$. Hence, for clean subframes SPREAD is amortized $4(1+\gamma)$-adaptive.

Updating a dirty subframe $f$. If $f$ is dirty, we maintain two tables for $f$: one table, $T_1$, for the interval $f_1$ from the starting point of $f$ to the endpoint of $I(v)$, and one table, $T_2$, for the interval $f_2$ from the endpoint of $I(v)$ to the endpoint of $f$. The multisets $C(f_1)$ and $C(f_2)$ are equal to $C(f)$, with the only difference that $C(f_1)$ contains a copy of node $v$ while $C(f_2)$ does not contain $v$. Our goal is to make sure that $T_1$ and $T_2$ differ in at most $2(1+\gamma)/\alpha$ slots, which we call the proximity condition. This is ensured by the following strategy.

Suppose that $C(f)$ stays the same but the size of $I(v)$ changes. Then we only update $f_1$ and $f_2$ accordingly and leave the tables $T_1$ and $T_2$ as before, which satisfies the proximity condition. If $C(f)$ only changes because the endpoint of $I(v)$ enters $f$ from below, then $T_2$ inherits the table of $f$, and the table for $f_1$ is obtained by applying slot reassignments for $v$ to $T_2$ until the table conditions are met for $f_1$. According to Lemma 2.11, this requires the reassignment of at most $2(1+\gamma)/\alpha$ slots, so the proximity condition holds afterwards. If $C(f)$ only changes because the endpoint of $I(v)$ is moving below $f$, then $T_2$ is chosen as the table for $f$. Similar solutions can be found if the endpoint of $I(v)$ enters or leaves $f$ from above. In all other cases, we first ignore a potential change in $I(v)$, which means that the sizes of $f_1$ and $f_2$ remain the same.
We then adapt the table $T_i$ of the interval $f_i$ of largest size among $f_1$ and $f_2$ as described for the table of a clean subframe $f$ to get from $C(f)$ to $C'(f)$ (ignoring changes in $C(f)$ due to $I(v)$ entering or leaving $f$), and then we construct the table of the other interval by performing at most $2(1+\gamma)/\alpha$ further slot reassignments in order to remove or add slots for $v$. Afterwards, we update the sizes of $f_1$ and $f_2$ if necessary (i.e., if $I(v)$ has changed).

Next, we bound the adaptivity of SPREAD for dirty subframes. Suppose that $C(f)$ stays the same but the size of $I(v)$ changes by some $\ell$ in $f$. Then this can be charged to a change in capacity of $v$ of $c = \ell/(rs)$. Hence, the proximity condition ensures that the expected number of copy movements is at most

$$\ell\cdot\frac{2(1+\gamma)/\alpha}{rs/\alpha}\cdot r\,|D| = 2(1+\gamma)\,c\cdot r\,|D|$$

which implies that, with respect to $c$, SPREAD is $4(1+\gamma)$-adaptive in this case. If $C(f)$ only changes because the endpoint of $I(v)$ enters $f$ from below or is moving below $f$, and this is associated with a change of $I(v)$ of length $\ell$ with respect to $f$, then it follows analogously to the first case that SPREAD is $4(1+\gamma)$-adaptive.

It remains to consider any remaining cases. Let $\mu \ge 1$ be the total change in multiplicities in $C(f)$ (ignoring the changes caused by $I(v)$). W.l.o.g., suppose that $f_1$ is larger than $f_2$. Then at most $2\mu(1+\gamma)/\alpha$ slots are reassigned in $T_1$ and at most $(\mu+2)\cdot 2(1+\gamma)/\alpha$ slots are reassigned in $T_2$. The latter bound holds because $T_2$ differs in at most $2(1+\gamma)/\alpha$ slots from $T_1$ before and after the reassignment. Since a total change of $\mu$ can be charged to a capacity change of $c(f) \ge |f|\mu/(rs)$ with respect to $f$ in the past, $|f_1| \ge |f_2|$, and $\mu \ge 1$, it follows from the arguments for clean subframes that the expected number of copy movements due to these changes is at most

$$\left(\frac{\mu\cdot 2(1+\gamma)/\alpha}{rs/\alpha}\,|f_1| + \frac{(\mu+2)\cdot 2(1+\gamma)/\alpha}{rs/\alpha}\,|f_2|\right) r\,|D| \le 2\mu\,|f|\cdot\frac{2(1+\gamma)/\alpha}{rs/\alpha}\cdot r\,|D| \le \frac{4(1+\gamma)}{s}\cdot rs\cdot c(f)\cdot |D| = 4(1+\gamma)\,c(f)\cdot r\,|D|.$$

For any additional changes due to $I(v)$, SPREAD is $4(1+\gamma)$-adaptive. Hence, overall SPREAD is amortized $8(1+\gamma)$-adaptive for dirty subframes.
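As a short worked step, the adaptivity constants can be read off by dividing the upper bounds on the expected number of copy movements by the lower bound quoted from Fact 1.1, namely that a capacity change of $c(f)$ forces at least $c(f)\cdot r\,|D|/2$ replacements. The block only restates quantities from the text above.

```latex
\[
\underbrace{\frac{2(1+\gamma)\,c(f)\,r\,|D|}{c(f)\,r\,|D|/2}}_{\text{clean subframes}} = 4(1+\gamma),
\qquad
\underbrace{\frac{4(1+\gamma)\,c(f)\,r\,|D|}{c(f)\,r\,|D|/2}}_{\text{dirty subframes}} = 8(1+\gamma).
\]
```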
Summing up the adaptivity bounds over all subframes results in an amortized adaptivity of $8(1+\gamma)$.

2.5 Adaptivity. In order to get from amortized adaptivity to adaptivity, we replace the deterministic rounding rule for the intervals above by a randomized rounding rule. More specifically, we choose an additional (pseudo-)random hash function $h' : V \to [0, 1)$, and for every interval $I(w)$ and subframe $f = [x, y)$ in $F_v$ with $v \neq w$ (or $I(w) \subseteq F_w$), we check the following:

Randomized rounding conditions:
1. Whenever the endpoint of $I(w)$ crosses $x + h'(v)/|f| \bmod 1$ from below, $\mu_f(w)$ is increased by 1.
2. Whenever the endpoint of $I(w)$ crosses $x + h'(v)/|f| \bmod 1$ from above, $\mu_f(w)$ is decreased by 1.

With this rule we obtain the following result:

Lemma 2.12. For any capacity change, SPREAD is $8(1+\gamma)$-adaptive.
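The randomized rounding rule stated before Lemma 2.12 can be sketched as follows. The sketch makes an assumption that should be read with care: the threshold inside a subframe $f = [x, y)$ is taken to be a uniformly distributed point of $f$, which matches the statement in the proof of Lemma 2.12 that a multiplicity changes with probability $|\delta I(v)|/|f|$, but it is not necessarily the paper's exact threshold expression. All function names are hypothetical, and SHA-256 merely stands in for the pseudo-random hash function.

```python
import hashlib


def pseudo_random_unit(node_id: str) -> float:
    """Hypothetical stand-in for the additional hash: maps a node id to [0, 1)."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    # keep 53 bits so the quotient is exactly representable and strictly below 1
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def rounding_threshold(x: float, y: float, u: float) -> float:
    """Assumed threshold: the point a fraction u of the way through f = [x, y)."""
    return x + u * (y - x)


def multiplicity_change(threshold: float, old_end: float, new_end: float) -> int:
    """+1 if the endpoint crosses the threshold from below, -1 from above, else 0."""
    if old_end < threshold <= new_end:
        return 1
    if new_end < threshold <= old_end:
        return -1
    return 0


def crossing_probability(x: float, y: float, old_end: float, new_end: float) -> float:
    """Chance of a change for a uniform threshold when both endpoints lie in [x, y)."""
    return abs(new_end - old_end) / (y - x)


t = rounding_threshold(0.25, 0.5, 0.5)            # 0.375
print(multiplicity_change(t, 0.3125, 0.4375))     # 1  (crossed from below)
print(multiplicity_change(t, 0.4375, 0.3125))     # -1 (crossed from above)
print(crossing_probability(0.25, 0.5, 0.3125, 0.4375))  # 0.5
```

Because the threshold is fixed per node by the hash rather than redrawn, an endpoint that moves back and forth over a short range changes the multiplicity only when it actually passes the threshold.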

Proof. First, suppose that $n$ does not change. Consider any capacity change in the system, and for any node $v$ let $\delta I(v)$ be the interval representing the difference between $I(v)$ before and $I(v)$ after the change. Suppose that the starting point of $\delta I(v)$ is in some subframe $f$ and the endpoint of $\delta I(v)$ is in some subframe $f'$. First, consider the case that $\delta I(v) \subseteq f$, i.e., $f = f'$. Then it is easy to check that the probability that $\mu_f(v)$ increases or decreases by 1 is equal to $|\delta I(v)|/|f|$. We know from above that an increase or decrease of a multiplicity by 1 in a subframe $f$ requires the replacement of an expected number of at most $(4(1+\gamma)|f|/s)\,|D|$ copies in the system. Since $c_v$ changed by $\delta c_v = |\delta I(v)|/(rs)$, this means that the expected number of copies replaced due to $v$ is at most

$$\frac{|\delta I(v)|}{|f|}\cdot\frac{4(1+\gamma)\,|f|}{s}\cdot |D| = 4(1+\gamma)\,\delta c_v\cdot r\,|D|.$$

For $\delta I(v) \not\subseteq f$, similar arguments also yield that the expected number of copies replaced due to $v$ is at most $4(1+\gamma)\,\delta c_v\cdot r\,|D|$.

If $n$ increases, then for any rounded $I(v)$ with endpoint in some subframe $f = [x, y)$ before the increase, $I(v)$ is only rounded again, applying the new decomposition, if $I(v) \subset I'(v)$ and the endpoint of $I(v)$ passes $x + h'(v)/|f|$ or $y$, or if $I'(v) \subset I(v)$ and the endpoint of $I(v)$ passes $x$ or $x + h'(v)/|f|$, which preserves our adaptivity bound.

Combining all of the results in this section yields Theorem 1.1. Note that with the help of Chernoff bounds the adaptivity bounds can also be shown to hold w.h.p. (up to minor order terms) if the total capacity change is $\omega(\phi \log n)$, where $\phi$ is an upper bound on the maximum size of a subframe in the system. Hence, the smaller the subframes, the smaller will be the deviation from the expected number of replaced copies.

2.6 Variable number of copies per data item. If the number of copies per data item varies but is upper bounded by $r$, then we just need to slightly adapt our storage strategy.
For any data item with $r' \le r$ copies, we first select a group of $r$ distinct nodes as before and then store the $r'$ copies among $r'$ of these nodes by selecting a (pseudo-)random starting node $v$ in the group (via some additional hash function) and then storing copies at the subsequent $r'$ nodes in the group (where we treat the group as a ring). It is not difficult to show that this preserves all the properties shown above for data items with redundancy exactly $r$.
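For the variable-redundancy strategy of Section 2.6, a minimal sketch of the ring selection might look as follows. It assumes that the group of $r$ distinct nodes has already been chosen, and it assumes that the starting node itself receives the first copy, which the text leaves open. The names `start_index` and `place_copies` are hypothetical, and SHA-256 stands in for the additional hash function.

```python
import hashlib
from typing import List


def start_index(item_key: str, r: int) -> int:
    """Hypothetical additional hash: pick a pseudo-random starting position in the group."""
    digest = hashlib.sha256(item_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % r


def place_copies(group: List[str], r_prime: int, start: int) -> List[str]:
    """Return the r_prime nodes that store copies, walking the group as a ring."""
    r = len(group)
    if not 0 < r_prime <= r:
        raise ValueError("the number of copies must be between 1 and the group size")
    # consecutive ring positions are distinct as long as r_prime <= r
    return [group[(start + k) % r] for k in range(r_prime)]


group = ["a", "b", "c", "d"]
print(place_copies(group, 3, 2))  # ['c', 'd', 'a']
print(place_copies(group, 2, start_index("item-42", len(group))))
```

Since the selected nodes are consecutive positions of a group of distinct nodes, no two copies of the same item land on the same node.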