A Tale of Two Erasure Codes in HDFS

1 A Tale of Two Erasure Codes in HDFS
Dynamo
Mingyuan Xia*, Mohit Saxena+, Mario Blaum+, and David A. Pease+
* McGill University, + IBM Research Almaden
FAST '15
何军权

2 Outline
Introduction & Motivation
Design
Evaluation
Conclusions
Related work

3 Introduction & Motivation

4 Big Data Storage
Reliability and Availability
  Replication: 3-way replication
  Erasure Code: Reed-Solomon (RS), LRC
GFS: 3-way replication, 3x, 2003
GFS v2: RS, 1.5x, 2012
FB HDFS: LRC, 1.66x, 2013
FB HDFS: RS, 1.4x, 2011
Azure: LRC, 1.33x

5 Popular Erasure Code Families
Product Code (PC)
Local Reconstruction Code (LRC)
Other: Reed-Solomon (RS)

PC layout:
  a0 a1 a2 a3 a4 ha
  b0 b1 b2 b3 b4 hb
  P0 P1 P2 P3 P4 h

LRC layout:
  a0 a1 a2 a3 a4  a5  G1
  a6 a7 a8 a9 a10 a11 G2
  L0 L1 L2 L3 L4  L5

6 Erasure Code in Facebook HDFS
RS(10,4)
  Computes 4 parities per 10 data blocks
  All blocks are stored on different storage nodes
  Storage overhead: 1.4x
Stripe layout: D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 P1 P2 P3 P4
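The 1.4x figure follows directly from the stripe shape. A minimal sketch of the arithmetic, where the helper name `storage_overhead` is hypothetical and not part of HDFS:

```python
from fractions import Fraction

def storage_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Blocks stored per stripe divided by data blocks per stripe."""
    return Fraction(data_blocks + parity_blocks, data_blocks)

# RS(10,4): 14 blocks stored for every 10 data blocks.
rs_10_4 = storage_overhead(10, 4)
# RS(6,3), the GFS v2 configuration listed on the earlier slide.
rs_6_3 = storage_overhead(6, 3)

print(float(rs_10_4), float(rs_6_3))
```

Using `Fraction` keeps the ratio exact until it is printed, which avoids rounding noise when comparing schemes.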

7 Erasure Code: High Degraded Read Latency
A read to an unavailable block requires multiple disk reads, network transfers, and compute cycles to decode
[Figure: a client read against HDFS raises an exception]

8 Erasure Code: Long Reconstruction Time
Facebook's cluster:
  100K blocks lost per day
  50 machine-unavailability events per day
  Reconstruction traffic: 180TB per day
[Figure: reconstruction job running on HDFS]

9 Erasure Code: Recovery Cost
Degraded read latency
Reconstruction time
Recovery cost: the total number of blocks required to reconstruct a data block after a failure

10 Recovery Cost vs. Storage Overhead
Conclusion: within a single erasure code, storage overhead and recovery cost trade off against each other.
[Figure: recovery cost vs. storage overhead for FB HDFS RS, Azure LRC, GFS v2 RS, FB HDFS LRC, and GFS 3-way Repl]
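The tradeoff can be made concrete for the simplest schemes. The sketch below assumes the standard properties that a replicated block is restored from one surviving copy and that RS(k, m) decodes from any k surviving blocks; the function names are illustrative only:

```python
def replication(copies: int) -> dict:
    # Any surviving replica restores the block, so one block is read.
    return {"overhead": float(copies), "recovery_cost": 1}

def reed_solomon(k: int, m: int) -> dict:
    # RS(k, m) needs k surviving blocks of the stripe to decode.
    return {"overhead": (k + m) / k, "recovery_cost": k}

schemes = {
    "3-way replication": replication(3),
    "RS(6,3)": reed_solomon(6, 3),
    "RS(10,4)": reed_solomon(10, 4),
}

# Sort from most to least storage-efficient and show what it costs to recover.
ranked = sorted(schemes.items(), key=lambda kv: kv[1]["overhead"])
for name, s in ranked:
    print(f"{name:18s} overhead={s['overhead']:.2f}x  recovery_cost={s['recovery_cost']}")
```

Reading the output top to bottom, overhead rises while recovery cost falls, which is the tension the slide describes.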

11 How to balance?
Storage Overhead vs. Recovery Cost

12 Data Access Skew
Conclusions
  Only a small fraction of data is "hot": P(freq > 10) ~= 1%
  Most data is "cold": P(freq <= 10) ~= 99%

13 Data Access Skew
Hot data
  High access frequency
  A small fraction of the data
Cold data
  Low access frequency
  The major fraction of the data
A small improvement in read cost yields a large gain in read performance
  Hot data: decrease the recovery cost
A small reduction in stored data saves a large amount of storage space
  Cold data: high storage efficiency

14 HACFS
System State
  Tracks file states
  File size, last mtime
  Read count and coding state
Adaptive Coding
  Tracks system states
  Chooses the coding scheme based on read count and mtime
Erasure Coding
  Provides four coding interfaces
  Encode/Decode
  Upcode/Downcode

15 Erasure Coding Algorithms
Two different erasure codes
Fast code
  Encodes the frequently accessed blocks to reduce read latency and reconstruction time
  Provides overall low recovery cost
Compact code
  Encodes the less frequently accessed blocks to get low storage overhead
  Maintains a low and bounded storage overhead

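16 State Transition in HACFS
Recently created -> 3-way replication
3-way replication -> Fast Code: write cold and COND
3-way replication -> Compact Code: write cold and COND'
Fast Code -> Compact Code: COND'
Compact Code -> Fast Code: COND
COND: read hot and bounded
COND': read cold or not bounded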
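The transitions can be sketched as a small state machine. This is an illustration under assumptions: the boolean inputs `write_cold`, `read_hot`, and `bounded` stand in for whatever thresholds the system actually applies, and the names `Coding` and `next_state` are hypothetical.

```python
from enum import Enum

class Coding(Enum):
    REPLICATED = "3-way replication"
    FAST = "fast code"
    COMPACT = "compact code"

def next_state(state: Coding, write_cold: bool, read_hot: bool, bounded: bool) -> Coding:
    # COND  = read hot AND storage overhead still within its bound.
    # COND' = read cold OR not bounded, i.e. the negation of COND.
    cond = read_hot and bounded
    if state is Coding.REPLICATED and not write_cold:
        # Recently created files stay replicated until writes stop.
        return Coding.REPLICATED
    return Coding.FAST if cond else Coding.COMPACT

# A freshly written file that turns write-cold but is still read often:
print(next_state(Coding.REPLICATED, write_cold=True, read_hot=True, bounded=True))
# The same file once the storage bound is exceeded:
print(next_state(Coding.FAST, write_cold=True, read_hot=True, bounded=False))
```

Because COND' is the logical negation of COND, a single boolean is enough to pick between the two erasure-coded states.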

17 Fast and Compact Product Codes (1)
ha1 = RS(a0, a1, a2, a3, a4)
Pa0 = XOR(a0, a5)

Fast Code (Product Code 2x5)
  Storage overhead: 1.8x
  Recovery cost: 2
  a0  a1  a2  a3  a4  ha1
  a5  a6  a7  a8  a9  ha2
  Pa0 Pa1 Pa2 Pa3 Pa4 Pha

Compact Code (Product Code 6x5)
  Storage overhead: 1.4x
  a0 a1 a2 a3 a4 ha1
  a5 a6 a7 a8 a9 ha2
  b0 b1 b2 b3 b4 hb1
  b5 b6 b7 b8 b9 hb2
  c0 c1 c2 c3 c4 hc1
  c5 c6 c7 c8 c9 hc2
  P0 P1 P2 P3 P4 Ph

18 Fast and Compact Product Codes (2)
ha1 = RS(a0, a1, a2, a3, a4)
Pa0 = XOR(a0, a5)
P0 = XOR(a0, a5, b0, b5, c0, c5)

Fast Code (Product Code 2x5)
  Storage overhead: 1.8x
  Recovery cost: 2
  a0  a1  a2  a3  a4  ha1
  a5  a6  a7  a8  a9  ha2
  Pa0 Pa1 Pa2 Pa3 Pa4 Pha

Compact Code (Product Code 6x5)
  Storage overhead: 1.4x
  Recovery cost: 5
  a0 a1 a2 a3 a4 ha1
  a5 a6 a7 a8 a9 ha2
  b0 b1 b2 b3 b4 hb1
  b5 b6 b7 b8 b9 hb2
  c0 c1 c2 c3 c4 hc1
  c5 c6 c7 c8 c9 hc2
  P0 P1 P2 P3 P4 Ph
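Both numbers quoted for the fast code can be checked with a few lines. The sketch below assumes a product code with `rows` x `cols` data blocks plus one parity row and one parity column; `xor_blocks` and `pc_overhead` are hypothetical helper names, and the byte strings are made-up sample data.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def pc_overhead(rows: int, cols: int) -> float:
    # One extra parity row and one extra parity column around the data grid.
    return ((rows + 1) * (cols + 1)) / (rows * cols)

# Fast code, column 0: Pa0 = XOR(a0, a5).
a0, a5 = b"hot-data", b"moredata"
pa0 = xor_blocks(a0, a5)

# If a0 is lost, the two surviving blocks of its column rebuild it,
# which is the recovery cost of 2 quoted for the fast code.
recovered_a0 = xor_blocks(a5, pa0)

print(recovered_a0 == a0, pc_overhead(2, 5), round(pc_overhead(6, 5), 2))
```

The compact code keeps the same row parities but stretches each column over six data blocks, so a lost data block is cheapest to rebuild from the five surviving blocks of its row.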

19 Fast and Compact LRC (1)
Fast Code (LRC(12,6,2))
  {G1, G2} = RS(a0, a1, ..., a11)
  Li = XOR(ai, ai+6)
  Storage overhead: 20/12 = 1.67x
  Recovery cost: 2
  a0 a1 a2 a3 a4  a5  G1
  a6 a7 a8 a9 a10 a11 G2
  L0 L1 L2 L3 L4  L5

Compact Code (LRC(12,2,2))
  {G1, G2} = RS(a0, a1, ..., a11)
  L0 = RS'(a0, a1, a2, a6, a7, a8)
  Storage overhead: 16/12 = 1.33x
  Recovery cost: 6
  a0 a1 a2 a3 a4  a5  G1
  a6 a7 a8 a9 a10 a11 G2
  L0 L1
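The overhead and recovery-cost figures for both LRC variants follow from the three parameters. A minimal sketch, assuming LRC(k, l, g) means k data blocks, l local parities that split the data evenly into groups, and g global parities (the function name is hypothetical):

```python
def lrc_properties(k: int, l: int, g: int) -> dict:
    if k % l != 0:
        raise ValueError("this sketch assumes equally sized local groups")
    group_size = k // l
    return {
        "overhead": (k + l + g) / k,
        # A lost data block is rebuilt from the other group members plus
        # the local parity: (group_size - 1) + 1 blocks.
        "data_recovery_cost": group_size,
    }

fast = lrc_properties(12, 6, 2)
compact = lrc_properties(12, 2, 2)
print(round(fast["overhead"], 2), fast["data_recovery_cost"])
print(round(compact["overhead"], 2), compact["data_recovery_cost"])
```

Fewer local parities mean larger groups, so the compact variant saves four blocks per stripe at the price of reading three times as many blocks on recovery.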

20 Upcoding for Product Codes
Fast Code PC(2x5), three stripes
  a0  a1  a2  a3  a4  ha1
  a5  a6  a7  a8  a9  ha2
  Pa0 Pa1 Pa2 Pa3 Pa4 Pha

  b0  b1  b2  b3  b4  hb1
  b5  b6  b7  b8  b9  hb2
  Pb0 Pb1 Pb2 Pb3 Pb4 Phb

  c0  c1  c2  c3  c4  hc1
  c5  c6  c7  c8  c9  hc2
  Pc0 Pc1 Pc2 Pc3 Pc4 Phc

Compact Code PC(6x5)
  a0 a1 a2 a3 a4 ha1
  a5 a6 a7 a8 a9 ha2
  b0 b1 b2 b3 b4 hb1
  b5 b6 b7 b8 b9 hb2
  c0 c1 c2 c3 c4 hc1
  c5 c6 c7 c8 c9 hc2
  P0 P1 P2 P3 P4 Ph

Parities h require no reconstruction
Parities P require no data block transfer
All parity updates can be done in parallel
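Why the P parities need no data block transfer can be seen from the XOR structure: the compact column parity is the XOR of the three fast column parities. A small check, using made-up block contents and a hypothetical `xor_blocks` helper:

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Column 0 of three fast stripes: (a0, a5), (b0, b5), (c0, c5).
a0, a5 = bytes(range(0, 8)), bytes(range(8, 16))
b0, b5 = bytes(range(16, 24)), bytes(range(24, 32))
c0, c5 = bytes(range(32, 40)), bytes(range(40, 48))

# Fast-code column parities, already stored on disk.
pa0, pb0, pc0 = xor_blocks(a0, a5), xor_blocks(b0, b5), xor_blocks(c0, c5)

# Upcoding: merge the three stored parities without touching data blocks.
p0_upcoded = xor_blocks(pa0, pb0, pc0)

# Reference: the compact parity computed from scratch over six data blocks.
p0_direct = xor_blocks(a0, a5, b0, b5, c0, c5)
print(p0_upcoded == p0_direct)
```

Each of the parity columns can be merged this way independently, which matches the note that all parity updates can run in parallel.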

21 Downcoding for Product Codes
Compact Code PC(6x5)
  a0 a1 a2 a3 a4 ha1
  a5 a6 a7 a8 a9 ha2
  b0 b1 b2 b3 b4 hb1
  b5 b6 b7 b8 b9 hb2
  c0 c1 c2 c3 c4 hc1
  c5 c6 c7 c8 c9 hc2
  P0 P1 P2 P3 P4 Ph

Fast Code PC(2x5), three stripes
  a0  a1  a2  a3  a4  ha1
  a5  a6  a7  a8  a9  ha2
  Pa0 Pa1 Pa2 Pa3 Pa4 Pha

  b0  b1  b2  b3  b4  hb1
  b5  b6  b7  b8  b9  hb2
  Pb0 Pb1 Pb2 Pb3 Pb4 Phb

  c0  c1  c2  c3  c4  hc1
  c5  c6  c7  c8  c9  hc2
  Pc0 Pc1 Pc2 Pc3 Pc4 Phc

Pa0 = XOR(a0, a5)
Pc0 = XOR(P0, Pa0, Pb0)
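The two formulas on this slide imply that only two of the three fast parities per column need data reads; the third falls out of the compact parity. A sketch with made-up block contents (the helper name `xor_blocks` is hypothetical):

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Column 0 of a compact PC(6x5) stripe and its stored parity P0.
a0, a5 = b"AAAAaaaa", b"BBBBbbbb"
b0, b5 = b"CCCCcccc", b"DDDDdddd"
c0, c5 = b"EEEEeeee", b"FFFFffff"
p0 = xor_blocks(a0, a5, b0, b5, c0, c5)

# Downcoding: Pa0 and Pb0 are computed from their data blocks ...
pa0 = xor_blocks(a0, a5)
pb0 = xor_blocks(b0, b5)
# ... while Pc0 is derived from parities only, without reading c0 or c5.
pc0 = xor_blocks(p0, pa0, pb0)

print(pc0 == xor_blocks(c0, c5))
```

In this column, four data blocks and one parity are read instead of six data blocks.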

22 Evaluation
Platform
  CPU: Intel Xeon E5645, 24 cores, 2.4GHz
  Disk: 7.2K RPM, 6*2TB
  Memory: 96GB
  Network: 1Gbps NIC
  Cluster size: 11 nodes
Workload
  CC: Cloudera Customer
  FB: Facebook

23 Evaluation Metrics
Degraded read latency
  Foreground read request latency
Reconstruction time
  Background recovery for failures
Storage overhead

24 Degraded Read Latency
Production systems: 16-21 seconds
HACFS: 10-14 seconds
Storage overhead of HACFS LRC and PC bounded to 1.4 and

25 Reconstruction Time
A disk with 100GB of data failed
HACFS-PC takes about 10-35 minutes less than the production systems
HACFS-LRC is worse than RS(6,3) in GFS v2
  To reconstruct global parities, HACFS-LRC needs to read 12 blocks, but GFS v2 only 6 blocks

26 System Comparison
Colossus FS: RS(6,3), 1.5x
HDFS-RAID: RS(10,4), 1.4x
Azure: LRC(12,2,2), 1.33x
HACFS-PC: PC(2x5), 1.8x; PC(6x5), 1.4x
HACFS-LRC: LRC(12,6,2), 1.67x; LRC(12,2,2), 1.33x

27 System Comparison
Colossus FS: RS(6,3), 1.5x
HDFS-RAID: RS(10,4), 1.4x
Azure: LRC(12,2,2), 1.33x
HACFS-PC: PC(2x5), 1.8x; PC(6x5), 1.4x
HACFS-LRC: LRC(12,6,2), 1.67x; LRC(12,2,2), 1.33x

lost block type   HACFS-PC           HACFS-LRC          Colossus FS   HDFS-RAID   Azure
data block        fast: 2, comp: 5   fast: 2, comp: 6
global parity     fast: 5, comp: 6   fast: 12, comp:

29 Conclusions
Erasure coding saves a large amount of storage space.
Production systems that use a single erasure code cannot balance the tradeoff between recovery cost and storage overhead well.
HACFS uses dynamically adaptive coding to provide both low recovery cost and low storage overhead.

30 Related Work
f4 (OSDI '14)
  Separates cold and hot data by data age
XOR-based Erasure Code (FAST '12)
  Combines RS with XOR
Minimum-Storage-Regeneration (MSR)
  Minimizes network transfers during reconstruction
Product-Matrix-Reconstruct-By-Transfer (PM-RBT) (FAST '15)
  Optimal in terms of I/O, storage, and network bandwidth

31 Thank You!

32 Acknowledgment
Prof. Xiong, Zigang Zhang, Biao Ma
CAS ICT Storage System Group
