Reliability at Scale


1 Reliability at Scale
Intelligent Storage Workshop 5
James Nunez, Los Alamos National Laboratory
LA-UR & LA-UR
May 15, 2007

2 A Word about Scale
Petaflop-class machines:
- LLNL Blue Gene (2005): 350 Tflops, 128k processors. Storage demands: 896 TB
- LANL Road Runner (2008): 1.4 Pflops, hybrid system. Storage demands: 3 PBytes (projected)

3 Why do you need so much storage?

4 What Drives Us? ASCI Balanced System Approach
[Figure: Computational resource scaling for ASCI physics applications, by year through 2012. Quantities plotted: computing speed (TFLOP/s), memory (TeraBytes), disk (TeraBytes), parallel I/O (GigaBytes/sec), network speed (Gigabits/sec), archival storage (PetaBytes), application performance.]
[Legend: ratios per 1 FLOP/s of peak compute: Byte/FLOP/s memory, Byte/FLOP/s disk, Byte/s/FLOP/s peak parallel I/O, bit/s/FLOP/s peak network, Byte/FLOP/s archive.]

5 Machines are Getting Faster
[Figure: Number of processing units by machine and year: Blue Mountain 3TF (1996), White 12TF (2001), Q 20TF (2003), Purple 100TF (2006), BG/L 300TF (2006).]
[Figure: CPU speed (MHz) vs. processor type (year): Intel 8088 (1987), HP PA (1989), MIPS R4000 (1991), DEC Alpha 4 (1994), DEC Alpha 5 (1996), DEC Alpha 21264A (1999), Intel Itanium (2001), Sun Ultra SPARC IIIcu (2001), AMD Opteron (2002), Intel Xeon (2006).]

6 Disk Drives are Getting Denser
[Figure: Storage capacity (MegaBytes) vs. disk type: Ramac, Fuji Eagle 690M-36k, Seagate Elite 1.4G-54k, Seagate Barracuda-4 4.3G-72k, Seagate Barracuda 18G-72k, Seagate Cheetah 73G-10k.]
[Figure: Data transfer rate (MegaBytes/sec) vs. disk type, for the same drives.]

7 Density versus Bandwidth
- Disks are getting faster, but not at the same rate as density
- Read/write bandwidth is not keeping up with machines
[Figure: Data rate per MB of capacity (MB/sec per MB) vs. disk type: Ramac, Fuji Eagle 690M-36k, Seagate Elite 1.4G-54k, Seagate Barracuda-4 4.3G-72k, Seagate Barracuda 18G-72k, Seagate Cheetah 73G-10k.]

8 The ASC I/O Ratio and Past Over-Engineering of the BW
ASC ratios: 1 GByte/sec per Tflop, and 20 Bytes/flop of disk.
- In 1996, on a 3 Tflop system, 20 bytes/flop is 60 TBytes of disk, which yielded about 48 GigaBytes/sec: over-engineered by a factor of 16X for BW
- In 2002, on a 20 Tflop system, 20 bytes/flop is 400 TBytes of disk, which yielded about 40 GigaBytes/sec: over-engineered by a factor of 2X for BW
- Today, on a 100 Tflop machine, 20 bytes/flop is 2000 TBytes of disk, which yields a little over 100 GigaBytes/sec: not over-engineered at all
We no longer get far more BW than we need simply as a side effect of buying the disk space!
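The arithmetic on this slide can be reproduced with a short sketch. The two ratios and the three machine sizes come from the slide; the function name `asc_sizing` is hypothetical, and "a little over 100 GigaBytes/sec" is rounded down to 100 as an assumption.

```python
# ASC balance ratios quoted on the slide.
DISK_BYTES_PER_FLOP = 20        # bytes of disk per flop/s of peak compute
REQUIRED_GB_S_PER_TFLOP = 1.0   # GByte/s of parallel I/O per Tflop

def asc_sizing(tflops, delivered_gb_s):
    """Return (disk TBytes, required GB/s, over-engineering factor).

    Hypothetical helper: Tflop * bytes/flop gives TBytes directly,
    and the factor is delivered bandwidth over required bandwidth.
    """
    disk_tb = tflops * DISK_BYTES_PER_FLOP
    required_gb_s = tflops * REQUIRED_GB_S_PER_TFLOP
    return disk_tb, required_gb_s, delivered_gb_s / required_gb_s

# (label, machine Tflops, GB/s that the disks delivered) from the slide;
# the last entry rounds "a little over 100" down to 100 (assumption).
machines = [("1996", 3, 48), ("2002", 20, 40), ("today", 100, 100)]

for label, tflops, delivered in machines:
    disk_tb, required, factor = asc_sizing(tflops, delivered)
    print(f"{label}: {disk_tb} TB disk, need {required:.0f} GB/s, "
          f"got {delivered} GB/s, factor {factor:.0f}X")
```

The factor falls from 16X to about 1X because the capacity ratio buys fewer spindles per Tflop as drives get denser.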

9 Classical RAID Plus 1 Rebuild
- Read the remaining disks, XOR, and write the result
- Speed is ultimately governed by the write speed of a (single) target disk or the read of N disks
- This is true for N+1 and N+2 with classical RAID
[Figure: five storage blades, with the rebuild computed as A0+B0+D0+P0.]
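A minimal sketch of the read, XOR, write sequence for a 4+1 stripe follows. The block contents and the helper name `xor_blocks` are made up for illustration; the choice of the third block as the lost one is an assumption read off the figure's A0+B0+D0+P0 label.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across four data disks (illustrative contents).
a0, b0, c0, d0 = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
p0 = xor_blocks([a0, b0, c0, d0])   # parity block on the fifth disk

# The disk holding c0 fails: read the N surviving blocks and XOR them.
rebuilt = xor_blocks([a0, b0, d0, p0])

# The single write of `rebuilt` to the replacement disk is the step that
# gates a classical rebuild.
assert rebuilt == c0
```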

10 Classical RAID Rebuild Time
Because density has increased with little increase in data rate, rebuild times get worse and worse, from minutes, to hours, to days, steadily raising the chance of a second or third disk failure during the rebuild.
[Figure: Classical RAID rebuild time (hours) by year, for a minimum (idle) and a busy system.]
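The trend can be illustrated with a back-of-the-envelope model in which rebuild time is simply capacity divided by the bandwidth available to the target disk. This is a sketch only: the function name, the `rebuild_share` parameter, and every drive figure below are hypothetical and are not taken from the slide's chart.

```python
def rebuild_hours(capacity_gb, rate_mb_s, rebuild_share=1.0):
    """Hours to rewrite one full disk.

    capacity_gb   -- disk capacity in decimal GB
    rate_mb_s     -- sustained transfer rate of the target disk in MB/s
    rebuild_share -- fraction of that rate left for rebuild on a busy
                     system (1.0 means an otherwise idle array)
    """
    seconds = (capacity_gb * 1000) / (rate_mb_s * rebuild_share)
    return seconds / 3600

# Hypothetical drive generations: (label, capacity in GB, rate in MB/s).
drives = [("older", 18, 20), ("mid", 73, 60), ("dense", 400, 60)]

for label, capacity_gb, rate in drives:
    idle = rebuild_hours(capacity_gb, rate)
    busy = rebuild_hours(capacity_gb, rate, rebuild_share=0.1)
    print(f"{label}: {idle:.2f} h idle, {busy:.2f} h busy")
```

Capacity grows much faster than transfer rate in this model, so the ratio, and with it the window for a further failure, keeps widening.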

11 More Bad News: Rebuild is Under Pressure from Unrecoverable Bit Error Rates (UBER) on Read
Drives:
- High-end 73 GB FC 2.5 inch 10k drives: UBER = 1 in 10^15 bits
- SATA 400 GB 3.5 inch drives: UBER = 1 in 10^14 bits
Rebuild:
- High-end 4+N has to read 292 GBytes, or 2.3*10^12 bits (error on rebuild every few thousand)
- SATA 4+N has to read 1600 GBytes, or 1.2*10^13 bits (error on rebuild every few tens)
- High-end 8+N has to read 584 GBytes, or 4.6*10^12 bits (error on rebuild every few thousand)
- SATA 8+N has to read 3200 GBytes, or 2.6*10^13 bits (error on rebuild every few tens)
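The bit counts above can be checked, and turned into a rough error estimate, with the sketch below. It assumes decimal GBytes, independent bit errors at exactly the quoted UBER, and a Poisson approximation for the chance of at least one error; the function name is hypothetical. Under these assumptions the estimate is order-of-magnitude only and need not match the slide's rounded "every few thousand" and "every few tens".

```python
import math

def rebuild_read_errors(n_data_disks, disk_gb, uber_bits):
    """Return (bits read, expected errors, P(at least one error)).

    uber_bits is the number of bits read per unrecoverable error,
    e.g. 1e15 for the high-end drives on the slide.
    """
    bits_read = n_data_disks * disk_gb * 1e9 * 8
    expected_errors = bits_read / uber_bits
    p_at_least_one = -math.expm1(-expected_errors)  # 1 - exp(-x)
    return bits_read, expected_errors, p_at_least_one

cases = [
    ("high-end 4+N", 4, 73, 1e15),
    ("SATA 4+N", 4, 400, 1e14),
    ("high-end 8+N", 8, 73, 1e15),
    ("SATA 8+N", 8, 400, 1e14),
]

for label, n, gb, uber in cases:
    bits, expected, p = rebuild_read_errors(n, gb, uber)
    print(f"{label}: {bits:.2e} bits read, "
          f"about one error per {1 / expected:.0f} rebuilds, "
          f"P(error in a rebuild) = {p:.4f}")
```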

12 Plus 1: RAID 5
- Independent data disks with distributed parity blocks
- Data is striped across a number of storage devices, and a parity stripe is written for fault tolerance
- Parity load is shared
- With disk block sizes getting bigger, you must aggregate more and more data to do an efficient write (no read, update, write)
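The cost of a write that does not cover a full stripe can be seen in a small sketch. It uses the standard XOR identity for single parity (new parity = old parity XOR old data XOR new data); the block contents and names are illustrative only.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# A 3+1 stripe with illustrative contents.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor(xor(stripe[0], stripe[1]), stripe[2])

# Small write: only block 1 changes. The old data and old parity must be
# read back before the new parity can be written (read, update, write):
# 2 reads plus 2 writes for one block of user data.
new_block = b"\xaa\xbb"
parity_rmw = xor(xor(parity, stripe[1]), new_block)

# Full-stripe write: all data blocks are in hand, so parity is computed
# from them alone with no reads at all.
stripe[1] = new_block
parity_full = xor(xor(stripe[0], stripe[1]), stripe[2])

assert parity_rmw == parity_full
```

Both paths give the same parity, but only the aggregated full-stripe path avoids the extra reads.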


13 RAID N Plus 2 Examples: RAID 6, Row-Diagonal Parity
- Let's look at one scheme
- Normal XOR parity is calculated straight across the disk blocks
- Diagonal parity is calculated on diagonals; there are other methods based on polynomials
- You need far more data on hand to do an efficient parity calculation
- This means you have to aggregate more data to get efficient writes (no read, update, write)

14 What do our applications do?

15 The Applications versus the Industry
- CPUs are not getting faster, so we are getting more CPUs
- Memory per processor is not going up appreciably; in some cases it is going down
- Therefore, apps are not going to issue larger writes (and writes are already too small for current storage systems)
- But RAID/disks require larger and larger write ops for efficiency

16 N-to-N example
Process 0: T_P0 P_P0 H_P0  ->  file0: T_P0 P_P0 H_P0
Process 1: T_P1 P_P1 H_P1  ->  file1: T_P1 P_P1 H_P1
Process 2: T_P2 P_P2 H_P2  ->  file2: T_P2 P_P2 H_P2

17 N-to-1 non-strided example
Process 0: T_P0 P_P0 H_P0
Process 1: T_P1 P_P1 H_P1
Process 2: T_P2 P_P2 H_P2
Single file: T_P0 P_P0 H_P0 | T_P1 P_P1 H_P1 | T_P2 P_P2 H_P2

18 N-to-1 strided example
Process 0: T_P0 P_P0 H_P0
Process 1: T_P1 P_P1 H_P1
Process 2: T_P2 P_P2 H_P2
Single file: T_P0 T_P1 T_P2 | P_P0 P_P1 P_P2 | H_P0 H_P1 H_P2
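The two shared-file layouts differ only in how a process computes its file offset. The sketch below assumes every process writes three equally sized variables (T, P, H) of `block` bytes each; the function names and the fixed block size are assumptions made for illustration.

```python
VARS = ["T", "P", "H"]

def nonstrided_offset(rank, var, n_procs, n_vars, block):
    """N-to-1 non-strided: each process owns one contiguous region."""
    return (rank * n_vars + var) * block

def strided_offset(rank, var, n_procs, n_vars, block):
    """N-to-1 strided: each variable is contiguous across all processes."""
    return (var * n_procs + rank) * block

def file_layout(offset_fn, n_procs, block):
    """Return the pieces of the shared file in offset order."""
    pieces = []
    for rank in range(n_procs):
        for var, name in enumerate(VARS):
            offset = offset_fn(rank, var, n_procs, len(VARS), block)
            pieces.append((offset, f"{name}_P{rank}"))
    return [name for _, name in sorted(pieces)]

print(file_layout(nonstrided_offset, n_procs=3, block=4096))
print(file_layout(strided_offset, n_procs=3, block=4096))
```

In the strided layout, consecutive writes from one process land `n_procs * block` bytes apart, so each process issues several small, scattered writes instead of one large contiguous write.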

19 N-to-1 strided evaluation
Advantages
- Simplest book-keeping for N-to-M restart: read each element contiguously, resplit for M
- Simplest formatting for visualization: visualization is typically only interested in a small number of variables and can read each contiguously
- Smaller number of files to manage: can help with archiving and with managing your data
Disadvantages
- Small, unaligned writes
- False sharing
- Read-modify-writes
Note: all advantages are relative to the user; all disadvantages are relative to the storage system.

20 Applications seem to want to use N-to-1 strided for convenience. For big writes this is not an issue, but small writes are problematic.


21 Microprocessor trends are changing
- Moore's law still holds but is now being realized differently
  - Clock frequency, chip power, and instruction-level parallelism (ILP) have all plateaued
  - Multi-core is here today and manycore ( 32 ) looks to be the future
  - Scalability of full complex, cache-based core designs to manycore designs is likely problematic
  - Memory bandwidth and memory capacity per core are headed downward (predominantly caused by increased core counts)
- Key findings of the January 2007 IDC study "Next Phase in HPC":
  - New ways of dealing with parallelism will be required
  - Must focus more heavily on bandwidth (flow of data) and less on processor
References: IDC report #205025, January 2007; UC Berkeley UCB/EECS; LASCI-06 Burton Smith keynote, "Reinventing Computing"
[Figure: from Burton Smith, LASCI-06 keynote, with permission; data from Herb Sutter, Microsoft.]
[Figure: Intel 80-core, from LA-UR Salishan RR Talk.]

22 A Disturbing Summary
- ASC-ratio-driven BW over-engineering is no more
  - You have to involve more disks to do the job
  - The number of disks needed to get the BW is going through the roof
- Rebuild times get worse and worse
- Plus 2 technologies don't really solve the reliability/rebuild problem
- It takes larger and larger write operations to be efficient
- Applications are having each process write less data as core counts go up

23 So, do we go home and call it quits?

24 Some Solutions
- Data aggregation in middleware
  - ROMIO's (existing) aggregation
  - Northwestern's work on DAChe and Persistent File Domains in MPI-IO (ROMIO)
- Per object parity

25 Per Object Parity: Enables scalable rebuild and helps address the UBER issue!
- Parity by object (not by disk)
- Different files are striped across different disks
- Scalable object rebuild
  - Gets the list of objects that are partially on the bad disk
  - Starts many parallel processes that, for each object, read the other related disks, recalculate the missing piece, write it to any disk in the system, and update the map
- Rebuild is scalable, not gated by a single disk write anymore
- The more rebuilding processes, the faster the rebuild
- Each rebuilder is gated by reading N drives and writing to any one drive
- Rebuild ratio: N reads to 1 write
- A UBER-based read error yields only one bad object, not an entire bad disk
[Figure: file system clients connected to object or disk servers; each object's pieces (Obj) and parity (P) sit on different servers.]
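The rebuild procedure described on this slide can be sketched in a few lines. Everything here is a toy model under stated assumptions: the in-memory object map, the eight-disk system, the 3+1 striping, and all names are hypothetical, and threads stand in for the parallel rebuild processes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

N_DISKS = 8
FAILED = 2

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def make_object(first_disk, chunks):
    """Stripe data chunks plus one parity block over consecutive disks."""
    blocks = chunks + [xor_blocks(chunks)]
    return {(first_disk + i) % N_DISKS: blk for i, blk in enumerate(blocks)}

# Object map: object name -> {disk number: block stored on that disk}.
objects = {
    "obj_a": make_object(0, [b"aaaa", b"bbbb", b"cccc"]),
    "obj_b": make_object(2, [b"dddd", b"eeee", b"ffff"]),
    "obj_c": make_object(4, [b"gggg", b"hhhh", b"iiii"]),
    "obj_d": make_object(7, [b"jjjj", b"kkkk", b"llll"]),
}

# Step 1: list the objects that are partially on the bad disk.
affected = [name for name, placement in objects.items() if FAILED in placement]
lost = {name: objects[name][FAILED] for name in affected}  # kept only to verify

def rebuild_one(name):
    """Read the object's surviving pieces, XOR, and pick any other disk."""
    placement = objects[name]
    survivors = [blk for disk, blk in placement.items() if disk != FAILED]
    target = next(d for d in range(N_DISKS)
                  if d != FAILED and d not in placement)
    return name, target, xor_blocks(survivors)

# Step 2: rebuild the affected objects in parallel, then update the map.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(rebuild_one, affected))

for name, target, block in results:
    del objects[name][FAILED]
    objects[name][target] = block
    assert block == lost[name]
```

Each rebuilt piece goes to a different target disk, so no single disk's write speed gates the whole rebuild, and a read error during rebuild would affect only the one object being processed.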

26 Some Solutions
- Per object parity: where?
  - At the disk
  - In the network
  - In the client
- RAID10: forget about parity, make two or three copies of the data

27 Per Object Parity + RAID10: Scalable Rebuild
- All the same as before, but the rebuild ratio is 1 to 1 (very nice and very simple)
[Figure: file system clients connected to object or disk servers, with object copies (Obj) on different servers.]

28 RAID10 versus RAID5 Results

29 RAID10 versus RAID5 Results

30 RAID10: are there any drawbacks?
- Sure, sending two copies of the data to disk:
  - Doubles the space needed
  - Cuts total BW in half
- But:
  - Scales well for all problems: N to N, N to 1 non-strided, and N to 1 strided (even small)
  - Simple, and that is important in a world with 100k disks
  - Helps with scalable rebuild, which is needed for faster rebuild
  - Helps with per object RAID, which helps the UBER problem
  - And besides that, we can use the density that we are getting more or less for free

31 Bottom line on reliability at scale!
- Had to get away from RAID5 due to rebuilds and number of disks
- RAID6 could work with lots of cache in front of it
- Because we will be implementing a scalable global parallel file system that is shared between all future clusters:
  - It will have a life well beyond any one cluster
  - It will become extremely large and have reliability and availability requirements well beyond any single cluster file system
- Per object RAID is the fundamental approach to solving our problem
  - Hurts small unaligned writes, which are becoming more prevalent
- We need scalable rebuild, which implies per object parity, done on the client
- RAID10 is simpler, and scales better
- But we need RAID10 to fix the N to 1 small strided application writing problem
- RAID10 seems to have addressed this and is the direction we are following
