RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures


RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures
Guangyan Zhang, Zican Huang, Xiaosong Ma, Songlin Yang, Zhufan Wang, Weimin Zheng
Tsinghua University; Qatar Computing Research Institute, HBKU

RAID with Large Disk Pools
- RAID (Redundant Array of Inexpensive Disks) is widely used
- Not designed to work directly on large arrays
- Example enclosures: HGST 4U60G2, EMC VMAX3, Dell PowerVault MD3060e

Existing Ways of Using Large Disk Pools (1): Integrating Homogeneous RAID Groups
- A 60-disk pool organized as 5+1 RAID-5 groups, either striped (RAID-50) or concatenated (RAID-5C)
- Pros: easy setup, deterministic addressing
- Cons: slow recovery, lack of flexibility

Existing Ways of Using Large Disk Pools (2): Building Heterogeneous RAID Groups
- Separate physical organization: e.g., 7+1 RAID-5 and 8+2 RAID-6 groups, plus hot spares
- Pros: easy setup, deterministic addressing, strong isolation
- Cons: low peak performance, slow recovery, lack of flexibility

Existing Ways of Using Large Disk Pools (3): Random Block Placement
- Separate logical organization, shared physical storage: logical blocks randomly mapped to disks
- Pros: near-balanced data distribution; flexible, adaptive setup
- Cons: more bookkeeping; imbalanced locally

RAID+: Guaranteed Uniform Distribution Using 3D Templates
- Logical (5+1)- or (6+2)-block stripes mapped onto the physical disk pool through 3D templates
- Deterministic addressing
- Uniform data distribution
- Flexible, adaptive setup
- Improved spatial locality

RAID+ 3D Template
- An n×(n-1) matrix storing disk IDs (e.g., disk IDs 22, 1, 38, 54, 13, 35 for one (5+1)-block RAID-5 stripe in a pool of size n)
- Deterministic mapping of k-block stripes to an n-disk pool, with k decoupled from n
- Properties (see paper for details and proofs):
  - Fault tolerance of RAID: no two blocks of a stripe placed on the same disk
  - Guaranteed load balance: uniform distribution of data blocks, for both application I/O and recovery
  - Ease of maintenance: deterministic, computable addressing
- Achieved by using Mutually Orthogonal Latin Squares (MOLS)

Introducing Latin Squares
- A Latin square of order n is an n×n array filled with n different symbols, each occurring exactly once in each row and each column
- Two Latin squares are orthogonal if, when superimposed, each of the n² ordered pairs appears exactly once
- Mutually Orthogonal Latin Squares (MOLS): a set of Latin squares that are pairwise orthogonal

Example (order 5):

  L1            L2            Superimposed
  1 2 3 4 5     1 2 3 4 5     1,1 2,2 3,3 4,4 5,5
  2 3 4 5 1     3 4 5 1 2     2,3 3,4 4,5 5,1 1,2
  3 4 5 1 2     5 1 2 3 4     3,5 4,1 5,2 1,3 2,4
  4 5 1 2 3     2 3 4 5 1     4,2 5,3 1,4 2,5 3,1
  5 1 2 3 4     4 5 1 2 3     5,4 1,5 2,1 3,2 4,3
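
Below is a minimal Python sketch (our illustration, not code from the paper) that checks both properties for the two order-5 squares shown above; the closed-form constructors for L1 and L2 are assumptions chosen to reproduce the example.

```python
# Minimal sketch: verify the Latin-square and orthogonality properties.

def is_latin_square(sq):
    """Every symbol appears exactly once in each row and each column."""
    n = len(sq)
    symbols = set(range(1, n + 1))
    rows_ok = all(set(row) == symbols for row in sq)
    cols_ok = all({sq[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok

def are_orthogonal(a, b):
    """Superimposing a and b yields all n^2 ordered pairs exactly once."""
    n = len(a)
    pairs = {(a[r][c], b[r][c]) for r in range(n) for c in range(n)}
    return len(pairs) == n * n

# These closed forms reproduce the two example squares above.
L1 = [[(r + c) % 5 + 1 for c in range(5)] for r in range(5)]
L2 = [[(2 * r + c) % 5 + 1 for c in range(5)] for r in range(5)]

assert is_latin_square(L1) and is_latin_square(L2)
assert are_orthogonal(L1, L2)
```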

Stacking MOLS into RAID+ Template
- Stack k order-n MOLS; the cell (x, y) across the stack gives the k disk IDs of one stripe. Dropping the row where all squares agree (x = 0, where a stripe's blocks would all land on one disk) leaves the n×(n-1) template of disk IDs, e.g., the disk IDs for each (5+1)-block RAID-5 stripe in an n-disk pool
- For the i-th square L_i: L_i[x, y] = (i·x + y) mod n
- Do we have enough MOLS?
  - The number of MOLS of a given order is an open question in general
  - n-1 MOLS exist when n is a power of a prime → these are the valid RAID+ pool sizes
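
The construction on this slide is small enough to sketch directly. The following Python snippet (our illustration, not the authors' implementation; the function names are ours) builds the squares from L_i[x, y] = (i·x + y) mod n for a prime pool size n and stacks k of them into the stripe-to-disks template.

```python
# Sketch of the slide's construction, assuming a prime pool size n and
# stripe width k <= n - 1.

def latin_square(n, i):
    """i-th Latin square of order n (n prime): L_i[x][y] = (i*x + y) mod n."""
    return [[(i * x + y) % n for y in range(n)] for x in range(n)]

def build_template(n, k):
    """Map each of the n*(n-1) stripes (x, y), x = 1..n-1, y = 0..n-1,
    to the k disks L_1[x][y], ..., L_k[x][y]."""
    squares = [latin_square(n, i) for i in range(1, k + 1)]
    return {(x, y): tuple(sq[x][y] for sq in squares)
            for x in range(1, n) for y in range(n)}

template = build_template(n=5, k=3)
print(template[(1, 0)])   # (1, 2, 3): the stripe labeled "a" on the next slides
```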

MOLS-based Addressing
- Addressing via L_i[x, y] = (i·x + y) mod n; example with n = 5 disks (D0..D4) and k = 3 blocks per stripe
- The k = 3 MOLS L_1, L_2, L_3 yield n(n-1) = 20 stripes (labeled a..t) and (n-1)·k = 12 blocks per disk
[Figure: the three stacked squares and the resulting placement of stripes a..t on disks D0..D4.]

MOLS-based Addressing (n = 5, k = 3)
Stripe-to-disk mapping (each stripe lists the disks holding its 3 blocks):
  a: 1 2 3   b: 2 3 4   c: 3 4 0   d: 4 0 1   e: 0 1 2
  f: 2 4 1   g: 3 0 2   h: 4 1 3   i: 0 2 4   j: 1 3 0
  k: 3 1 4   l: 4 2 0   m: 0 3 1   n: 1 4 2   o: 2 0 3
  p: 4 3 2   q: 0 4 3   r: 1 0 4   s: 2 1 0   t: 3 2 1
k = 3 MOLS give n(n-1) = 20 stripes and (n-1)·k = 12 blocks per disk (D0..D4).
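
As a sanity check on the n = 5, k = 3 example, the sketch below (ours, assuming the same (i·x + y) mod n construction) enumerates the 20 stripes and verifies the fault-tolerance and load-balance claims: no stripe touches a disk twice, and every disk holds exactly (n-1)·k = 12 blocks.

```python
# Verify the balance claims for the n=5, k=3 example above.
from collections import Counter

n, k = 5, 3
stripes = [tuple((i * x + y) % n for i in range(1, k + 1))
           for x in range(1, n) for y in range(n)]

assert len(stripes) == n * (n - 1)                  # 20 stripes
assert all(len(set(s)) == k for s in stripes)       # no disk repeated in a stripe
blocks_per_disk = Counter(d for s in stripes for d in s)
assert all(c == (n - 1) * k for c in blocks_per_disk.values())   # 12 blocks each
```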

Impact of Stripe Ordering
- Naïve stripe ordering (a, b, c, ...): a sequential read of 6 blocks skips the parity blocks and touches stripe a (disks 1, 2, 3), stripe b (disks 2, 3, 4), and stripe c (disks 3, 4, 0), giving the disk access sequence 1 2, 2 3, 3 4
- Consecutive stripes are highly overlapped on the same disks
[Figure: naïve ordering of stripes a, b, c on disks D0..D4; (n-1)·k = 12 blocks per disk.]

Sequential Access Friendly Stripe Ordering (I)
- Sequential-read-friendly ordering; intuition: let consecutive stripes' data blocks traverse all disks (ignoring parity blocks)
- R-friendly ordering a-c-e-b-d: the data blocks fall on disks 1 2, 3 4, 0 1, 2 3, 4 0, stepping round-robin through the pool
[Figure: read-friendly traversal over the stripe table.]

Sequential Access Friendly Stripe Ordering (II)
- Sequential-write-friendly ordering; intuition: let consecutive stripes' blocks traverse all disks (including parity blocks)
- W-friendly ordering a-d-b-e-c: the blocks fall on disks 1 2 3, 4 0 1, 2 3 4, 0 1 2, 3 4 0, stepping round-robin through the pool
[Figure: write-friendly traversal over the stripe table.]
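
One simple way to realize both orderings, under our reading of the slides (an assumption, not an algorithm stated by the authors): within a template row, advance the stripe index y by the number of blocks each stripe contributes to the traversal, i.e., k-1 data blocks for the read-friendly order and k blocks for the write-friendly order. With n prime, the step is coprime to n and visits every stripe in the row.

```python
# Our interpretation of the orderings above, not the authors' stated algorithm:
# step through y by (k - 1) for reads, by k for writes.

def stripe_order(n, step):
    """Visit y = 0, step, 2*step, ... (mod n); valid when gcd(step, n) == 1."""
    return [(j * step) % n for j in range(n)]

n, k = 5, 3
labels = "abcde"                                # stripes (x = 1, y = 0..4)
print("".join(labels[y] for y in stripe_order(n, k - 1)))   # acebd (read-friendly)
print("".join(labels[y] for y in stripe_order(n, k)))       # adbec (write-friendly)
```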

When Disk Failure Happens
- Each stripe sees any disk only once → replace the lost block's disk with the disk ID from the spare square L_4, which is still orthogonal to L_1, L_2, L_3
- Extended stripes (the three original disks plus the spare disk from L_4):
  a: 1 2 3 4   b: 2 3 4 0   c: 3 4 0 1   d: 4 0 1 2   e: 0 1 2 3
  f: 2 4 1 3   g: 3 0 2 4   h: 4 1 3 0   i: 0 2 4 1   j: 1 3 0 2
  k: 3 1 4 2   l: 4 2 0 3   m: 0 3 1 4   n: 1 4 2 0   o: 2 0 3 1
  p: 4 3 2 1   q: 0 4 3 2   r: 1 0 4 3   s: 2 1 0 4   t: 3 2 1 0
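
The repair rule is equally compact. The sketch below (ours, under the assumed (i·x + y) mod n construction with L_4 = L_{k+1} as the spare square) recomputes the replacement disk for every stripe that touched the failed disk and checks that the repaired stripe still uses k distinct disks.

```python
# Sketch of the repair rule above (assumes the (i*x + y) mod n construction,
# with L_4 serving as the spare square).
n, k = 5, 3

def disks_of(x, y, count):
    return [(i * x + y) % n for i in range(1, count + 1)]

failed = 0                                      # suppose disk D0 is lost
for x in range(1, n):
    for y in range(n):
        stripe = disks_of(x, y, k)
        if failed in stripe:
            spare = disks_of(x, y, k + 1)[-1]   # replacement entry from L_4
            repaired = [spare if d == failed else d for d in stripe]
            assert failed not in repaired and len(set(repaired)) == k
```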

RAID+ Interim Layout
- Normal layout: 5 disks × 12 blocks per disk
- Interim layout (after loss of 1 disk): the failed disk's blocks are spread over the survivors, giving 4 disks × 15 blocks per disk
[Figure: normal 5×12 layout of stripes a..t vs. interim 4×15 layout.]
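
Extending the previous sketch, the following check (ours, same assumptions) confirms the 4 × 15 interim balance: after relocating the failed disk's 12 blocks via the spare square, every survivor ends up with exactly 15 blocks.

```python
# Check the 4 x 15 interim balance for the n=5, k=3 example.
from collections import Counter

n, k, failed = 5, 3, 0
layout = Counter()
for x in range(1, n):
    for y in range(n):
        stripe = [(i * x + y) % n for i in range(1, k + 1)]
        if failed in stripe:
            spare = ((k + 1) * x + y) % n        # relocation target from L_4
            stripe = [spare if d == failed else d for d in stripe]
        layout.update(stripe)

assert all(layout[d] == 15 for d in range(n) if d != failed)   # 4 disks x 15 blocks
```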

RAID+ Recovery
- Guaranteed uniform load distribution: reads and writes spread evenly over all survivors
- Fast all-to-all data migration: copy in the background to the replacement/hot-spare disk
- Repeat for further failures, without hot spares (n >> k in most cases!)
- Pre-allocated recovery area: trading capacity for recovery performance
  - Optimized for 1-disk failure while maintaining RAID-level fault tolerance
  - Optimization for 2+ simultaneous failures via pre-allocated space
[Figure: interim layout after loss of 1 disk.]

Addressing Metadata Management
- Metadata only records template base addresses → small overhead with typical template sizes
- In-template addressing is easily computable; implemented in fewer than 20 lines of code
[Figure: RAID+ template with reserved recovery area.]
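
To illustrate how little code such addressing needs, here is a hypothetical sketch (our own layout conventions, not the authors' scheme: it assumes the (i·x + y) mod n construction and a plain (x, y) stripe enumeration rather than the sequential-access-friendly orderings) that maps a logical block number to a disk and an in-template offset.

```python
# Illustrative only: logical block -> (disk, per-disk offset), assuming stripes
# are enumerated in (x, y) order and block j of stripe (x, y) lives on disk
# ((j + 1) * x + y) mod n. Real RAID+ templates may order stripes differently.

def locate(block, n, k):
    blocks_per_template = n * (n - 1) * k
    template_idx, r = divmod(block, blocks_per_template)
    stripe_idx, j = divmod(r, k)                  # j-th block of the stripe
    x, y = 1 + stripe_idx // n, stripe_idx % n
    disk = ((j + 1) * x + y) % n
    # Per-disk offset inside this template: blocks of earlier stripes on 'disk'.
    offset = sum(1 for s in range(stripe_idx) for jj in range(k)
                 if ((jj + 1) * (1 + s // n) + s % n) % n == disk)
    return disk, template_idx * (n - 1) * k + offset

print(locate(0, n=5, k=3))   # block 0 -> disk 1, offset 0 (stripe "a")
```

Each disk receives exactly (n-1)·k blocks per template under this enumeration, so per-template offsets never collide and the cross-template stride is constant.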

Evaluation
- Implementation: MD (Multiple Devices) driver for software RAID, in Linux kernel 3.14.35
- Platform:
  - SuperMicro 4U storage server, 2× 12-core Intel E5-2650 processors, 128GB DDR4
  - 2 AOC-S3008L-L8i SAS JBOD adapters, each connected to a 30-bay SAS3 expander backplane via 2 channels
  - 60 Seagate Constellation 7200RPM 2TB HDDs
  - Aggregate I/O channel bandwidth: 24GB/s
- Workloads:
  - Synthetic: fio benchmark
  - 8 public block-level I/O traces, from MSR Cambridge and SPC
  - I/O-intensive applications: GridGraph, TPC-C, Facebook-like, MongoDB

Evaluation Methodology
- Systems (common base RAID setting: (6+1) RAID-5):
  - RAID-5C, RAID-50, partitioned RAID-5
  - Random placement: RAID_H (CRUSH-style, hashing-based random mapping) and RAID_R (pseudo-random-number-generation-based random mapping)
  - RAID+
- Metrics:
  - Recovery speed (online and offline), interference with application performance
  - Data redistribution after failures
  - Normal performance: synthetic and application workloads, single- and multi-workload
- More results in the paper

Reconstruction Performance: Offline Rebuild
- Two disk pool sizes: (28+3) and (56+3); data size per disk: 50GB
[Figure: bar chart of rebuild time (s) for RAID-50, RAID_R, RAID_H, and RAID+ at the two pool sizes; reported values range from 41s (RAID+) to 307s (RAID-50).]

Normal I/O Performance: Multi-workload
- Running 4-app mixes selected from 8 storage traces; evaluating all 70 mixes
- RAID+ brings an average 2.05 weighted speedup over partitioned RAID-5, vs. 1.83 for RAID_H
- Only 39 out of 280 runs experienced slowdown
[Figures: RAID+ speedup over partitioned RAID-5 over time (min), and sample load variation in I/Os over time (ms), for traces usr_1, prn_0, src1_1, and Fin2.]

Conclusions
- RAID+: a new RAID architecture, flexible and efficient
- Virtual RAID groups on large disk enclosures, enabled by Latin-square-based 3D templates
- Guaranteed uniform data distribution for both normal I/O and 1-disk-failure recovery I/O
- Deterministic addressing
- Utilizes all disks evenly while maintaining data locality
- Especially appealing for shared large disk enclosures: enterprise servers, cloud, and datacenters