CSWL: Cross-SSD Wear-Leveling Method in SSD-Based RAID Systems for System Endurance and Performance


Du YM, Xiao N, Liu F et al. CSWL: Cross-SSD wear-leveling method in SSD-based RAID systems for system endurance and performance. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 28(1), Jan. 2013.

Yi-Mo Du, Nong Xiao, Member, IEEE, Fang Liu, Member, CCF, and Zhi-Guang Chen

State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China

E-mail: yimodu@gmail.com; {nongxiao, liufang}@nudt.edu.cn; chenzhiguanghit@gmail.com

Received December 31, 2011; revised March 30, 2012.

Abstract    Flash memory has limited erasure/program cycles. Hence, to meet their advertised capacity all the time, flash-based solid state drives (SSDs) must prolong their life span through a wear-leveling mechanism. As a very important part of the flash translation layer (FTL), wear leveling is usually implemented in SSD controllers, which is called internal wear leveling. However, there is no wear leveling among SSDs in SSD-based redundant array of independent disks (RAID) systems, making some SSDs wear out faster than others. Once an SSD fails, reconstruction must be triggered immediately, but the cost of this process is so high that both system reliability and availability are affected seriously. We therefore propose cross-SSD wear leveling (CSWL) to enhance the endurance of entire SSD-based RAID systems. Under workloads with random access patterns, parity stripes suffer from many more updates because updating a data stripe causes the modification of all related parity stripes. Based on this principle, we introduce an age-driven parity distribution scheme to guarantee wear leveling among flash SSDs and thereby prolong the endurance of RAID systems. Furthermore, age-driven parity distribution benefits performance by maintaining better load balance. With insignificant overhead, CSWL can significantly improve both the life span and performance of SSD-based RAID.

Keywords    solid state drive, redundant array of independent disks, wear leveling, endurance

1 Introduction

Solid state drives (SSDs) exhibit higher speed and lower power consumption than disks. To some degree, SSDs alleviate the I/O bottleneck in computer systems by replacing hard disk drives (HDDs). In terms of compatibility, SSDs offer standard interfaces like HDDs do; they can use previously developed hardware and software based on HDDs. However, flash memory faces three critical technical constraints [1]: 1) the absence of in-place update, requiring that whole blocks be erased before overwriting data on any page; 2) the absence of random writing on pages, because the pages in a block must be programmed in a sequential rather than random order to ensure reliability; and 3) a life limit, causing blocks to wear out after a certain number of erasure/program cycles. To cope with these constraints, many strategies have been proposed. In this paper, we mainly focus on the third constraint, that is, the limited erasure/program cycles of flash memory. Almost all SSD products in the market adopt an internal wear-leveling scheme to ensure that all blocks in an SSD wear out evenly and that the SSD meets its advertised capacity. The over-provisioning of capacity has the same function of guaranteeing the SSD's advertised capacity, while it also serves garbage collection.
Internal wear leveling and the over-provisioning of capacity work together to prolong the life span of SSDs, although they cannot increase the total number of program cycles of all blocks.

The redundant array of independent disks (RAID) has become a very effective and popular method of constructing high-performance and reliable storage systems since its introduction in 1988 [2]. At first, RAID was used to increase storage capacity because the capacity of a single disk is too small. RAID uses the redundancy scheme to improve reliability and the striping scheme to promote throughput.

Regular Paper. Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2013AA013201 and the National Natural Science Foundation of China. The preliminary version of the paper was published in the Proceedings of NPC. ©2013 Springer Science + Business Media, LLC & Science Press, China.

With the use of inexpensive disks, RAID can help to construct large-scale, high-performance, and high-reliability storage systems. The very broad application area of SSDs has given rise to the idea of constructing storage systems using techniques that combine the advantages of the classic RAID and state-of-the-art SSDs.

SSDs employ internal wear-leveling strategies to prolong their lifetime while maintaining their advertised capacity. When SSDs fail to meet their advertised equivalent capacity, they can no longer satisfy user requirements. Hence, wear leveling is very necessary in SSDs. Existing HDD-based RAID controllers, which do not integrate a wear-leveling mechanism, cannot guarantee that all SSDs in RAID systems wear out evenly. When an SSD fails because of its life limit, too much time is needed to replace it and to reconstruct data using parity-based algorithms. To lengthen the life span of entire SSD-based RAID systems effectively, we propose a novel method adopting cross-SSD wear leveling (CSWL) among SSDs in RAID systems. When CSWL is used in RAID5, we call it CSWL-RAID5; when it is used in RAID6, we call it CSWL-RAID6. CSWL-RAID has the following three properties:

1) Age-driven parity distribution. RAID4 assigns parity to a unique device; RAID5 and RAID6 assign parity evenly, which means every device has the same fraction of parity. In contrast, CSWL-RAID dynamically distributes parity according to the age of devices. When the erased-number gap among some SSDs exceeds a previously assigned threshold, we reallocate parity by allocating more parity to younger SSDs and less parity to older SSDs.

2) Less replacement and reconstruction in the life cycle of entire RAID systems. By using CSWL in the entire SSD-based RAID system, every SSD is allocated part of the workload so that the entire RAID system, which consists of all devices, remains available for comparatively more time before approaching its rated life limit. Before that point, we are given enough time to back up all data to new devices and thereafter totally replace the old ones. Consequently, in the life cycle of entire SSD-based RAID systems, less replacement and reconstruction are needed compared with previous systems without wear leveling.

3) Optimized data layout and addressing method with age-driven parity distribution. The conventional RAID adopts a round-robin data layout [3], in which the mapping relationship can be represented through a simple function. Meanwhile, age-driven parity distribution makes addressing more complex. In this paper, we present and compare the original and optimized data layouts and addressing methods, which shows that the latter is much more effective.

The rest of this paper is organized as follows. Section 2 presents the motivation of this paper through an analysis of previous studies. Section 3 describes our proposed design and related algorithms in detail. Section 4 evaluates the CSWL method in RAID systems. Section 5 introduces related work. Section 6 concludes this paper and outlines future work.

2 Problem Description

2.1 Why Is CSWL Needed

SSDs have asymmetrical performance in responding to read and write requests. Hence, they are more suitable for read-intensive applications, such as the query systems of databases and search engines. As SSDs are more expensive than traditional HDDs, in large-scale data centers, SSD-based RAID systems often serve as caches for read-intensive applications in the critical I/O path.
SSDs employ internal wear leveling to meet their advertised life span. However, Diff-RAID [4] points out that a cross-SSD wear-leveling mechanism will lead to a high probability of correlated failures. Thus, Diff-RAID attempts to create and maintain age differences among SSDs to guarantee that at least some devices have lower bit error rates, and consequently to avoid a high correlated failure rate. This approach is useful provided that bit error rates increase gradually during the entire lifetime of flash chips. For Single-Level Cell (SLC) models, however, bit error rates do not have a linear relationship with age; they are almost zero until flash chips reach their rated lifetime. For most Multi-Level Cell (MLC) models, bit error rates increase sharply shortly after the rated lifetime of flash chips is reached, and some rates start to increase sharply even earlier. Before they hit their rated lifetime, flash chips can maintain comparatively stable bit error rates. In addition, with the correction of Error Correcting Codes (ECC), this increasing trend can be slowed down further [5]. To keep the age differences when the oldest SSD retires, Diff-RAID has to replace the retired SSD with a new one, and thereafter reconstruct data and redistribute parity.

We use (1) and (2) [6] to approximate the reliability of SSD-based RAID5 and RAID6 systems. In these two equations, N is the number of devices. Mean time to data loss (MTTDL) is used to measure system reliability, MTTF means the mean time to failure of a single device, and MTTR means the mean time to repair a failed device. From (1) and (2), we see that if the procedure of reconstructing data and redistributing parity is complex and very time-consuming, data loss is likely to occur, because any additional device failure beyond the number of tolerable failures during this window causes data corruption.

If we use wear leveling among SSDs, we can prolong the endurance of the entire system and thereby reduce the number of replacements, avoiding additional windows of vulnerability in reliability.

$$\mathrm{MTTDL} = \frac{\mathrm{MTTF}^2}{N(N-1)\,\mathrm{MTTR}}, \qquad (1)$$

$$\mathrm{MTTDL} = \frac{\mathrm{MTTF}^3}{N(N-1)(N-2)\,\mathrm{MTTR}^2}. \qquad (2)$$

Furthermore, as every update to a data stripe causes the modification of the related parity stripes, SSDs with a larger proportion of parity usually suffer from more updates than those with less parity. Hence, CSWL can benefit performance by maintaining a better load balance of write traffic, transferring parity from heavily loaded SSDs to lightly loaded ones.

2.2 Why Not Even Parity Distribution

The foregoing discussion shows that on most occasions, CSWL in SSD-based RAID systems is useful and necessary. Parity is the key factor affecting cross-SSD wear distribution because devices allocated more parity wear out faster. Thus, traditional RAID mechanisms adopting the even parity distribution scheme, such as RAID5 and RAID6, might be expected to work well from the aspect of system-level wear leveling. However, through experiments conducted with the trace-driven simulator described in [7], we find that RAID5 and RAID6 cannot ensure wear leveling among devices under some workloads. Fig.1 shows the wear distribution of RAID5 and RAID6 under different traces. The selected traces are described in Subsection 4.1. In the SSD simulator, we set a counter that is incremented whenever an erasure occurs. After running a trace, the total counter value on each device is taken as its wear degree.

Fig.1. Erased number of each SSD. (a) For RAID5. (b) For RAID6. Both RAID5 and RAID6 consist of four SSDs.

Fig.1 shows that RAID5 and RAID6 have comparatively even wear distributions only under the RealAppPC workload. Age shuffle exists under the other workloads because some workloads access certain parity more often than others; devices holding this parity suffer from more updates and wear out faster than the others.

2.3 Why Not Other Schemes

There are many studies on internal wear-leveling mechanisms, which have useful implications for real-world SSD applications [8]. However, research on cross-SSD wear leveling in RAID systems is still new. So far, the subject has been tackled only in [4] and [9]. The Diff-RAID presented in [4] was discussed in Subsection 2.1. A brief design of cross-SSD wear leveling for SSD-based RAID5 is given in [9]. The design uses a big table to store the erased number of stripes in each SSD. When parity hits a previously set number, the hot parity is exchanged with the cold parity using a greedy algorithm. This method costs much extra space, and the greedy algorithm is so complex that performance is restricted seriously. CSWL in RAID systems can balance the wear grade among devices based on parity distribution to prolong endurance and improve performance. In addition, it is simple to implement and has reasonable time and space costs.

3 CSWL-RAID

3.1 Basic Principle of CSWL

CSWL adopts age-driven parity distribution to make all SSDs in RAID systems wear out evenly. The underlying CSWL principle is based on the fact that parity suffers from more updates than common data because each data update results in corresponding updates to the related parity in the same stripe. Thus, we can change the wearing rate of some SSDs by dynamically adjusting the fraction of parity on them.
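To make this effect concrete, the following toy model (a sketch, not the simulator used later in this paper) estimates per-device write counts for uniformly random small writes, given an assumed per-device parity-placement fraction; the function name, parameters, and the simplifying assumption that data placement is uniform are ours.

```python
# A toy model (not the paper's simulator) illustrating why devices holding a
# larger fraction of parity absorb more write traffic under random small writes.
# All names and numbers here are illustrative assumptions.
import random

def simulate_small_writes(parity_fraction, n_devices=4, n_writes=100_000, seed=0):
    """parity_fraction[d] = share of all parity stripes stored on device d."""
    rng = random.Random(seed)
    writes = [0] * n_devices
    for _ in range(n_writes):
        # The updated data unit lands on a uniformly random device ...
        data_dev = rng.randrange(n_devices)
        writes[data_dev] += 1
        # ... and the corresponding parity update lands on a device chosen
        # according to how much parity each device holds.
        parity_dev = rng.choices(range(n_devices), weights=parity_fraction)[0]
        writes[parity_dev] += 1
    return writes

print(simulate_small_writes([1, 1, 1, 1]))   # even parity placement (RAID5-like)
print(simulate_small_writes([1, 1, 1, 3]))   # skewed parity: last device wears faster
```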

Fig.1 shows that under some workloads, adopting the even parity distribution scheme, as in the cases of RAID5 and RAID6, cannot completely guarantee wear leveling among SSD devices; it may even cause a significant imbalance in ages. To solve this problem, we propose CSWL. SSDs supply an interface for checking their age. Hence, we can call this kind of application programming interface (API) to obtain age information from all devices and thereafter decide on a parity distribution strategy by evaluating the wear situation. Fig.2 shows the process by which CSWL creates age balance among devices given huge age shuffles in RAID5 and RAID6 configurations under some workloads.

Fig.2. Basic principle of CSWL: from age shuffle to leveled wear.

In the case of RAID5, age shuffle is clearly visible under the Iozone workload. By periodically collecting age information from all SSDs, age shuffle is easily discovered. Once the age gap exceeds the expected value, the parity redistribution process is called: more parity is allocated to younger SSDs and less parity to older SSDs. The exact parity distribution fraction is determined according to the age of the devices (see Subsection 3.2). After the system runs for a while, the devices approach an even wear level. If the access pattern of the workload does not change so frequently as to exceed the set value, the original parity distribution strategy is maintained to avoid the high cost of parity redistribution. When the age gap exceeds the expected value again because of changes in application patterns or something else, parity redistribution must be called again. By adjusting the period and the expected value, we can precisely control the granularity of wear leveling. Because the wearing rate reflects the density of write requests a device handles, CSWL can also improve performance by ensuring better load balance for write traffic.

3.2 Practical Architecture of CSWL-RAID

As a kind of cross-SSD wear-leveling mechanism, CSWL can be embedded in parity-based RAID systems, such as RAID4, RAID5, and RAID6. RAID5 is the most widely used single-failure-tolerant RAID scheme. CSWL-RAID5 presents a trade-off between RAID4 and RAID5: it neither allocates parity entirely to one device (like RAID4 does) nor distributes parity to all devices evenly (like RAID5 does); it distributes parity dynamically according to the age of devices. RAID6 also adopts the even parity distribution scheme, but it has two groups of independently computed parity and can tolerate two device failures. Hence, CSWL can be embedded in these RAID systems, and they have similar architectures in practical implementation.

Fig.3 illustrates the architecture of CSWL-RAID. CSWL-RAID has two controllers: a RAID controller, which manages a group of running SSDs called active devices to offer service; and a migration controller, which is triggered when the entire system approaches the end of its lifetime to migrate data from active devices to prepared ones and replace the old devices with the prepared ones. After replacement, the prepared devices become the active ones and new devices are brought in as prepared ones. Because wear leveling among SSDs is incorporated into the RAID mechanism, all SSDs in the RAID system maintain a similar wear degree and can be totally replaced at one time. A sketch of this control loop is given below.
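The following sketch (in Python, with hypothetical device APIs such as `ssd.erase_count()` that are assumptions, not part of any real product or of the paper's implementation) outlines how the two controllers could cooperate: the RAID controller periodically polls device ages and triggers parity redistribution when the age gap exceeds the threshold, while the migration controller is activated only near the system's end of life.

```python
import statistics
import time

AGE_VARIANCE_THRESHOLD = 2.0      # CA in the paper; the value here is an assumption
LIFE_LIMIT = 100_000              # rated erase cycles per block; illustrative
CHECK_PERIOD_S = 3600             # polling period; illustrative

def control_loop(active_ssds, raid_controller, migration_controller):
    """Hypothetical monitoring loop for CSWL-RAID (a sketch, not the authors' code)."""
    while True:
        # 1. Poll each SSD's age through its (assumed) query interface.
        ages = [ssd.erase_count() for ssd in active_ssds]

        # 2. Near the rated life limit, hand over to the migration controller,
        #    which copies data to the prepared devices while the system is idle.
        if max(ages) >= 0.95 * LIFE_LIMIT:
            migration_controller.migrate(active_ssds)
            break

        # 3. If the age gap is too large, recompute the parity distribution
        #    (more parity to younger SSDs) and shift parity accordingly.
        if statistics.pvariance(ages) > AGE_VARIANCE_THRESHOLD:
            raid_controller.redistribute_parity(ages)

        time.sleep(CHECK_PERIOD_S)
```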
If any SSD fails before its life limit, the corresponding prepared SSD is used for replacement at once and the data reconstruction process is triggered.

Fig.3. Architecture of CSWL-RAID. Prepared devices in the dashed-line frame are not in the system all the time; they are plugged in only when needed.

Both the control flow and the data flow can be seen in Fig.3. The RAID controller administrates the active devices below it and connects with the migration controller to activate it when the active devices nearly approach their life limits. Thereafter, the migration controller begins to migrate data from the active devices to the prepared ones. In most cases, data flows merely through the RAID controller to supply service to users.

The RAID controller implements the basic RAID mechanism and our proposed parity distribution schemes. The migration controller does not need much complex hardware because it only provides the migration function, which copies data from old devices to the same addresses on new ones. Migration usually proceeds when the system is idle to avoid competition with I/O requests. Thus, the migration overhead can remain hidden and transparent to users.

3.3 Data Layout

For integrity and continuity of our description, we first comprehensively describe the CSWL method in RAID5 cases, where the failure of one device is tolerable. We focus on its data layout and addressing method. After explaining the CSWL process, we extend it to RAID6. Similar to Diff-RAID, CSWL-RAID employs a dynamic age-driven parity distribution strategy [4]. We describe the relationship between parity distribution and age distribution quantitatively. Given the age of a RAID system (which can tolerate one device failure) consisting of n SSDs, represented by the n-tuple column vector $A = (A_1, A_2, \ldots, A_n)$, we compute the variance S to evaluate the age difference. If S exceeds the critical value (CA), the process of parity redistribution must be called to maintain a similar wear grade across the entire RAID system. The parity distribution, represented by $P = (P_1, P_2, \ldots, P_n)$, is derived from the age distribution in two steps: 1) sort $A_1, A_2, \ldots, A_n$ in descending order; and 2) let $P_k$ equal the k-th value of this descending sequence, so that, overall, younger SSDs receive larger parity fractions.

Fig.4 presents the basic data layout of CSWL-RAID5, where DN is the device number identifying each device. At first, we suppose that the devices are new and their age distribution is (1, 1, 1, 1), as shown in Fig.4(a). For wear leveling, we make the parity distribution (1, 1, 1, 1); this is actually the RAID5 scheme, assigning parity evenly across all devices. However, as discussed in Subsection 2.2, RAID5 cannot completely ensure wear leveling among SSDs. After a running period, an age gap among SSDs appears; the age distribution becomes (3, 3, 3, 1). The age difference can be described in terms of the variance (S = 3). If S exceeds CA, parity redistribution must be called to mitigate the age difference: more parity is allocated to younger devices and less parity to older devices. Fig.4(b) illustrates the data layout of the new parity distribution (1, 1, 1, 3) made according to the age distribution. After another running period, the age distribution becomes (2, 2, 1, 1) with S = 1. Compared with the last variance, the gap significantly decreases. Suppose S still exceeds CA; then parity redistribution must be called again. Fig.4(c) shows the data layout under the corresponding parity distribution. A short sketch of this age-to-parity computation follows Fig.4.

Fig.4. Basic data layout of CSWL-RAID5. (a) Data layout under parity distribution (1, 1, 1, 1). (b) Data layout under parity distribution (1, 1, 1, 3). (c) Data layout under parity distribution (1, 1, 2, 2).
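The paper states the two-step age-to-parity rule tersely, so the sketch below is one possible reading chosen to reproduce the worked example of Fig.4; the function names, the handling of ties, and the threshold value are our assumptions, not the authors' implementation.

```python
# Sketch of the age-driven parity redistribution described above (one reading
# that reproduces the Fig.4 example; names and details are assumptions).

def age_spread(ages):
    """S as quoted in the text: the values S = 3 for (3, 3, 3, 1) and S = 1 for
    (2, 2, 1, 1) match the sum of squared deviations from the mean."""
    m = sum(ages) / len(ages)
    return sum((a - m) ** 2 for a in ages)

def next_parity_distribution(ages, critical_value):
    """Return a new parity-fraction vector, or None if S does not exceed CA."""
    if age_spread(ages) <= critical_value:
        return None
    # Mirror the sorted distinct age values, so older devices receive smaller
    # parity fractions and younger devices receive larger ones.
    distinct = sorted(set(ages), reverse=True)        # e.g. [3, 1]
    mirror = dict(zip(distinct, reversed(distinct)))  # e.g. {3: 1, 1: 3}
    return [mirror[a] for a in ages]

print(next_parity_distribution([3, 3, 3, 1], critical_value=2))    # [1, 1, 1, 3], Fig.4(b)
print(next_parity_distribution([2, 2, 1, 1], critical_value=0.5))  # [1, 1, 2, 2], Fig.4(c)
```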
In Fig.4, the data layout adopts the round-robin striping scheme, which has a simple addressing policy but causes huge migration once the parity distribution scheme is changed. Fig.5 presents an improved data layout. With this data layout, every parity redistribution operation brings only a small number of shifts between data and parity. The procedure of parity shift from an original distribution to a new distribution is as follows:

1) Compute the region number. The region number is the least common multiple of the last region number and the sum of the fractions in the new distribution; the first region number is the sum of the fractions in the original parity distribution. From Fig.5(a) to Fig.5(b), we compute the least common multiple (LCM) of 4 (the sum of (1, 1, 1, 1)) and 6 (the sum of (1, 1, 1, 3)) to obtain a region number equal to 12.

2) Amplify the fractions in the parity distribution according to the region number. The parity distribution is changed from (1, 1, 1, 1), shown in Fig.4(a), to (3, 3, 3, 3), shown in Fig.5(a), and from (1, 1, 1, 3), shown in Fig.4(b), to (2, 2, 2, 6), shown in Fig.5(b).

3) Exchange the parity and data blocks in the corresponding area according to the newly computed parity distribution.

Compared with the basic data layout, the improved data layout migrates less data. Although it cannot guarantee that parity is distributed according to age at the finest grain, it is quite uniform from the perspective of the whole layout.

Fig.5. Improved data layout of CSWL-RAID. (a) Data layout under parity distribution (1, 1, 1, 1). (b) Data layout under parity distribution (1, 1, 1, 3). (c) Data layout under parity distribution (1, 1, 2, 2).

3.4 Addressing Method

The key to implementing CSWL-RAID is designing the mapping mechanism between logical and physical addresses in the controller. When the controller receives an I/O request, it uses the striping scheme to partition the data into several parts. Thereafter, it sends data and parity to the related devices according to the given mapping relationship. The round-robin placement scheme is popularly used in RAID systems. In this method, the data layout is ascertained beforehand; hence, any logical block address can be mapped to a physical address easily through a function, without lookup. However, this method lacks flexibility. The other method, which is seldom used, is the adoption of a mapping table. It is more flexible, but it leads to high time and space costs. Usually, dynamic parity distribution, such as in the case of CSWL-RAID, needs a very flexible mapping data structure like a mapping table. In this paper, we use the former method, which still meets our requirements.

The traditional RAID scheme that can tolerate the failure of one device distributes parity either to a dedicated device (like RAID4 does) or across all devices evenly (like RAID5 does). From their different data layouts, we can compute the physical address using a linear function. The addressing function of RAID4 can be summarized as:

$$SN = \lfloor LBA/(N-1) \rfloor, \quad PN = N-1, \quad DN = LBA \bmod (N-1), \qquad (3)$$

where LBA is the logical block address of the data unit after partitioning; N is the number of devices, including data devices and parity devices; SN is the stripe group number, representing the stripe group allocated for the data; PN is the number of the device that stores the parity related to the current data; and DN is the number of the device that stores the current data. The addressing function of RAID5 is shown as (4).

Under different time periods, CSWL-RAID5 has different parity distributions because of the age differences among devices. The age of a device can be denoted by the average age of all its blocks. The addressing function of RAID5 is:

$$SN = \lfloor LBA/(N-1) \rfloor, \quad PN = SN \bmod N, \quad DN = \begin{cases} LBA \bmod (N-1) + 1, & \text{if } LBA \bmod (N-1) \geq PN, \\ LBA \bmod (N-1), & \text{if } LBA \bmod (N-1) < PN. \end{cases} \qquad (4)$$

If we use the data layout shown in Fig.4, the addressing function can be stated as:

$$SN = \lfloor LBA/(N-1) \rfloor,$$
$$PN = \begin{cases} 0, & \text{if } 0 \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 - 1, \\ 1, & \text{if } p_1 \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 + p_2 - 1, \\ \quad\vdots & \\ N-1, & \text{if } p_1 + p_2 + \cdots + p_{n-1} \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 + p_2 + \cdots + p_n - 1, \end{cases}$$
$$DN = \begin{cases} LBA \bmod (N-1) + 1, & \text{if } LBA \bmod (N-1) \geq PN, \\ LBA \bmod (N-1), & \text{if } LBA \bmod (N-1) < PN. \end{cases} \qquad (5)$$

We have pointed out the drawback of the data layout shown in Fig.4 and presented an improved data layout to reduce the amount of data migration. With the improved data layout, we can use the algorithm depicted in Algorithm 1 (Fig.6) to solve the addressing problem. This addressing algorithm is run after redistribution is completed according to the layout presented in the last subsection. In the algorithm, the parity redistribution history and the region history must be recorded as permanent variables. SN can be computed directly, as in all previous RAID mechanisms. Because parity and data are exchanged when parity redistribution occurs, a single function cannot meet the requirements of addressing PN and DN. Hence, we keep all region history and parity distribution history to obtain the current physical addresses of parity and common data through the algorithm described in Algorithm 1.

Algorithm 1. Address(LBA, t, p[t][N], region[t], SN, PN, DN)
Input:
  LBA: logical block address
  t: redistribution times
  p[t][N]: parity distribution history
  region[t]: region history
Output:
  SN: stripe group number
  PN: parity device number
  DN: data device number
1.  SN = LBA / (N - 1)
2.  if (t == 0)  // no parity redistribution has happened
3.      PN = SN % N
4.      if (LBA % (N - 1) >= PN) then
5.          DN = LBA % (N - 1) + 1
6.      else
7.          DN = LBA % (N - 1)
8.  else
9.      mmn = least_common_multiple(region[t], region[t - 1])
10.     x = mmn / region[t - 1], y = mmn / region[t]
11.     for (j = 0; j < N; j++)  // amplify the fractions in the parity distributions
12.         d[j] = x * p[t - 1][j] - y * p[t][j]
13.     for (j = 0; j < N; j++)
14.         if (d[j] > 0)
15.             for (k = N - 1; k >= 0; k--)  // exchange data and parity
16.                 if (d[k] < 0)
17.                     if (PN == j)
18.                         PN = k
19.                     if (DN == k)
20.                         DN = j
21.     Address(LBA, t - 1, p[t - 1][N], region[t - 1], SN, PN, DN)

Fig.6. Algorithm 1.

There is another way of dealing with the addressing problem: we can use a table to record the devices to which the data and the correlated parity should be sent. By looking up the table, we can get the physical address of any logical block. However, the lookup operation may cost too much time, and the table must be large enough to accommodate all mapping relationships, putting extreme space pressure on the RAID controller. What we propose is more effective than this mapping-table method. The number of redistributions is small because redistribution is called only when the age difference exceeds the critical value, which is usually set fairly high. The algorithm described in Fig.6 costs little time and saves space because it involves no complex data structure. A Python sketch of the basic-layout addressing functions (4) and (5) follows.
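To make the addressing functions concrete, the sketch below implements (4) and (5) under the basic (round-robin) data layout. It is an illustrative reading, not the controller code; the function names are ours, and the improved-layout resolution of Algorithm 1 is not reproduced here.

```python
# Illustrative Python reading of addressing functions (4) and (5) for the
# basic data layout; the function names are ours, not the paper's controller code.

def raid5_address(lba, n):
    """Standard RAID5 addressing, as in (4): N devices, one parity per stripe."""
    sn = lba // (n - 1)                    # stripe group number
    pn = sn % n                            # device holding the parity
    off = lba % (n - 1)
    dn = off + 1 if off >= pn else off     # skip over the parity device
    return sn, pn, dn

def cswl_raid5_address(lba, n, p):
    """CSWL-RAID5 addressing with the basic layout, as in (5).
    p = (p_1, ..., p_n) is the current parity distribution: device i holds the
    parity of p[i] stripes out of every sum(p) consecutive stripe groups."""
    sn = lba // (n - 1)
    r = sn % sum(p)                        # position inside one parity region
    pn, acc = 0, 0
    for i, frac in enumerate(p):           # device whose parity slice covers r
        if acc <= r <= acc + frac - 1:
            pn = i
            break
        acc += frac
    off = lba % (n - 1)
    dn = off + 1 if off >= pn else off
    return sn, pn, dn

# Example: 4 SSDs with parity distribution (1, 1, 1, 3) from Fig.4(b).
print(cswl_raid5_address(10, 4, (1, 1, 1, 3)))
```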

3.5 Extending CSWL to RAID6

3.5.1 Basic Principle of CSWL-RAID6

We have studied the use of CSWL in RAID systems that can tolerate the failure of one device. In this section, we examine whether CSWL can be extended to RAID6, where the failure of two devices is tolerable. RAID6 adopts two kinds of independent parity-computing algorithms that support data recovery when any two devices fail simultaneously. It has two types of parity information, which are represented by P and Q in this paper. The fraction of parity P, denoted by vector p, is associated with the fraction of parity Q, denoted by vector q, as follows:

$$(q_1, q_2, \ldots, q_n) = (p_n, p_1, \ldots, p_{n-1}). \qquad (6)$$

For convenience of describing CSWL in RAID6, we use the case previously utilized for RAID5. Given the age distribution $A = (A_1, A_2, \ldots, A_n)$, we compute the total parity distribution vector $P = (P_1, P_2, \ldots, P_n)$ as before. The parity distribution of RAID6 consists of two parts, and we use (7) below to express their association. Because the elements of vector P are relatively prime, k in the equation is a positive integer:

$$(p_1 + q_1, p_2 + q_2, \ldots, p_n + q_n) = k\,(P_1, P_2, \ldots, P_n). \qquad (7)$$

Combining (6) and (7) yields the following linear equations with n unknowns:

$$\begin{cases} p_1 + p_n = k P_1, \\ p_2 + p_1 = k P_2, \\ \quad\vdots \\ p_n + p_{n-1} = k P_n. \end{cases} \qquad (8)$$

We assume that the coefficient matrix of (8) is represented by M; p is the solution and P is our expected total parity distribution computed according to the age of the devices. Then (8) can be written compactly as:

$$M p = k P. \qquad (9)$$

There is the possibility that (9) does not have a solution. Only when the rank of the coefficient matrix equals the rank of its augmented matrix is there a valid solution. We can determine whether the system has a solution by checking whether the following equation holds:

$$\mathrm{rank}(M) = \mathrm{rank}(M, P). \qquad (10)$$

If (9) does not have a solution, we cannot obtain the parity distribution p; this means that the total parity distribution dictated by the age situation does not match the association represented by (6). When the equation has a solution, the parity distribution can be presented exactly. When the equation has no solution, we should find an approximate solution that makes Mp approach kP as closely as possible. We are thus given the problem of finding the vector that minimizes the following function, where F(p) is the squared 2-norm of the vector Mp − kP:

$$F(p) = \| M p - k P \|^2. \qquad (11)$$

Finding the parity distribution can be seen as finding the solution vector p that minimizes F(p), under the restriction that all elements of p are non-negative integers. This is an open problem in mathematics, and some algorithms have been proposed to offer solutions [10]. When the minimum value of F(p) is zero, (9) has a solution. Given that the matrix M has a small number of rows, usually fewer than 10, computing the parity distribution takes little time. After the parity distribution p is computed, the distribution q of parity Q can be computed using (6).

3.5.2 Data Layout of CSWL-RAID6

Figs. 7 and 8 present the basic data layout and the improved data layout of CSWL-RAID6, respectively. The data layout of CSWL-RAID6 is similar to that of CSWL-RAID5. The only difference is that every exchange of parity P causes an exchange of the related parity Q, so we simplify the illustration by presenting parity redistribution only once. We assume that the age distribution is (3, 3, 3, 1), as described in Subsection 3.3, so the parity distribution is (1, 1, 1, 3), computed through the method presented in Subsection 3.3. A numerical sketch of solving (11) for this case is given below.
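The following snippet sketches the computation just described: build the circulant matrix M of (8), solve min ||Mp − kP||² with p ≥ 0 (a continuous relaxation of the non-negative integer problem in the text), and derive q from p via (6). The choice k = 1, the use of non-negative least squares, and the final rounding are our assumptions, so the result need not match the numbers quoted in the text.

```python
# Numerical sketch of the CSWL-RAID6 parity split; assumptions: k = 1, NNLS as
# the solver, and rounding of the continuous solution to convenient integers.
import numpy as np
from scipy.optimize import nnls

def raid6_parity_split(P, k=1):
    """Split a total parity distribution P into per-device fractions (p, q)."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    # (Mp)_i = p_i + p_{i-1} (indices cyclic), matching the left-hand side of (8).
    M = np.eye(n) + np.roll(np.eye(n), -1, axis=1)
    p, residual_norm = nnls(M, k * P)     # non-negative least squares for (11)
    q = np.roll(p, 1)                     # (6): q_1 = p_n, q_i = p_{i-1}
    return p, q, residual_norm

p, q, res = raid6_parity_split([1, 1, 1, 3])
print(np.round(p, 2), np.round(q, 2), round(res, 3))
# The text quotes p ~ (0.34, 0.15, 1.35, 1.54) for this case, later scaled to
# the integers (2, 1, 8, 9); a different solver or k can give a different minimizer.
```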
Setting P = (1, 1, 1, 3) and substituting it into (11), we get the parity distribution (0.34, 0.15, 1.35, 1.54), which we scale to approximately (2, 1, 8, 9) for convenience of computing. The resulting total parity distribution is approximately proportional to (4, 1, 3, 6), which is slightly different from the expected parity distribution (1, 1, 1, 3). Although it cannot always completely achieve our expectation, it is, to some degree, the closest to it. Furthermore, when (9) has a solution, the result is exactly our expected parity distribution. Thus, as the number of parity redistributions increases, the wear-leveling effect is finally achieved.

3.5.3 Addressing Method of CSWL-RAID6

Before discussing the addressing method of CSWL-RAID6, we first present the addressing method of the standard RAID6:

$$SN = \lfloor LBA/(N-2) \rfloor, \quad PN = SN \bmod N, \quad QN = (PN + 1) \bmod N,$$
$$DN = \begin{cases} LBA \bmod (N-2) + 2, & \text{if } LBA \bmod (N-2) \geq PN, \\ LBA \bmod (N-2) + 1, & \text{if } QN \leq LBA \bmod (N-2) < PN, \\ LBA \bmod (N-2), & \text{if } LBA \bmod (N-2) < \min\{PN, QN\}. \end{cases} \qquad (12)$$

By analogy with the data layout, the addressing method of CSWL-RAID6 is also similar to that of CSWL-RAID5. With the basic data layout, the addressing function of CSWL-RAID6 is given by (13) below.

Fig.7. Basic data layout of CSWL-RAID6. (a) Data layout under parity distribution (1, 1, 1, 1). (b) Data layout under parity distribution (2, 1, 8, 9).

Fig.8. Improved data layout of CSWL-RAID6. (a) Data layout under parity distribution (1, 1, 1, 1). (b) Data layout under parity distribution (2, 1, 8, 9).

$$SN = \lfloor LBA/(N-2) \rfloor,$$
$$PN = \begin{cases} 0, & \text{if } 0 \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 - 1, \\ 1, & \text{if } p_1 \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 + p_2 - 1, \\ \quad\vdots & \\ N-1, & \text{if } p_1 + p_2 + \cdots + p_{n-1} \leq SN \bmod (p_1 + p_2 + \cdots + p_n) \leq p_1 + p_2 + \cdots + p_n - 1, \end{cases}$$
$$QN = (PN + 1) \bmod N,$$
$$DN = \begin{cases} LBA \bmod (N-2) + 2, & \text{if } LBA \bmod (N-2) \geq PN, \\ LBA \bmod (N-2) + 1, & \text{if } QN \leq LBA \bmod (N-2) < PN, \\ LBA \bmod (N-2), & \text{if } LBA \bmod (N-2) < \min\{PN, QN\}. \end{cases} \qquad (13)$$
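As with the RAID5 case, a short Python reading of (12) and (13) is sketched below; the helper names are ours and the code is illustrative only, not the paper's controller logic.

```python
# Illustrative reading of the RAID6 addressing in (12) and (13).

def _data_device(off, pn, qn):
    # Skip the stripe's two parity devices when locating the data device.
    if off >= pn:
        return off + 2
    if qn <= off < pn:
        return off + 1
    return off                      # off < min(pn, qn)

def raid6_address(lba, n):
    """Standard RAID6, as in (12): rotating P parity, Q on the next device."""
    sn = lba // (n - 2)
    pn = sn % n
    qn = (pn + 1) % n
    return sn, pn, qn, _data_device(lba % (n - 2), pn, qn)

def cswl_raid6_address(lba, n, p):
    """CSWL-RAID6 with the basic layout, as in (13); p is the P-parity distribution."""
    sn = lba // (n - 2)
    r = sn % sum(p)
    pn, acc = 0, 0
    for i, frac in enumerate(p):    # device whose P-parity slice covers this stripe
        if acc <= r <= acc + frac - 1:
            pn = i
            break
        acc += frac
    qn = (pn + 1) % n
    return sn, pn, qn, _data_device(lba % (n - 2), pn, qn)

print(raid6_address(5, 4), cswl_raid6_address(5, 4, (2, 1, 8, 9)))
```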

The addressing method of CSWL-RAID6 with the improved data layout is described in Algorithm 2 (Fig.9). It is similar to Algorithm 1, with some differences: because RAID6 has two kinds of parity, we also need to consider the second parity (i.e., Q) when computing the stripe number and when exchanging parity and data.

Algorithm 2. Address(LBA, t, p[t][N], region[t], SN, PN, QN, DN)
Input:
  LBA: logical block address
  t: redistribution times
  p[t][N]: parity distribution history
  region[t]: region history
Output:
  SN: stripe group number
  PN: parity p device number
  QN: parity q device number
  DN: data device number
1.  SN = LBA / (N - 2)
2.  if (t == 0)  // no parity redistribution has happened
3.      PN = SN % N
4.      QN = (PN + 1) % N
5.      if (LBA % (N - 2) >= PN) then
6.          DN = LBA % (N - 2) + 2
7.      else if (LBA % (N - 2) < PN && LBA % (N - 2) >= QN) then
8.          DN = LBA % (N - 2) + 1
9.      else
10.         DN = LBA % (N - 2)
11. else
12.     mmn = least_common_multiple(region[t], region[t - 1])
13.     x = mmn / region[t - 1], y = mmn / region[t]
14.     for (j = 0; j < N; j++)  // amplify the fractions in the parity distributions
15.         d[j] = x * p[t - 1][j] - y * p[t][j]
16.     for (j = 0; j < N; j++)
17.         if (d[j] > 0)
18.             for (k = N - 1; k >= 0; k--)  // exchange data and parity
19.                 if (d[k] < 0)
20.                     if (PN == j)
21.                         PN = k
22.                         QN = (k + 1) % N
23.                     if (DN == k)
24.                         DN = j
25.     Address(LBA, t - 1, p[t - 1][N], region[t - 1], SN, PN, QN, DN)

Fig.9. Algorithm 2.

4 Evaluation

4.1 Experimental Setup

Our experimental environment is a PC with a 2.33 GHz Intel Core 2 Quad Q8200 CPU and 4 GB of main memory. The operating system is Ubuntu 8. We measure the performance, endurance, and reliability of CSWL-RAID compared with other SSD-based RAID systems. When a flash-based SSD is sold, the vendor hides the characteristics of accessing flash memory as well as the internal implementation mechanism, so we cannot fully confirm that the wear situation reported by the API is really computed from the average erased number of all blocks. That is why we developed a trace-driven SSD simulator [7] extended from DiskSim; it supplies the response time of each I/O and can report the age of each block. To compare the performance and endurance of CSWL-RAID and other RAID systems, we implement these RAID schemes in software RAID mode. Thereafter, we run traces collected from two standard benchmarks (Iozone and Postmark) and two real applications (a PC under real workloads, named RealAppPC, and servers providing a common FTP application, named RealAppSE) to evaluate CSWL completely.

Iozone generates and measures various file operations, including read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read/write, pread/pwrite variants, aio_read, aio_write, and mmap. This workload has many sequential writes. Postmark is a widely used I/O subsystem benchmark. It creates 100 directories and files, performs transactions (reads and writes) to stress the file system, and finally deletes the files. This workload features small but intensive random data accesses.

Parallel data project: DiskSim. IOzone filesystem benchmark. FreshPorts: Postmark.

RealAppPC is collected from a PC running real applications, such as Web and e-mail, for 72 hours. RealAppSE is collected from our school's FTP server, which provides 24-hour daily FTP service.

Through the simulation experiments, we obtain the average latency, which is used as the performance metric. After running the traces, we check the average erased number of each block in each SSD, which represents the endurance of the entire RAID system. For reliability, we adopt a mathematical model to analyze the MTTDL, which is a common reliability criterion.

4.2 Performance of CSWL-RAID

The performance experiment is conducted in a state where redistribution is completed and the data layout is stable. The time cost of the addressing method under the improved data layout, which is not very long, is positively related only to the parity redistribution history, so we can ignore this extra time cost on the part of the software RAID. Fig.10 shows the average latency of different RAID systems under different kinds of workloads.

Fig.10. Average latency of different RAID systems under various traces.

From Fig.10 we can see the following performance characteristics. Firstly, RAID6, including standard RAID6 and CSWL-RAID6, is clearly outperformed by RAID5, including standard RAID5, Diff-RAID, and CSWL-RAID5. The difference in performance is accounted for by the two kinds of parity in RAID6, which lead to additional update costs. Secondly, CSWL is very useful in reducing average latency in RAID systems, both in RAID5 and in RAID6. CSWL-RAID5 is better than the two other systems; it outperforms RAID5 by about 10% and Diff-RAID by about 30%, sometimes even by 40%. A device with a higher wear grade suffers from more writes, so transferring some parity from older devices to younger devices balances the write workload to some degree. This explains why CSWL-RAID outperforms RAID5 and RAID6, even though RAID5 and RAID6 distribute parity absolutely evenly. Thirdly, although the CSWL method performs a little worse in RAID6 than in RAID5, because it cannot precisely control the parity distribution there, it still improves performance by approximately 15% thanks to age-driven parity distribution. Fourthly, Diff-RAID distributes more parity to older devices to create an age gap and improve reliability. However, this makes older devices, which have already been responsible for many requests, undertake more incoming requests. This aggravates the load imbalance among SSDs, leading to a bottleneck on some devices. Consequently, the performance of Diff-RAID is severely affected, making it the worst among the three RAID5-class systems.

Parity redistribution is necessary in CSWL-RAID and Diff-RAID. However, this procedure costs so much time that performance can decrease remarkably. Diff-RAID does not provide a detailed implementation of parity redistribution; without any improvement on parity migration, it will incur too much time. Fig.11 displays the parity redistribution time of CSWL-RAID under the different data layouts, with and without optimization. Because the parity redistribution time in RAID6 is about three times that in RAID5, we plot the redistribution time in two separate figures to clearly show the difference between the basic data layout and the improved data layout; the improved layout has much smaller overhead.

Fig.11. Redistribution time of (a) CSWL-RAID5 and (b) CSWL-RAID6 with the unimproved and improved data layouts under various traces.

In Subsection 3.2, we presented the architecture of CSWL-RAID. When the active devices approach their life limits, migration must be triggered. For a good user experience, migration usually proceeds when the system is idle; thus, the migration overhead can remain hidden and transparent to users. Moreover, migration happens only a few times, so we do not evaluate the migration cost separately here.

4.3 Endurance of CSWL-RAID

After running each trace, we collect the erased number of each block in each SSD. We use the average erased number over all blocks in an SSD to represent the age of that SSD. Fig.12 displays the age difference of devices under different RAID schemes and different traces.

Fig.12. Age difference of devices under different RAID schemes and various traces.

We use the standard deviation of the erased numbers of all SSDs to evaluate the age difference among devices. RAID5 and RAID6 have a comparatively even wear distribution only under the RealAppPC workload, as discussed in Subsection 2.2. Hence, we introduce CSWL into RAID systems. Apparently, CSWL has a strong effect, making all devices wear evenly, as seen in Fig.12. Although CSWL in RAID6 does not control wear leveling as well as it does in RAID5, it is still much better than the other schemes at device-level wear leveling. CSWL-RAID has the most uniform age distribution, suggesting that the endurance of the entire CSWL-RAID system is prolonged.

4.4 Reliability of CSWL-RAID

Currently, reliability is a tough problem for flash-based SSDs. Reliability constrains the pace of the market development of flash. In addition, there are no dependable reliability models for SSDs. For traditional disks, some studies on failure models have been conducted, and various reliability models, such as the Markov model [11] and the simulation and shortcut method [12], have been proposed. This paper analyzes the reliability of SSD-based RAID using a mathematical method.

Before we measure the reliability of SSD-based RAID, we must first understand the reliability of a single flash chip. Fig.13 shows the relationship between the raw bit error rate (RBER) and age for MLC flash. RBER can be reduced significantly with ECC to obtain what is called the uncorrectable bit error rate (UBER). UBER is several orders of magnitude lower than RBER because the ECC on every page can correct most errors.

Fig.13. Wear-out and bit error rate for an MLC flash device [5].

We use the data loss possibility (DLP, the reciprocal of MTTDL) as the metric for estimating reliability. Since the devices in RAID5 and RAID6 have quite different ages, it is difficult to give a single value to evaluate their reliability. The DLP of Diff-RAID converges to a steady state after several replacements, so we can obtain the DLP of Diff-RAID from [4]. As CSWL-RAID aims at wear leveling among SSDs, the devices in CSWL-RAID have similar ages, so we can assume that the failure rates of all devices are the same. If the other precondition, that the reliability of each device is constant, holds, we can use (1) and (2) from Section 2 to evaluate CSWL-RAID's reliability.
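As a quick illustration of how (1) and (2) would be applied, the snippet below computes MTTDL and its reciprocal (the DLP per hour) for a four-device array; the MTTF and MTTR figures are purely assumed example values, not measurements or parameters from the paper.

```python
# Worked illustration of (1) and (2); the MTTF/MTTR values are assumptions
# chosen only to show the computation, not figures reported in the paper.
def mttdl_raid5(n, mttf, mttr):
    return mttf ** 2 / (n * (n - 1) * mttr)

def mttdl_raid6(n, mttf, mttr):
    return mttf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)

N = 4                      # number of devices in the array
MTTF = 1.0e6               # assumed mean time to failure of one device (hours)
MTTR = 24.0                # assumed mean time to repair/rebuild (hours)

for name, mttdl in (("RAID5", mttdl_raid5(N, MTTF, MTTR)),
                    ("RAID6", mttdl_raid6(N, MTTF, MTTR))):
    print(f"{name}: MTTDL = {mttdl:.3e} h, DLP = {1.0 / mttdl:.3e} per hour")
```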
For disk-based RAID, we can use (1) and (2) to estimate reliability. When disks are replaced by SSDs, the failure rate of each device can no longer be treated as a constant. For SLC flash, RBER is stable before the flash approaches its rated lifetime; in particular, it hardly changes at all with ECC. For MLC, RBER increases with age.

However, the change is not obvious when the flash is far from its rated lifetime, and ECC slows down the rate of change. CSWL-RAID migrates data from old devices to prepared ones before they reach their life limits; hence, no device whose age approaches its life limit remains in service for storing data. Based on this point, we can use a constant to approximate the failure rate, and we can therefore use (1) and (2) to evaluate the reliability of CSWL-RAID.

When we assume that the reliability of flash memory somewhat below its rated life is constant, we can compute the reliability of CSWL-RAID. Given the UBER of the flash used in Diff-RAID, the DLP of CSWL-RAID5 and CSWL-RAID6 can be computed; we list the values in Table 1. From Table 1, we clearly see that CSWL-RAID6 has much higher reliability than CSWL-RAID5 because RAID6 has a more reliable configuration. Table 1 also shows that the DLP of CSWL-RAID5 approaches that of Diff-RAID.

Table 1. Reliability of Various RAID Schemes

  RAID Scheme     Data Loss Possibility
  Diff-RAID       1e-06
  CSWL-RAID5      5e-06
  CSWL-RAID6      5e-12

5 Related Work

RAID can improve the performance and reliability of flash-based storage systems. We classify flash storage systems into two categories according to the granularity of the devices to which the RAID mechanism is applied. The first category, coarse grain, uses the RAID mechanism on SSDs. This mode is completely compatible with previous RAID technology, including software and hardware. However, although solid state drives hide the difference between flash and hard disks through the flash translation layer (FTL), the potential advantages of flash are also hidden. The second category, fine grain, which uses RAID on flash chips, can fully exploit the characteristics of flash.

Mao et al. [13] have designed a hybrid RAID4 named HPDA, which is composed of a disk and SSDs. The design uses the disk to serve as the dedicated parity device in response to write-intensive accesses; this covers a shortcoming of SSDs. HPDA has better performance and reliability compared with a RAID4 fully composed of SSDs. However, it retains the drawback of traditional RAID4: the parity device is the bottleneck. Kadav et al. [4] have proposed a novel SSD-based RAID called Diff-RAID, which has been discussed in detail in Section 2. Meanwhile, some studies have focused on constructing RAID with flash chips. Lee et al. [14] have designed a flash-aware redundancy array (FRA), which separates parity updating from the writing procedure and handles it when the system is idle. This lazy parity updating method improves write performance significantly. Chang et al. [15] have proposed a self-balancing flash array, which encodes hot data into several replicas stored on different banks. Thus, requests for hot data can be directed to cold banks that are responsible for fewer requests. However, every update results in the modification of parity, thereby seriously affecting write performance and causing high space cost. Im et al. [16] have presented a flash-aware RAID technique for dependable and high-performance flash memory SSDs. It delays the parity update that accompanies every data write to decrease the parity-handling overhead.

6 Conclusions and Future Work

In this paper, we proposed a cross-SSD wear-leveling method for SSD-based RAID systems. The method uses a dynamic parity distribution scheme, which adapts to SSD age.
Given that parity stripes suffer from more modification, we allocate more parity to younger SSDs and less parity to older SSDs, thereby ensuring wear leveling among SSDs and, to some degree, alleviating the imbalance in the write loads of devices. We also described the data layout and addressing algorithm of CSWL-RAID in detail. To implement our method, we need only a simple data structure to record a small amount of redistribution history. We then extended CSWL to RAID6 and presented an effective extension method. Through experiments, we showed that CSWL-RAID achieves an even wear distribution with insignificant overhead and that it outperforms other RAID systems in terms of average response time. The reliability of CSWL-RAID5 is comparable with that of Diff-RAID.

This paper discusses the use of CSWL in SSD-based RAID systems. In the future, we intend to extend the use of CSWL to flash chip-based RAID systems [17]. This means that we need to implement CSWL in an SSD controller combined with the FTL. Dealing with two kinds of address mapping, in the FTL and in RAID respectively, will be challenging. We believe that CSWL will be very effective at promoting performance and prolonging the life span of SSDs when implemented in an SSD controller.

References

[1] Chen F, Luo T, Zhang X D. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proc. the 9th FAST, Feb. 2011.
[2] Patterson D, Gibson G, Katz R H. A case for redundant arrays of inexpensive disks (RAID). In Proc. the 1988 SIGMOD, June 1988.

[3] Zheng W M, Zhang G Y. FastScale: Accelerate RAID scaling by minimizing data migration. In Proc. the 9th FAST, February 2011.
[4] Balakrishnan M, Kadav A, Prabhakaran V, Malkhi D. Differential RAID: Rethinking RAID for SSD reliability. In Proc. EuroSys, April 2010.
[5] Grupp L M, Caulfield A M, Coburn J, Swanson S, Yaakobi E, Siegel P H, Wolf J K. Characterizing flash memory: Anomalies, observations, and applications. In Proc. the 42nd MICRO, December 2009.
[6] Thomasian A, Blaum M. Higher reliability redundant disk arrays: Organizations, operation, and coding. ACM Transactions on Storage, 2009, 5(3): Article No.7.
[7] Du Y M, Xiao N, Liu F, Chen Z G. A customizable and modular flash translation layer (FTL) design and implementation. Journal of Xi'an Jiaotong University, 2010, 44(8). (In Chinese)
[8] Gal E, Toledo S. Algorithms and data structures for flash memories. ACM Computing Surveys, 2005, 37(2).
[9] Park K, Lee D H, Woo Y et al. Reliability and performance enhancement technique for SSD array storage system using RAID mechanism. In Proc. the 9th ISCIT, Sept. 2009.
[10] Lee D D, Seung H S. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 2001, 13.
[11] Geist R, Trivedi K. An analytic treatment of the reliability and performance of mirrored disk subsystems. In Proc. the 23rd Int. Symp. Fault-Tolerant Computing, June 1993.
[12] Thomasian A. Shortcut method for reliability comparisons in RAID. Journal of Systems and Software, 2006, 79(11).
[13] Mao B, Jiang H, Feng D, Wu S Z, Chen J X, Zeng L F, Tian L. HPDA: A hybrid parity-based disk array for enhanced performance and reliability. In Proc. IPDPS, April 2010.
[14] Lee Y, Jung S, Song Y H. FRA: A flash-aware redundancy array of flash storage. In Proc. the 7th CODES+ISSS, October 2009.
[15] Chang Y B, Chang L P. A self-balancing striping scheme for NAND-flash storage systems. In Proc. the 2008 ACM Symposium on Applied Computing, March 2008.
[16] Im S, Shin D K. Flash-aware RAID techniques for dependable and high-performance flash memory SSD. IEEE Transactions on Computers, 2011, 60(1).
[17] Xiao N, Chen Z G, Liu F, Lai M C, An L F. P3Stor: A parallel, durable flash-based SSD for enterprise-scale storage systems. Science China Information Sciences, 2011, 54(6).

Yi-Mo Du received his B.S. degree in computer science and technology from Jilin University, China, and his M.S. degree in computer science and technology from the National University of Defense Technology (NUDT), China. He is now pursuing his Ph.D. degree at NUDT. His current research interests include distributed file systems, network storage, and solid state storage systems.

Nong Xiao received his B.S., M.S., and Ph.D. degrees in computer science from NUDT, China. He is now a professor in the State Key Laboratory of High Performance Computing, NUDT. His current research interests include large-scale storage systems, network computing, and computer architecture.

Fang Liu received her B.S. and Ph.D. degrees in computer science from NUDT, China, in 1999 and 2005, respectively. She is now an associate professor in the State Key Laboratory of High Performance Computing, NUDT. Her current research interests include distributed file systems, network storage, and solid-state storage systems.

Zhi-Guang Chen received his B.S. degree in computer science and technology from the Harbin Institute of Technology, and his M.S.
degree in computer science and technology from NUDT. He is now pursuing his Ph.D. degree at NUDT. His current research interests include distributed file systems, network storage, and solid state storage systems.


More information

Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM

Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM Seyed Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, Rami Melhem University of Pittsburgh Intel Corporation seyedzadeh@cs.pitt.edu,

More information

One Optimized I/O Configuration per HPC Application

One Optimized I/O Configuration per HPC Application One Optimized I/O Configuration per HPC Application Leveraging I/O Configurability of Amazon EC2 Cloud Mingliang Liu, Jidong Zhai, Yan Zhai Tsinghua University Xiaosong Ma North Carolina State University

More information

Sector-Disk Codes and Partial MDS Codes with up to Three Global Parities

Sector-Disk Codes and Partial MDS Codes with up to Three Global Parities Sector-Disk Codes and Partial MDS Codes with up to Three Global Parities Junyu Chen Department of Information Engineering The Chinese University of Hong Kong Email: cj0@alumniiecuhkeduhk Kenneth W Shum

More information

IBM Research Report. Notes on Reliability Models for Non-MDS Erasure Codes

IBM Research Report. Notes on Reliability Models for Non-MDS Erasure Codes RJ10391 (A0610-035) October 24, 2006 Computer Science IBM Research Report Notes on Reliability Models for Non-MDS Erasure Codes James Lee Hafner, KK Rao IBM Research Division Almaden Research Center 650

More information

/12/$ IEEE 486

/12/$ IEEE 486 International Conference on Computing, Networking and Communications, Data Storage Technology and Applications Symposium Characterization and Error-Correcting Codes for TLC Flash Memories Eitan Yaakobi,

More information

SIMULATION-BASED APPROXIMATE GLOBAL FAULT COLLAPSING

SIMULATION-BASED APPROXIMATE GLOBAL FAULT COLLAPSING SIMULATION-BASED APPROXIMATE GLOBAL FAULT COLLAPSING Hussain Al-Asaad and Raymond Lee Computer Engineering Research Laboratory Department of Electrical & Computer Engineering University of California One

More information

When MTTDLs Are Not Good Enough: Providing Better Estimates of Disk Array Reliability

When MTTDLs Are Not Good Enough: Providing Better Estimates of Disk Array Reliability When MTTDLs Are Not Good Enough: Providing Better Estimates of Disk Array Reliability Jehan-François Pâris Dept. of Computer cience University of Houston Houston, TX 770-00 paris@cs.uh.edu Thomas J. E.

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems Jian-Jia Chen *, Chuan Yue Yang, Tei-Wei Kuo, and Chi-Sheng Shih Embedded Systems and Wireless Networking Lab. Department of Computer

More information

Secure RAID Schemes from EVENODD and STAR Codes

Secure RAID Schemes from EVENODD and STAR Codes Secure RAID Schemes from EVENODD and STAR Codes Wentao Huang and Jehoshua Bruck California Institute of Technology, Pasadena, USA {whuang,bruck}@caltechedu Abstract We study secure RAID, ie, low-complexity

More information

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE by Gerrit Muller University of Southeast Norway-NISE e-mail: gaudisite@gmail.com www.gaudisite.nl Abstract The purpose of the conceptual view is described. A number of methods or models is given to use

More information

Reliability Modeling of Cloud-RAID-6 Storage System

Reliability Modeling of Cloud-RAID-6 Storage System International Journal of Future Computer and Communication, Vol., No. 6, December 5 Reliability Modeling of Cloud-RAID-6 Storage System Qisi Liu and Liudong Xing Abstract Cloud storage is a crucial component

More information

Error-Correcting Schemes with Dynamic Thresholds in Nonvolatile Memories

Error-Correcting Schemes with Dynamic Thresholds in Nonvolatile Memories 2 IEEE International Symposium on Information Theory Proceedings Error-Correcting Schemes with Dynamic Thresholds in Nonvolatile Memories Hongchao Zhou Electrical Engineering Department California Institute

More information

Essentials of Large Volume Data Management - from Practical Experience. George Purvis MASS Data Manager Met Office

Essentials of Large Volume Data Management - from Practical Experience. George Purvis MASS Data Manager Met Office Essentials of Large Volume Data Management - from Practical Experience George Purvis MASS Data Manager Met Office There lies trouble ahead Once upon a time a Project Manager was tasked to go forth and

More information

Environment (Parallelizing Query Optimization)

Environment (Parallelizing Query Optimization) Advanced d Query Optimization i i Techniques in a Parallel Computing Environment (Parallelizing Query Optimization) Wook-Shin Han*, Wooseong Kwak, Jinsoo Lee Guy M. Lohman, Volker Markl Kyungpook National

More information

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012 Weather Research and Forecasting (WRF) Performance Benchmark and Profiling July 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell,

More information

Optimal Checkpoint Placement on Real-Time Tasks with Harmonic Periods

Optimal Checkpoint Placement on Real-Time Tasks with Harmonic Periods Kwak SW, Yang JM. Optimal checkpoint placement on real-time tasks with harmonic periods. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(1): 105 112 Jan. 2012. DOI 10.1007/s11390-012-1209-0 Optimal Checkpoint

More information

Hierarchical Codes: A Flexible Trade-off for Erasure Codes in Peer-to-Peer Storage Systems

Hierarchical Codes: A Flexible Trade-off for Erasure Codes in Peer-to-Peer Storage Systems Hierarchical Codes: A Flexible Trade-off for Erasure Codes in Peer-to-Peer Storage Systems Alessandro Duminuco (duminuco@eurecom.fr) Ernst W. Biersack (biersack@eurecom.fr) PREPRINT VERSION The original

More information

Coping with disk crashes

Coping with disk crashes Lecture 04.03 Coping with disk crashes By Marina Barsky Winter 2016, University of Toronto Disk failure types Intermittent failure Disk crash the entire disk becomes unreadable, suddenly and permanently

More information

Large-Scale Behavioral Targeting

Large-Scale Behavioral Targeting Large-Scale Behavioral Targeting Ye Chen, Dmitry Pavlov, John Canny ebay, Yandex, UC Berkeley (This work was conducted at Yahoo! Labs.) June 30, 2009 Chen et al. (KDD 09) Large-Scale Behavioral Targeting

More information

SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP

SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP SOFTWARE ARCHITECTURE DESIGN OF GIS WEB SERVICE AGGREGATION BASED ON SERVICE GROUP LIU Jian-chuan*, YANG Jun, TAN Ming-jian, GAN Quan Sichuan Geomatics Center, Chengdu 610041, China Keywords: GIS; Web;

More information

1 st Semester 2007/2008

1 st Semester 2007/2008 Chapter 17: System Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2007/2008 Slides baseados nos slides oficiais do livro Database System c Silberschatz, Korth and Sudarshan.

More information

Revenue Maximization in a Cloud Federation

Revenue Maximization in a Cloud Federation Revenue Maximization in a Cloud Federation Makhlouf Hadji and Djamal Zeghlache September 14th, 2015 IRT SystemX/ Telecom SudParis Makhlouf Hadji Outline of the presentation 01 Introduction 02 03 04 05

More information

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in

More information

Graph-based codes for flash memory

Graph-based codes for flash memory 1/28 Graph-based codes for flash memory Discrete Mathematics Seminar September 3, 2013 Katie Haymaker Joint work with Professor Christine Kelley University of Nebraska-Lincoln 2/28 Outline 1 Background

More information

BDR: A Balanced Data Redistribution Scheme to Accelerate the Scaling Process of XOR-based Triple Disk Failure Tolerant Arrays

BDR: A Balanced Data Redistribution Scheme to Accelerate the Scaling Process of XOR-based Triple Disk Failure Tolerant Arrays BDR: A Balanced Data Redistribution Scheme to Accelerate the Scaling Process of XOR-based Triple Disk Failure Tolerant Arrays Yanbing Jiang 1, Chentao Wu 1, Jie Li 1,2, and Minyi Guo 1 1 Department of

More information

IBM Research Report. Construction of PMDS and SD Codes Extending RAID 5

IBM Research Report. Construction of PMDS and SD Codes Extending RAID 5 RJ10504 (ALM1303-010) March 15, 2013 Computer Science IBM Research Report Construction of PMDS and SD Codes Extending RAID 5 Mario Blaum IBM Research Division Almaden Research Center 650 Harry Road San

More information

VECTOR REPETITION AND MODIFICATION FOR PEAK POWER REDUCTION IN VLSI TESTING

VECTOR REPETITION AND MODIFICATION FOR PEAK POWER REDUCTION IN VLSI TESTING th 8 IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems Sopron, Hungary, April 13-16, 2005 VECTOR REPETITION AND MODIFICATION FOR PEAK POWER REDUCTION IN VLSI TESTING M. Bellos

More information

EDF Feasibility and Hardware Accelerators

EDF Feasibility and Hardware Accelerators EDF Feasibility and Hardware Accelerators Andrew Morton University of Waterloo, Waterloo, Canada, arrmorton@uwaterloo.ca Wayne M. Loucks University of Waterloo, Waterloo, Canada, wmloucks@pads.uwaterloo.ca

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL

Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL Analyses of Energy Consumption Changes by Loop Transformations in Log Blocks-based FTL Memory Architecture and Organization Workshop 2013 (MeAOW 2013) 2013. 10. 3 Joon-Young Paik*, Tae-Sun Chung**, Eun-Sun

More information

Robust Network Codes for Unicast Connections: A Case Study

Robust Network Codes for Unicast Connections: A Case Study Robust Network Codes for Unicast Connections: A Case Study Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades Department of Electrical and Computer Engineering Texas A&M University College Station,

More information

Linear Programming Bounds for Robust Locally Repairable Storage Codes

Linear Programming Bounds for Robust Locally Repairable Storage Codes Linear Programming Bounds for Robust Locally Repairable Storage Codes M. Ali Tebbi, Terence H. Chan, Chi Wan Sung Institute for Telecommunications Research, University of South Australia Email: {ali.tebbi,

More information

Ultimate Codes: Near-Optimal MDS Array Codes for RAID-6

Ultimate Codes: Near-Optimal MDS Array Codes for RAID-6 University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln CSE Technical reports Computer Science and Engineering, Department of Summer 014 Ultimate Codes: Near-Optimal MDS Array

More information

DRAM Reliability: Parity, ECC, Chipkill, Scrubbing. Alpha Particle or Cosmic Ray. electron-hole pairs. silicon. DRAM Memory System: Lecture 13

DRAM Reliability: Parity, ECC, Chipkill, Scrubbing. Alpha Particle or Cosmic Ray. electron-hole pairs. silicon. DRAM Memory System: Lecture 13 slide 1 DRAM Reliability: Parity, ECC, Chipkill, Scrubbing Alpha Particle or Cosmic Ray electron-hole pairs silicon Alpha Particles: Radioactive impurity in package material slide 2 - Soft errors were

More information

arxiv: v2 [cs.pf] 10 Jun 2018

arxiv: v2 [cs.pf] 10 Jun 2018 1 Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems Mostafa Kishani and Hossein Asadi, Senior Member, IEEE Data Storage, Networks, & Processing (DSN) Lab, Department

More information

Sparse solver 64 bit and out-of-core addition

Sparse solver 64 bit and out-of-core addition Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract

More information

RESEARCH ON THE DISTRIBUTED PARALLEL SPATIAL INDEXING SCHEMA BASED ON R-TREE

RESEARCH ON THE DISTRIBUTED PARALLEL SPATIAL INDEXING SCHEMA BASED ON R-TREE RESEARCH ON THE DISTRIBUTED PARALLEL SPATIAL INDEXING SCHEMA BASED ON R-TREE Yuan-chun Zhao a, b, Cheng-ming Li b a. Shandong University of Science and Technology, Qingdao 266510 b. Chinese Academy of

More information

Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication

Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication Min Shen, Ajay D. Kshemkalyani, TaYuan Hsu University of Illinois at Chicago Min Shen, Ajay D. Kshemkalyani, TaYuan Causal

More information

THE APPLICATION OF GREY SYSTEM THEORY TO EXCHANGE RATE PREDICTION IN THE POST-CRISIS ERA

THE APPLICATION OF GREY SYSTEM THEORY TO EXCHANGE RATE PREDICTION IN THE POST-CRISIS ERA International Journal of Innovative Management, Information & Production ISME Internationalc20 ISSN 285-5439 Volume 2, Number 2, December 20 PP. 83-89 THE APPLICATION OF GREY SYSTEM THEORY TO EXCHANGE

More information

Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage

Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage 1 Minimum Repair andwidth for Exact Regeneration in Distributed Storage Vivec R Cadambe, Syed A Jafar, Hamed Malei Electrical Engineering and Computer Science University of California Irvine, Irvine, California,

More information

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur March 19, 2018 Modular Exponentiation Public key Cryptography March 19, 2018 Branch Prediction Attacks 2 / 54 Modular Exponentiation

More information

Error Detection, Correction and Erasure Codes for Implementation in a Cluster File-system

Error Detection, Correction and Erasure Codes for Implementation in a Cluster File-system Error Detection, Correction and Erasure Codes for Implementation in a Cluster File-system Steve Baker December 6, 2011 Abstract. The evaluation of various error detection and correction algorithms and

More information

Estimation of DNS Source and Cache Dynamics under Interval-Censored Age Sampling

Estimation of DNS Source and Cache Dynamics under Interval-Censored Age Sampling Estimation of DNS Source and Cache Dynamics under Interval-Censored Age Sampling Di Xiao, Xiaoyong Li, Daren B.H. Cline, Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering

More information

Thermal Scheduling SImulator for Chip Multiprocessors

Thermal Scheduling SImulator for Chip Multiprocessors TSIC: Thermal Scheduling SImulator for Chip Multiprocessors Kyriakos Stavrou Pedro Trancoso CASPER group Department of Computer Science University Of Cyprus The CASPER group: Computer Architecture System

More information

I N T R O D U C T I O N : G R O W I N G I T C O M P L E X I T Y

I N T R O D U C T I O N : G R O W I N G I T C O M P L E X I T Y Global Headquarters: 5 Speen Street Framingham, MA 01701 USA P.508.872.8200 F.508.935.4015 www.idc.com W H I T E P A P E R I n v a r i a n t A n a l y z e r : A n A u t o m a t e d A p p r o a c h t o

More information

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates

EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs. Cross-coupled NOR gates EECS150 - Digital Design Lecture 23 - FFs revisited, FIFOs, ECCs, LSFRs April 16, 2009 John Wawrzynek Spring 2009 EECS150 - Lec24-blocks Page 1 Cross-coupled NOR gates remember, If both R=0 & S=0, then

More information

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG

Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Fault Tolerance Technique in Huffman Coding applies to Baseline JPEG Cung Nguyen and Robert G. Redinbo Department of Electrical and Computer Engineering University of California, Davis, CA email: cunguyen,

More information

FPGA Implementation of Pseudo Noise Sequences based on Quadratic Residue Theory

FPGA Implementation of Pseudo Noise Sequences based on Quadratic Residue Theory FPGA Implementation of Pseudo Noise Sequences based on Quadratic Residue Theory A. Rajagopal Dept. of E&C, K.L. Sudha Dept.of E&C, Dundi Ajay Dept. of E&C, ABSTRACT Pseudo Noise (PN) sequences are defined

More information

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Ian Foster: Univ. of

More information

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis Ioan Raicu Distributed Systems Laboratory Computer Science Department University of Chicago Joint work with: Ian Foster: Univ. of

More information

Multidimensional Flash Codes

Multidimensional Flash Codes Multidimensional Flash Codes Eitan Yaakobi, Alexander Vardy, Paul H. Siegel, and Jack K. Wolf University of California, San Diego La Jolla, CA 9093 0401, USA Emails: eyaakobi@ucsd.edu, avardy@ucsd.edu,

More information

Introduction to Randomized Algorithms III

Introduction to Randomized Algorithms III Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability

More information

Analytic Models of SSD Write Performance

Analytic Models of SSD Write Performance Analytic Models of SSD Write Performance PETER DESNOYERS, Northeastern University Solid state drives (SSDs) update data by writing a new copy, rather than overwriting old data, causing prior copies of

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Tackling Intracell Variability in TLC Flash Through Tensor Product Codes

Tackling Intracell Variability in TLC Flash Through Tensor Product Codes Tackling Intracell Variability in TLC Flash Through Tensor Product Codes Ryan Gabrys, Eitan Yaakobi, Laura Grupp, Steven Swanson, Lara Dolecek University of California, Los Angeles University of California,

More information

Making Error Correcting Codes Work for Flash Memory

Making Error Correcting Codes Work for Flash Memory Making Error Correcting Codes Work for Flash Memory Part I: Primer on ECC, basics of BCH and LDPC codes Lara Dolecek Laboratory for Robust Information Systems (LORIS) Center on Development of Emerging

More information

Revisiting Memory Errors in Large-Scale Production Data Centers

Revisiting Memory Errors in Large-Scale Production Data Centers Revisiting Memory Errors in Large-Scale Production Da Centers Analysis and Modeling of New Trends from the Field Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Overview Study of DRAM reliability: on modern

More information

In-Memory Computing of Akers Logic Array

In-Memory Computing of Akers Logic Array In-Memory Computing of Akers Logic Array Eitan Yaakobi Electrical Engineering California Institute of Technology Pasadena, CA 91125 yaakobi@caltechedu Anxiao (Andrew) Jiang Computer Science and Engineering

More information

Integrated Electricity Demand and Price Forecasting

Integrated Electricity Demand and Price Forecasting Integrated Electricity Demand and Price Forecasting Create and Evaluate Forecasting Models The many interrelated factors which influence demand for electricity cannot be directly modeled by closed-form

More information

Spatial information sharing technology based on Grid

Spatial information sharing technology based on Grid Spatial information sharing technology based on Grid Hong-bin ZHANG a,b,c,1, Bao-rui CHEN a,b,c,3,gang LI b,c,2, Xiao-ping XIN a,b,c,* a Hulunber Grassland Ecosystem Observation and Research Station, Beijing

More information

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs EECS150 - Digital Design Lecture 26 - Faults and Error Correction April 25, 2013 John Wawrzynek 1 Types of Faults in Digital Designs Design Bugs (function, timing, power draw) detected and corrected at

More information

Incremental Latin Hypercube Sampling

Incremental Latin Hypercube Sampling Incremental Latin Hypercube Sampling for Lifetime Stochastic Behavioral Modeling of Analog Circuits Yen-Lung Chen +, Wei Wu *, Chien-Nan Jimmy Liu + and Lei He * EE Dept., National Central University,

More information

Scheduling periodic Tasks on Multiple Periodic Resources

Scheduling periodic Tasks on Multiple Periodic Resources Scheduling periodic Tasks on Multiple Periodic Resources Xiayu Hua, Zheng Li, Hao Wu, Shangping Ren* Department of Computer Science Illinois Institute of Technology Chicago, IL 60616, USA {xhua, zli80,

More information

Technology Mapping for Reliability Enhancement in Logic Synthesis

Technology Mapping for Reliability Enhancement in Logic Synthesis Technology Mapping for Reliability Enhancement in Logic Synthesis Zhaojun Wo and Israel Koren Department of Electrical and Computer Engineering University of Massachusetts,Amherst,MA 01003 E-mail: {zwo,koren}@ecs.umass.edu

More information

A Sketch of an Ontology of Spaces

A Sketch of an Ontology of Spaces A Sketch of an Ontology of Spaces Pierre Grenon Knowledge Media Institute The Open University p.grenon@open.ac.uk Abstract. In these pages I merely attempt to sketch the basis of an ontology of spaces

More information

Noise Modeling and Capacity Analysis for NAND Flash Memories

Noise Modeling and Capacity Analysis for NAND Flash Memories Noise Modeling and Capacity Analysis for NAND Flash Memories Qing Li, Anxiao (Andrew) Jiang, and Erich F. Haratsch Flash Components Division, LSI Corporation, San Jose, CA, 95131 Computer Sci. and Eng.

More information

On the Power of Asymmetry and Memory in Flash-based SSD Garbage Collection

On the Power of Asymmetry and Memory in Flash-based SSD Garbage Collection On the Power of Asymmetry and Memory in Flash-based SSD Garbage Collection B. Van Houdt Department of Mathematics and Computer Science University of Antwerp - iminds Abstract The power of d random choices

More information

An Experimental Evaluation of Passage-Based Process Discovery

An Experimental Evaluation of Passage-Based Process Discovery An Experimental Evaluation of Passage-Based Process Discovery H.M.W. Verbeek and W.M.P. van der Aalst Technische Universiteit Eindhoven Department of Mathematics and Computer Science P.O. Box 513, 5600

More information

SEMICONDUCTOR MEMORIES

SEMICONDUCTOR MEMORIES SEMICONDUCTOR MEMORIES Semiconductor Memory Classification RWM NVRWM ROM Random Access Non-Random Access EPROM E 2 PROM Mask-Programmed Programmable (PROM) SRAM FIFO FLASH DRAM LIFO Shift Register CAM

More information

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of

More information

Hybrid Fault diagnosis capability analysis of Hypercubes under the PMC model and MM model

Hybrid Fault diagnosis capability analysis of Hypercubes under the PMC model and MM model Hybrid Fault diagnosis capability analysis of Hypercubes under the PMC model and MM model Qiang Zhu,Lili Li, Sanyang Liu, Xing Zhang arxiv:1709.05588v1 [cs.dc] 17 Sep 017 Abstract System level diagnosis

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

Performance Models of Flash-based Solid-State Drives for Real Workloads

Performance Models of Flash-based Solid-State Drives for Real Workloads Performance Models of Flash-based Solid-State Drives for Real Workloads Simona Boboila Northeastern University 360 Huntington Ave. Boston, MA 02115 simona@ccs.neu.edu Peter Desnoyers Northeastern University

More information

High Sum-Rate Three-Write and Non-Binary WOM Codes

High Sum-Rate Three-Write and Non-Binary WOM Codes Submitted to the IEEE TRANSACTIONS ON INFORMATION THEORY, 2012 1 High Sum-Rate Three-Write and Non-Binary WOM Codes Eitan Yaakobi, Amir Shpilka Abstract Write-once memory (WOM) is a storage medium with

More information

Semiconductor Memories

Semiconductor Memories Semiconductor References: Adapted from: Digital Integrated Circuits: A Design Perspective, J. Rabaey UCB Principles of CMOS VLSI Design: A Systems Perspective, 2nd Ed., N. H. E. Weste and K. Eshraghian

More information

Multi-State Availability Modeling in Practice

Multi-State Availability Modeling in Practice Multi-State Availability Modeling in Practice Kishor S. Trivedi, Dong Seong Kim, Xiaoyan Yin Depart ment of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA kst@ee.duke.edu, {dk76,

More information