Research on a new peer to cloud and peer model and a deduplication storage system


University of Wollongong Research Online
University of Wollongong Thesis Collections
2011

Research on a new peer to cloud and peer model and a deduplication storage system
Zhe Sun, University of Wollongong

Recommended Citation:
Sun, Zhe, Research on a new peer to cloud and peer model and a deduplication storage system, Master of Information Systems and Technology by Research thesis, School of Information Systems and Technology, University of Wollongong, 2011.

Research Online is the open access institutional repository for the University of Wollongong. For further information contact Manager Repository Services: morgan@uow.edu.au.


RESEARCH ON A NEW PEER TO CLOUD AND PEER MODEL AND A DEDUPLICATION STORAGE SYSTEM

A Thesis Submitted in Fulfilment of the Requirements for the Award of the Degree of Master of Information Systems and Technology by Research from UNIVERSITY OF WOLLONGONG by Zhe SUN, School of Information Systems and Technology, Faculty of Informatics, 2011

© Copyright 2011 by Zhe SUN
ALL RIGHTS RESERVED

CERTIFICATION

I, Zhe SUN, declare that this thesis, submitted in partial fulfilment of the requirements for the award of Master of Information Systems and Technology by Research, in the School of Information Systems and Technology, Faculty of Informatics, University of Wollongong, is wholly my own work unless otherwise referenced or acknowledged. The document has not been submitted for qualifications at any other academic institution. (Signature Required) Zhe SUN, 30 March 2011

Dedicated to My Parents Hui SUN and Wei HONG

Table of Contents

List of Tables
List of Figures/Illustrations
Acknowledgements
ABSTRACT
Publications

1 Introduction
    Overview
    Major issues
    Motivations of this work
    Methodologies
        Method of improving data transmission efficiency
        Method of developing deduplicated storage system
    Summary of outcomes
    Structure of the thesis

2 Literature Review
    Introduction
    Cloud Background
        Cloud computing
        Cloud storage
    Distribution file system
        Distributed file system
        Distributed database
    Distributed storage model
        Distributed coordinator storage system
        Files storage granularity
        System scalability
    Related work
        Existing deduplicated storage systems
        Method of identifying duplications
        Hash algorithm
        Study of cloud storage model

    2.6 Summary

3 P2CP: Peer To Cloud And Peer
    Introduction
    Existing distributed storage models
        Peer to peer storage model
        Peer to server and peer
        Cloud storage model
    A new cloud model: P2CP
    Comparison
        Comparison based on Poisson process
        Comparison based on Little's law
    Summary

4 DeDu: A Deduplication Storage System Over Cloud Computing
    Introduction
    Identifying the duplication
    Storage mechanism
    Design of DeDu
        Data organization
        Storage of the files
        Access to the files
        Deletion of files
    Environment of simulation
    System implementation
        System architecture
        Class diagram
        Interface
    Summary

5 Experiment Results and Evaluation
    Evaluations and summary of P2CP
        Evaluation from network model availability
        Evaluation from particular resource availability
        Summary of P2CP
    Performance of DeDu
        Deduplication efficiency
        Balance of load
        Reading efficiency
        Writing efficiency
    Evaluations and summary of DeDu
    Summary

6 Conclusion and Future Work
    Conclusion
    Future Work

References

A Appendix of DeDu's Code

List of Tables

4.1 Configuration of virtual machine
Read efficiency
Write efficiency without deduplication
Write efficiency with deduplication

List of Figures

1.1 Digital data worldwide
GFS Architecture [27]
Table location hierarchy for BigTable [15]
P2P Storage Model
P2SP Storage Model
Traditional Cloud Storage Model
P2CP storage model
Time for download
Comparing download time
The architecture of source data and link files
Architecture of deduplication cloud storage system
Collaborative relationship of HDFS and HBase [12]
Data organization in DeDu
Procedures for storing a file
Procedures to access a file
Procedures to delete a file
Overview of Package
System Implementation
Commands Package
Task Package
Utility Package
DeDu's Main Interface
Configure HDFS
Configure HBase
After connecting to HDFS and HBase
Uploading process
Downloading process
Deletion process
Deduplication efficiency
Static load balance
Dynamic load balance

Acknowledgements

I would like to thank my supervisor, Dr. Jun SHEN. With his patience and encouragement I was able to carry on my experiments and finish this thesis. I also want to thank Dr. Ghassan BEYDOUN, Professor Xiaolin WANG and Dr. Tania SILVER at the University of Wollongong, and Dr. Jianming YONG at the University of Southern Queensland. Without their help and advice, this thesis would not have been possible. Secondly, I am particularly indebted to my labmates Xiaojun Zhang, Zhou Sun, Hongda Tian, Juncheng Cui, Xing Su, Hongxiang Hu, and my friends Di Song and Jinlin Li. Without their great help and assistance, this thesis would not have been possible. Last but not least, I am also very grateful to my parents, who have been encouraging me and always supporting me through the most difficult times of this work. Without their love and understanding, it would have been impossible to accomplish this work.

Research on a new peer to cloud and peer model and a deduplication storage system

Zhe SUN
A Thesis for Master of Information Systems and Technology by Research
School of Information Systems and Technology
University of Wollongong

ABSTRACT

Since the concept of cloud computing was proposed, scientists have conducted research on it. After Amazon supplied cloud services, many cloud techniques and cloud online applications have been developed. Recently, some traditional Web services have changed their service platforms to cloud platforms. Most of these Web services and applications concentrate on the functions of computing and storage. This thesis focuses on improving the performance of the cloud storage service. In particular, this manuscript presents a new storage model named Peer to Cloud and Peer (P2CP). We assume that data transmission in the P2CP model follows a Poisson process and, using Little's law, prove by mathematical modeling that the speed and availability of P2CP are better than those of the pure Peer to Peer (P2P) model, the Peer to Server and Peer (P2SP) model, and the cloud model. The main feature of P2CP is that there are three data transmissions in the storage model: the cloud-user data transmission, the clients' data transmission, and the common data transmission. P2CP uses the cloud storage system as a common storage system. When data transmission occurs, the data nodes, cloud users, and non-cloud users are all involved together to complete the transaction. This thesis also presents a deduplication storage system over cloud computing. Our deduplication storage system consists of two major components: a front-end deduplication application and the Hadoop Distributed File System (HDFS). HDFS is a common back-end distributed file system, which is used with HBase, a Hadoop database. We use HDFS to build up a mass storage system and HBase to build up a fast indexing system.
With the deduplication application, a scalable, parallel deduplicated cloud storage system can be effectively built up. We further use VMware to generate a simulated cloud environment. The simulation results demonstrate that the storage efficiency of our deduplication cloud storage system is better than that of traditional cloud storage systems. KEYWORDS: Cloud, Storage, Delete-duplicates, P2CP, P2P

Publications

Zhe SUN, Jun SHEN, and Jianming YONG. (2010) DeDu: Building a Deduplication Storage System over Cloud Computing. Accepted by the th International Conference on Computer Supported Cooperative Work in Design (CSCWD 11). Accepted Date:

Zhe SUN, Jun SHEN, and Ghassan BEYDOUN. (2010) P2CP: A New Cloud Storage Model to Enhance Performance of Cloud Services. Accepted by the 2011 International Conference on Information Resources Management in association with the Korea Society of MIS Conference (Conf-IRM-KMIS 11). Accepted Date:

Chapter 1

Introduction

The research reported in this thesis investigates issues in cloud computing. This chapter presents an overview of the research, including the major issues, motivations, methodologies and structure of the thesis. To achieve this, Section 1.1 introduces the history and development status of cloud computing. Then, Section 1.2 describes the major issues existing in the field. Section 1.3 gives the motivations for this work. Section 1.4 presents the methodologies of this work. Section 1.5 gives a summary of the main results of this thesis. The structure of the thesis is given in the final section, Section 1.6.

1.1 Overview

In 1966, Douglas Parkhill pointed out the possibility of a computer utility that would include the features of elastic provision, supply as a utility, and online access with the illusion of infinite supply, just like the electricity industry [20]. In the development of cloud computing, Amazon has played an indispensable role, as the company launched Amazon Web Services (AWS) on a utility computing basis in 2006, including the Elastic Compute Cloud (EC2), the Simple Storage Service (S3), and SimpleDB.

After that, Google, Microsoft, IBM, Apache and others joined the ranks of large-scale cloud computing research projects. Google offers Google Apps; Microsoft provides Azure; IBM supplies Blue Cloud to users; and Apache focuses on research into cloud architecture and cloud computing algorithms. Modern society is a digital universe, and almost all current information could not survive outside it. The size of the digital universe in 2007 was 281 exabytes, and by 2011 it was expected to become ten times larger than it was then; see Figure 1.1 [25].

Figure 1.1: Digital data worldwide

This has led to research focusing on the development of storage systems that can store and manage such masses of data. However, with the development of storage systems, a great volume of backup data is created periodically. As time goes

by, these data occupy huge storage space. Furthermore, once we need these backup data, the data transmission bottleneck caused by network bandwidth will constrain the delivery of data. Thus, the importance of new storage systems that can delete duplicated data and efficiently deliver data to and from the storage system becomes significant and obvious.

1.2 Major issues

From the above brief description, we are well aware of two main issues, with several sub-issues, that need to be solved in this thesis. The first main issue is that, with the significantly increased volume of data, new technology is needed to create a huge container to store it; the other is the bottleneck of data transmission over the Internet. The sub-issues on the new technology for a huge container include building the mass data storage platform and developing a new technique to manage the vast amount of data. Furthermore, it is important to introduce the deduplication technique into storage systems, which will keep them running efficiently. On the other hand, the sub-issues on the bottleneck of data transmission over the Internet include introducing the Peer to Peer (P2P) data transmission model into cloud storage systems; solving the problem of persistent data availability; proving that the efficiency of the new storage model is better than that of others; and finding a reliable mathematical model to evaluate the new storage model.

1.3 Motivations of this work

With the development of information technology, many information systems run over the network. Thus, the transmission of data which supports basic computing has become extremely important. In [39], 10 obstacles were clearly defined, and two of them are related to our work: data transfer bottlenecks and scalable storage. Based on the current network bandwidth and storage capacity all over the world, it should be possible to satisfy demands from users. The only problem is that these resources are not distributed in a balanced way, so the problem shifts to how we can exploit current resources efficiently. From the research of [10], we find that a P2P network can efficiently exploit network bandwidth; but, from the work of [34], we find that a P2P network cannot offer steady and persistent availability. On the other hand, the cloud storage model may solve the problem of data capacity, but at the sacrifice of storage space efficiency. The motivation of this work is to improve data transmission efficiency while enhancing the usage efficiency of data space. For the deduplication system, the works on HYDRAstor [14], MAD2 [55], DeDe [8] and Extreme Binning [11] provide limited solutions for deleting duplications. The results of deleting duplications in Extreme Binning and DeDe are not accurate, and neither MAD2 nor HYDRAstor can efficiently employ network bandwidth. So, with the aim of improving the accuracy of the deduplication system and enhancing network bandwidth usage, we will design and build a new deduplication system over the cloud environment.

1.4 Methodologies

1.4.1 Method of improving data transmission efficiency

The method of improving data transmission efficiency is based on research into storage models and mathematical calculation. Firstly, we study the existing storage models, compare their advantages and disadvantages, and find the bottlenecks in data transmission. Secondly, based on the advantages of the different storage models, we propose a new storage model to eliminate the bottleneck in data transmission and offer persistent data availability. Thirdly, based on our mathematical model, we prove that the new storage model is better than the others. Finally, we use MATLAB to simulate the process of data transmission and obtain the results that we need.

1.4.2 Method of developing a deduplicated storage system

To build a deduplicated storage system, we divide the whole development process into three main stages. First stage: a scalable, parallel deduplication algorithm should be moved into the cloud. In this stage, the main task is to study a good deduplication algorithm that is scalable and parallel, and transfer it into the cloud computing environment. Second stage: the scalable, parallel deduplication system for a chunk-based file backup system should be optimized to adapt it to the cloud storage system. In this stage, the main task is to optimize the chunk-based file backup system and transfer it into the cloud computing environment. Final stage: the deduplication function in the cloud should be provided to users via the Internet. In this stage, the main task is to make the deduplication function a service and provide it to users via the Internet.
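As a tiny illustration of the kind of calculation such a mathematical model performs (a hypothetical Python sketch with made-up numbers, not the MATLAB simulation used in this thesis), Little's law relates the average number of requests in a system, the request arrival rate, and the mean time a request spends in the system:

```python
# Little's law: L = lam * W, where
#   L   = average number of requests in the system
#   lam = average arrival rate of download requests (requests/second)
#   W   = average time each request spends in the system (seconds)
# The numbers below are hypothetical, for illustration only.

def mean_time_in_system(lam: float, avg_in_system: float) -> float:
    """Solve Little's law for W given the arrival rate and average occupancy."""
    return avg_in_system / lam

# e.g. 2 requests/s arriving, 10 requests in flight on average
w = mean_time_in_system(lam=2.0, avg_in_system=10.0)
print(w)  # 5.0 seconds average time in system
```

Comparing storage models then reduces to comparing the W each model yields under the same request load.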

1.5 Summary of outcomes

The main outcomes of this thesis are the design and construction of a deduplication storage system that runs over cloud computing, and the proposal of a new distributed storage system model, Peer to Cloud and Peer (P2CP). The specific outcomes are the following:

1. In this research, we analyze the advantages and disadvantages of existing distributed storage systems, especially the P2P storage model, the Peer to Server and Peer (P2SP) storage model, and the cloud storage model.

2. In this research, we propose a new distributed storage system model. This storage model absorbs the advantages of the P2P storage model in highly efficient employment of bandwidth, the advantages of the P2SP storage model in persistent data availability, and the advantages of the cloud storage model in scalability.

3. This research proves that the efficiency of data transmission for the new storage model is better than for traditional distributed storage models. We will use the Poisson process and Little's law to set up a mathematical model to calculate data transmission time in a network under the new storage model, and prove that the result is better.

4. A deduplication application will be developed. The application, as a front-end, will be a part of the deduplication storage system. It may include both a command operation mode and a graphical user interface (GUI) mode. Users do not need to be familiar with the command line if they use the GUI mode.

5. A cloud storage platform will be built. We will employ HDFS and HBase to build

the cloud storage platform. We exploit HDFS to manage the commodity hardware, and employ HBase as a data manager to deal with communication with the front-end. When the front-end and back-end collaborate, the deduplicated storage system will work well.

1.6 Structure of the thesis

This thesis is organized as follows: Chapter 2 reviews current work relevant to our research in the areas of the cloud, distributed file systems, distributed storage models, and the main deduplication techniques. Chapter 3 gives an analysis of distributed storage models from P2P to cloud. Based on the study of distributed storage models, it proposes a new P2CP storage model and proves, using a mathematical model, that the efficiency of data transmission for the P2CP storage model is better than for other systems. Chapter 4 proposes a deduplication storage system which works over cloud computing and has been named DeDu. This chapter introduces the design of DeDu and describes the system implementation in detail. Chapter 5 shows DeDu's performance, and evaluates both DeDu and the P2CP storage model. Chapter 6 concludes this thesis by providing the advantages and limitations of this

study, and describing the future work.

Chapter 2

Literature Review

2.1 Introduction

This chapter reviews the literature on the background knowledge of cloud computing. The cloud is covered in section 2.2, and then the related work on distributed file systems is introduced in section 2.3. Section 2.4 contains a discussion of related research on distributed storage models, and section 2.5 discusses related work on deduplication storage systems. At the end of the chapter, section 2.6 summarises the chapter.

2.2 Cloud Background

Over the past two years, the idea of cloud computing has become extremely popular, and numerous companies and research institutes have focused on it. They hope that the idea of the cloud will enable them to set up new computing architectures in both infrastructure and applications. The following two sections will briefly describe the

relationships and differences between cloud computing and cloud storage.

2.2.1 Cloud computing

Narrowly defined, cloud computing is a distributed, parallel, and scalable computation system driven by the user's demand through the Internet. Broadly defined, cloud computing is Internet-based computing, in which computers and other devices share resources, software and information on demand. This technology evolved from distributed computing, grid computing, Web 2.0, virtualization, and other network technologies. The core idea is the Map/Reduce algorithm. Map/Reduce is an associated implementation and a programming model for generating and processing huge data sets. Users define a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key [26]. By exploiting the Map/Reduce programming model, computation tasks are automatically parallelized and executed on distributed clusters. With the support of a distributed file system and a distributed database, which can be built on commodity machines, cloud computing can achieve high performance, just like an expensive super-cluster. Cloud computing consists of both applications and hardware delivered to users as services via the Internet [39]. With the rapid development of cloud computing, more and more cloud services have emerged, such as SaaS (software as a service), PaaS (platform as a service) and IaaS (infrastructure as a service).
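The map/reduce contract described above can be sketched in a few lines of Python — a toy, single-process analogue of the distributed implementation in [26], shown here only to make the roles of the two user-defined functions concrete. The word-count example is the classic illustration:

```python
from collections import defaultdict

def map_fn(document: str):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in document.split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)

def mapreduce(documents):
    # Shuffle phase: group intermediate pairs by key. In a real cluster this
    # grouping is what the framework does between the map and reduce workers.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: each key could be processed on a different node.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["cloud storage", "cloud computing"]))
# {'cloud': 2, 'storage': 1, 'computing': 1}
```

Because each map call and each reduce call is independent, the framework can parallelize both phases across a cluster without the user changing the two functions.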

2.2.2 Cloud storage

Cloud storage is an online storage model. The concept of cloud storage is derived from cloud computing. It refers to a series of distributed storage devices accessed over the Internet via Web service application program interfaces (APIs). With the rocket-like development of cloud computing, the advantages of cloud storage have increased significantly, and the concept of cloud storage has been accepted by the community. The core techniques are the distributed file system and the virtualized disk, which feature large-scale, stable, fault-tolerant, and scalable capacity. Thus, in cloud storage, dedicated storage servers are not strictly necessary, and most data nodes are commodity machines. With the support of cloud computing, the capacity of cloud storage can easily reach the petabyte level while keeping the response time within a few seconds.

2.3 Distribution file system

As mentioned in the previous section, the distributed file system and the distributed database are core techniques in the field of cloud computing. These techniques are introduced in the following sections.

2.3.1 Distributed file system

The prototype of the distributed file system can be traced back to Carnegie-Mellon University (CMU), which proposed the principles and design of a distributed file system in 1985 [50] and, with IBM, developed Andrew, a distributed personal computing environment, in 1986 [40]. Sun Microsystems designed and implemented

the Sun network file system (Sun-NFS) in 1985 [48]. Both of these systems have some, albeit limited, features of cloud storage. For example, Andrew consists of approximately 5000 workstations and is much larger than Sun-NFS; Andrew is a research prototype, not a business product; and there is no difference between clients and servers in Sun-NFS. In the following years, many network storage systems and distributed file systems emerged, such as Venti [43], RADOS [47], Petal [21], Ursa Minor [38], Panasas [13], Farsite [6], Sorrento [29], and FAB [59]. With the development of P2P techniques, a new type of distributed file system appeared. These file systems are based on hierarchical peer-to-peer (P2P) block storage, with some peers playing the role of master while others are data nodes; examples are the Eliot file system [51] and Serverless Network Storage [61]. The Google File System (GFS) represents a milestone in modern distributed systems. It is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. The architecture of GFS is shown in Figure 2.1 [27].

Figure 2.1: GFS Architecture [27].

Because of the excellent performance of GFS, many other distributed file systems are modeled on it, such as the Hadoop Distributed File System (HDFS) [12] developed by Apache. The advantages of HDFS are that it is open source, can run on commodity hardware, and provides high-throughput access to Web application data. Detailed information on HDFS will be introduced in Chapter 4.

2.3.2 Distributed database

The first generation distributed database was the System for Distributed Databases (SDD-1), the design of which was initiated in 1976 and completed in 1978 by the Computer Corporation of America [46]. In the 1970s, many problems and technical issues concerning distributed databases were encountered and solved, including distributed concurrency control [9], distributed query processing [22], resiliency to component failure [52], and distributed directory management [46]. Among modern distributed databases, BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. A BigTable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes [15]. BigTable provides a simple data model which gives clients dynamic control over data layout and format. Furthermore, by exploiting an analogous three-level hierarchy of a B+-tree [19] to store tablet location information, the architecture of which is shown in Figure 2.2 [15], and by building on the GFS architecture, BigTable is easy to scale up to increase capacity and offers high input/output (I/O) throughput while reducing response delay.

Figure 2.2: Table location hierarchy for BigTable [15].

The most famous imitation of BigTable is HBase, which was developed by Apache [5], along with Cassandra, which was open sourced by Facebook in 2008 and is also now being developed by Apache [4]. HBase completely inherits the architecture of BigTable and operates over HDFS, so it supports the Map/Reduce algorithm and mass data storage very well. Detailed information on HBase will be introduced in Chapter 4. Cassandra is focused on growing into a full-featured distributed database.

2.4 Distributed storage model

In this section, some distributed storage systems are introduced and distributed storage models are classified. According to the features of distributed storage systems, it is easy to classify them into different categories based on the distributed coordinator, file storage granularity, and system scalability.
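The BigTable data model quoted above — a sparse, persistent, multidimensional sorted map indexed by (row key, column key, timestamp) — which HBase inherits, can be mimicked in miniature with an in-memory dictionary. The Python sketch below models only the indexing scheme, not the distributed implementation of BigTable or HBase; the class and example keys are hypothetical:

```python
class TinyTable:
    """Toy in-memory analogue of BigTable's data model: a sparse map from
    (row key, column key, timestamp) to an uninterpreted array of bytes."""

    def __init__(self):
        self._cells = {}  # (row, column, timestamp) -> bytes

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        # Sparse: only cells actually written occupy space.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str):
        """Return the most recent value for (row, column), or None if unset."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = TinyTable()
t.put("com.example/index", "contents:", 1, b"<html>v1</html>")
t.put("com.example/index", "contents:", 2, b"<html>v2</html>")
print(t.get("com.example/index", "contents:"))  # b'<html>v2</html>'
```

Timestamps give each cell multiple versions, which is how BigTable and HBase retain history while serving the newest value by default.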

2.4.1 Distributed coordinator storage system

Serverless Network Storage (SNS) is a persistent peer-to-peer network storage application. It has four layers: operation logic; a file information protocol (FIP) that exploits XML-formatted messages to maintain file and disk information; a proposed security layer; and a serverless layer which is responsible for routine network state information [61]. FAB [59] is a distributed enterprise disk array built from commodity components. It is a highly reliable, continuous-service enterprise storage system using two new mechanisms: a voting-based protocol and a dynamic quorum-reconfiguration protocol. With the voting-based protocol, each request makes progress after receiving replies from a random quorum of storage bricks, and by this method any brick can be a coordinator. The dynamic quorum-reconfiguration protocol changes the quorum configuration of segment groups. Researchers at the University of California have developed a distributed file system named Ceph [57], which provides excellent performance, reliability, and scalability. It occupies a unique point in the design space based on CRUSH, which separates data from metadata management, and RADOS [47], which exploits intelligent object storage devices (OSDs) to manage data without any central servers. EBOFS [56] provides more appropriate semantics and superior performance by addressing the workloads and interface. Finally, Ceph's scalable approach of dynamic sub-tree partitioning offers both efficiency and the ability to adapt to a varying workload. RADOS is a reliable, automatic, distributed object store that exploits device intelligence to distribute consistent data access and provide redundant storage, failure detection, and failure recovery in clusters. Both Ceph and RADOS target a high-performance cluster or data center

environment.

2.4.2 Files storage granularity

Files storage at block level

Petal [21] offers always-available storage with unlimited performance and capacity for large numbers of clients. To a Petal client, this system provides large virtual disks. This technique is a distributed block-level storage system that tolerates and recovers from single-computer failures, dynamically balances workload, and expands capacity. Venti [43] is a network storage system. It uses unique hash values to identify block contents, and with this method it reduces the data occupation of storage space. Venti is a building block for mass storage applications and enforces a write-once policy to avoid destruction of data. This network storage system emerged in the early stage of network storage, so it is not suitable for dealing with mass data, and the system is not scalable. Panasas [13] is a scalable, parallel file system. It uses storage nodes that run an OSDFS object store and manager nodes that run a file system metadata manager, a cluster manager, and a Panasas client that can be re-exported via NFS and CIFS. By balancing the resources of each node, the system achieves scalability. The system exploits non-volatile memory to hide latency and protect caches, and has the ability to distribute metadata into block files; storage nodes and data nodes are maintained in the storage clusters to achieve good performance.
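Venti's central idea — addressing a block by a hash of its contents, so that identical blocks occupy space only once — can be sketched as follows. This is a simplified in-memory model for illustration; Venti itself computes SHA-1 fingerprints over blocks in an append-only, write-once log on disk:

```python
import hashlib

class BlockStore:
    """Content-addressed, write-once block store in the style of Venti:
    a block's address is the SHA-1 hash of its contents, so writing the
    same block twice consumes storage space only once."""

    def __init__(self):
        self._blocks = {}  # hex address -> block contents

    def write(self, data: bytes) -> str:
        addr = hashlib.sha1(data).hexdigest()
        # Write-once policy: an existing block is never overwritten.
        self._blocks.setdefault(addr, data)
        return addr

    def read(self, addr: str) -> bytes:
        return self._blocks[addr]

store = BlockStore()
a1 = store.write(b"duplicate payload")
a2 = store.write(b"duplicate payload")  # deduplicated: same address returned
assert a1 == a2 and len(store._blocks) == 1
```

Clients keep only the returned addresses; any client holding an address can retrieve the block, and duplicate writes are free by construction.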

Files storage at file level

Farsite [6] is a federated, available, and reliable storage system for an incompletely trusted environment. The core architecture of Farsite includes a collection of interacting, Byzantine-fault-tolerant replica groups, which are arranged in a tree that overlays the file system namespace hierarchy. The availability and reliability of the system are provided by randomized replicated storage; the secrecy of file contents is offered by cryptographic techniques; and the Byzantine-fault-tolerant protocol maintains the integrity of files and directories. Scalability is offered by exploiting the distributed hint mechanism and delegation certificates for path name translations.

2.4.3 System scalability

FS2You [60] is a large-scale online file sharing system. It has four main components: the directory server, tracking server, replication servers, and peers. With the peers' assistance, it makes semi-persistent files available and reduces the server bandwidth cost. FS2You exploits hash values to identify data, but does not offer a deduplication function. The Eliot file system [51] is a reliable, mutable file system based on peer-to-peer block storage. Eliot exploits a metadata service in an auxiliary replicated database which is separated and generalized to isolate all mutation and client state. The Eliot file system consists of several components: an untrusted, immutable, reliable P2P block storage substrate known as the Charles block service; a trusted, replicated database, known as the metadata service (MS), storing mutable nodes, directories, symlinks, and superblocks; a set of file system clients; and zero, one, or more cache servers. Cache servers are intended to improve performance, but are not necessary for

correctness. RUSH [30] is a family of algorithms for scalable, decentralised data distribution. RUSH maps replicated objects onto storage servers or disks to form a scalable collection, distributing decentralised objects to servers based on user-specified server weightings. There is no central directory, so different RUSH variants have different look-up times. Ursa Minor [38] is a cluster-based storage system which has four components: storage nodes, an object manager, a client library, and an NFS server. Its protocol family includes a timing model, a storage-node failure model, and a client failure model, allowing data-specific selection of, and on-line changes to, encoding schemes and fault models. In this way, Ursa Minor has achieved good results for trace, OLTP, scientific, and campus workloads.

2.5 Related work

2.5.1 Existing deduplicated storage systems

HYDRAstor [14] is a scalable secondary storage solution, which includes a back-end consisting of a grid of storage nodes with a decentralized hash index, and a traditional file system interface as a front-end. The back-end of HYDRAstor is based on a Directed Acyclic Graph (DAG), which organizes large-scale, variable-sized, content-addressed, immutable, and highly resilient data blocks. HYDRAstor detects duplications according to the hash table. This approach targets backup systems; it does not consider the situation of multiple users needing to share files. Extreme Binning [11] is a scalable, parallel deduplication approach aimed at a

non-traditional backup workload composed of low-locality individual files. Extreme Binning exploits file similarity instead of locality and requires only one disk access for chunk look-up per file. It organizes similar files into bins and deletes replicated chunks inside each bin, keeping only the primary index in memory in order to reduce RAM occupation. Since this approach is not an exact deduplication method, duplicates may still exist among different bins.

MAD2 [55] is a deduplication network backup service which works at both the file level and the chunk level. It uses four techniques to achieve high performance: a hash bucket matrix (HBM), a Bloom filter array (BFA), a dual cache, and DHT-based load balancing. This approach is not an exact deduplication method either.

DeDe [8] is a block-level deduplication cluster file system without central coordination. In the DeDe system, each host keeps summaries of the blocks it writes to the cluster file system. Each host submits its summaries to a shared index and reclaims duplications periodically and independently. These deduplication activities do not occur at the file level, and the results of deduplication are not exact.

Duplicate Data Elimination (DDE) [28] employs a combination of content hashing, copy-on-write, and lazy updates to identify and coalesce identical data blocks in a storage area network (SAN) file system. It always runs in the background.

Method of identifying duplications

From the previous study we know that the existing approaches for identifying duplications work at two different levels. One is the file level, as in MAD2; the other is the chunk level, as in HYDRAstor, Extreme Binning, DDE, and DeDe.

To handle scalable deduplication, two well-known approaches have been proposed: sparse indexing [35] and Bloom filters [62] with caching. Sparse indexing is a technique to solve the chunk look-up bottleneck, which is caused by disk access, by using sampling and exploiting the inherent locality within backup streams. It picks a small portion of the chunks in the stream as samples; the sparse index then maps these samples to the existing segments in which they occur. Incoming streams are broken up into relatively large segments, and each segment is deduplicated against only some of the most similar previous segments.

The Bloom filter approach exploits a Summary Vector, a compact in-memory data structure for identifying new segments; a Stream-Informed Segment Layout, a data layout method that improves on-disk locality for sequentially accessed segments; and Locality Preserved Caching with cache fragments, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios.

Walter Santos et al. proposed a scalable parallel deduplication algorithm [49]. This algorithm was developed in the Anthill programming environment and exploits task and data parallelism, multi-programming, and message coalescing techniques to achieve scalability. Their parallelization is based on four filters: reader comparator, blocking, classifier, and merger.
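To illustrate the Bloom-filter idea described above, the following is a minimal sketch; the sizes, the number of hash functions, and the trick of deriving several bit positions from one SHA-1 digest are our own assumptions for the example, not details of [62].

```python
import hashlib

class BloomFilter:
    """Compact in-memory summary of seen fingerprints: it may report
    false positives, but never false negatives."""
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from one SHA-1 digest (a shortcut
        # for this sketch; real systems may use independent hash functions).
        digest = hashlib.sha1(key).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"segment-fingerprint-1")
```

A negative answer is definitive, so only keys the filter flags as "possibly present" require a (slow) disk look-up; this is what makes the Summary Vector useful as a front line before the on-disk index.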

Hash algorithm

A hash value is also called a message digest, or simply a digest. Any change in the original data leads to a change in the hash value; hash values are therefore widely used in cryptographic messaging and data identification. There are several cryptographic hash functions, and the most famous are the Message-Digest algorithm (MD) series.

MD2, designed by Kaliski [32], takes a message of arbitrary length as input and outputs a 128-bit message digest; it was intended for digital signatures. MD4 was designed by Ronald Rivest in 1990 [45] and is used in message integrity checks; its digest length is also 128 bits. This algorithm influenced later cryptographic hash functions such as MD5, SHA-1, and the RACE Integrity Primitives Evaluation Message Digest (RIPEMD). However, MD4 was later broken by a full collision attack. To replace MD4, Ronald Rivest designed MD5 in 1992 [44]. MD5 has been widely used in security applications and is commonly used to check the integrity of files. However, on August 17, 2004, Xiaoyun Wang et al. announced collisions for the full MD5, later published in Ref. [54].

The National Security Agency (NSA) in the U.S.A. designed a series of cryptographic hash functions named Secure Hash Algorithm (SHA) that were published by the National Institute of Standards and Technology (NIST) [41]. They include SHA-1, SHA-2, and SHA-3. SHA-1 produces a 160-bit message digest based on principles similar to those of MD4 and MD5, but with a more conservative design; it shows strong resistance to attacks. SHA-2 provides two specifications of the message digest: one producing a 256/224-bit digest, the other a 512/384-bit digest. Furthermore, a new hash standard, SHA-3, is being developed now. Even now, there is no report of a collision attack on this series of cryptographic

hash functions, except for SHA-0, which was withdrawn shortly after its publication.

RACE Integrity Primitives Evaluation Message Digest (RIPEMD) is a 128-bit message digest algorithm designed by Hans Dobbertin based on MD4 and SHA-1. However, it was broken by Xiaoyun Wang [53]. Soon after, Dobbertin designed RIPEMD-160 as an improved hash function, and RIPEMD-128, RIPEMD-256, and RIPEMD-320 were also developed [3]. Besides these hash functions, GOST [36], HAVAL [33], Panama, and RadioGatun exist, but they have not been widely adopted.

Study of cloud storage model

In the paper by Feng et al. [24], several existing cloud storage platforms, such as the Simple Storage Service, the Secure Data Connector, and the Azure Storage Service, are analyzed with a focus on the problem of security. Furthermore, the authors pointed out the problem of repudiation, and proposed a non-repudiation protocol specifically designed for the cloud computing environment, using third-authority certification (TAC) and secret key sharing (SKS) techniques.

In the work of Fang et al. [23], the differences between the pure P2P network and the P2SP network are analyzed under the assumption that peer arrival and departure rates follow a Poisson process or Little's law. On this basis, they proved that P2SP has higher performance than P2P.

Tahoe [58] is an open source grid storage system which has been deployed in a commercial backup service and is currently operating. It uses capabilities for access control, cryptography for confidentiality and integrity, and erasure coding for fault tolerance.

Dropbox [31] is an online storage system. Dropbox allows users to sync files automatically and to share files between different users via Web services. It can be accessed by mobile devices, and its back-end is a cloud storage platform.

2.6 Summary

This chapter reviewed literature relevant to our research from four aspects. In the first section, the background on the cloud, cloud computing, and cloud storage was introduced. In Section 2.3, existing distributed file systems and distributed databases were reviewed. In Section 2.4, existing distributed storage models were analyzed from three aspects: how the master servers are distributed in the storage system, file storage granularity, and system scalability. In Section 2.5, some existing deduplication storage systems were reviewed, and the main deduplication techniques were introduced. Furthermore, related works on both the P2SP and the P2P storage models were reviewed.

Chapter 3

P2CP: Peer To Cloud And Peer

3.1 Introduction

Cloud computing is undergoing rapid development. Google, Amazon, Microsoft, and many other companies have recently been focusing on cloud computing and releasing related storage products, such as the Google file system (GFS), Amazon Elastic Compute Cloud (EC2), Azure, etc. All of these are based on cloud distributed storage models. In these models, during a download session there is no data transmission between cloud users, which decreases the utilization rate of bandwidth. The current alternative file sharing protocol, Peer to Peer (P2P), has a high utilization rate of bandwidth, but it cannot offer continuous availability. To solve these problems, we have studied several existing distributed storage models and propose in this chapter the P2CP storage model as a solution. This model exploits the P2P protocol to enhance data transmission performance and, at the same time, uses a cloud storage system to provide continuous availability. We assume that the P2CP model follows the Poisson process or Little's law. We mathematically prove that the speed and availability of P2CP is indeed better than that of the pure P2P model, the Peer to Server and Peer

(P2SP) model, or the pure cloud model.

3.2 Existing distributed storage models

In this section, we study some existing distributed storage models, including the P2P model, the P2SP model, and the cloud storage model.

Peer to peer storage model

In a pure P2P storage model, each peer is equal: peers act as both clients and servers. There is no master server to manage the network, metadata, and data. Especially in a commodity machine environment, each peer is volatile, which makes the whole network unstable; a particular problem is offering persistent availability of a specific file. Typical applications are Gnutella before version 0.4 [34], Freenet [17], Sorrento [29], etc. The architecture of the pure P2P storage model is shown in Figure 3.1.

In a P2P storage model, users get data from each other; when users join the network, they become servers, or seeds. The advantage of the P2P storage model is that it efficiently exploits the network bandwidth. However, the server or seed that holds a particular resource may not exist in the network at a given time, so the file sharing process has to stop; the disadvantage of the P2P storage model is therefore that it is hard to offer persistent data availability.

Figure 3.1: P2P Storage Model

Peer to server and peer

To solve the problem of persistent availability in the pure P2P storage model, a hybrid P2P model emerged: Peer to Server and Peer (P2SP). In this storage model, peers are divided into a client group and a server group. The client group handles the data transmission, and the server group acts as a master server to coordinate the P2P structure. However, the workload of the master servers is very heavy, and without the server group, the P2P network does not work. Typical P2SP applications are eMule [37], BitTorrent [18], FS2You [60], etc. For example, FS2You is a large-scale online storage system; with the peers' assistance, it makes semi-persistent files available and reduces the server bandwidth cost. When clients are going to download data, they first download data from the server, and then exchange data with each other. If no other peers are available, a client downloads all the data from the server. No matter whether the server side is a cluster or a distributed server system, each client connects to one single physical server

within the cluster or distributed server system. The architecture of P2SP is shown in Figure 3.2.

Figure 3.2: P2SP Storage Model

Cloud storage model

Cloud computing consists of both applications and hardware delivered to users as services via the Internet. With the rapid development of cloud computing, more and more cloud services have emerged, such as SaaS (software as a service), PaaS (platform as a service), and IaaS (infrastructure as a service).

The concept of cloud storage is derived from cloud computing. It refers to a storage device accessed over the Internet via Web service application program interfaces (APIs). There are many cloud storage systems in existence, for example, Amazon S3 (Amazon, 2006), the Google file system [27], HDFS [12], etc. The traditional cloud storage system is a scalable, reliable, and available file distribution system with high performance. These system models consist of a master node and multiple chunk servers. Data is accessed by multiple clients, and all files in the system are divided into fixed-size chunks. The master node (DFS master) maintains all file system metadata. The master node asks each chunk server about its own chunk information at master start-up and whenever a chunk server joins the cluster. Clients never read and write file data through the master; instead, they ask the master which chunk servers they should contact. The problem is that clients get data from the individual data nodes, but the clients do not have any communication among themselves. The architecture of the cloud storage model is shown in Figure 3.3.

Figure 3.3: Traditional Cloud Storage Model

3.3 A new cloud model: P2CP

We propose a new cloud storage model, the peer to cloud and peer (P2CP) model. In this model, cloud users can download data from the storage cloud and exchange data with other peers at the same time, regardless of whether the other peers

are cloud users or not.

There are three data transmission tunnels in this storage model. The first is the cloud-user data transmission tunnel, which is responsible for data transactions between the cloud storage system and the cloud users. The second is the clients' data transmission tunnel, which is responsible for data transactions between individual cloud users. The third is the common data transmission tunnel, which is responsible for data transactions between cloud users and non-cloud users.

Figure 3.4 shows an example of how the P2CP cloud model works. In Figure 3.4, cloud user 2 is downloading data from data node 1, which is in the cloud, and at the same time exchanging data with cloud user 1, cloud user 3, and common peers 2, 5, and 6. By exploiting multiple data transmission tunnels, cloud users can achieve a high download speed. On the other hand, the P2CP model avoids extremely high workloads on the cloud servers as the number of cloud users increases. When the cloud's resources are committed to other transmission activities, non-cloud users may still get access to resources in the cloud which are not shared in the P2P networks.
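The three tunnels can be sketched as follows. This is an illustrative model only; the function names, source labels, and bandwidth figures are our own assumptions, not part of the P2CP specification.

```python
# Illustrative sketch: a P2CP client aggregates blocks from three tunnels.
# All names and numbers here are hypothetical.

def plan_sources(cloud_nodes, cloud_peers, common_peers):
    """Label each available source with the tunnel it belongs to."""
    plan = []
    plan += [("cloud-user tunnel", n) for n in cloud_nodes]   # cloud <-> cloud user
    plan += [("clients tunnel", p) for p in cloud_peers]      # cloud user <-> cloud user
    plan += [("common tunnel", p) for p in common_peers]      # cloud user <-> non-cloud peer
    return plan

def aggregate_bandwidth(plan, bandwidth):
    """The total download rate is the sum over every usable source."""
    return sum(bandwidth[src] for _, src in plan)

# The Figure 3.4 scenario: cloud user 2 draws from one data node, two
# other cloud users, and three common peers simultaneously.
plan = plan_sources(["data-node-1"],
                    ["cloud-user-1", "cloud-user-3"],
                    ["peer-2", "peer-5", "peer-6"])
rate = aggregate_bandwidth(plan, {"data-node-1": 100, "cloud-user-1": 20,
                                  "cloud-user-3": 20, "peer-2": 20,
                                  "peer-5": 20, "peer-6": 20})
```

The point of the sketch is that any single tunnel failing (for example, no common peers online) only removes terms from the sum; the cloud-user tunnel alone still guarantees availability.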

Figure 3.4: P2CP storage model

Other existing models, such as Groove [42] (comparable to Microsoft SharePoint [16]) and Tahoe [58], tend to balance loads between peers and cloud servers in different ways. However, in our P2CP model, peers may communicate directly and flexibly with each other without tight dependence on servers, though some advanced features such as backup, caching, security, and versioning of data may still be delegated to servers, because peers' storage and computing capacities are supposedly inferior to those of cloud servers.

Before the comparison, we briefly describe the roles and functions in the pure P2P storage model, the P2SP storage model, and the cloud storage model. In the pure P2P storage model, peers are divided into seeds, denoted by S, and leeches, denoted by L. Initially, seeds have the whole file and leeches do not have any block of it, but as time passes, leeches obtain blocks and exchange blocks with other peers. When the leeches get the whole file, they may leave the network or stay in the network as seeds. The P2SP storage model differs in that it has a server group. In the cloud storage model, there are normally three replicas of the file on different data nodes, and each data node keeps different blocks of the file. In the P2CP storage model, the storage cloud replaces the role of the server in the P2SP model.

3.4 Comparison

In this section, we evaluate our P2CP storage model against the three key storage models described in Section 3.2: the pure P2P model, the cloud model, and the P2SP model. For network storage models, the two most important performance parameters are average downloading time and usability. We will compare all these storage models in terms of average downloading time and usability by means of a mathematical model. We assume:

Seed: each seed's upload bandwidth is $U_s$; the number of seeds is $N_s$.
Peer: each peer's upload bandwidth is $U_p$; the number of peers is $N_p$.

Server: each server's average upload bandwidth is $U_{se}$; the number of servers is $N_{se}$. The average number of peers and seeds is $N$.
Cloud: each data node's upload bandwidth is $U_c$; the number of data nodes is $N_c$.
$F$ is the size of the file. $T$ is the average downloading time. $t$ is the current time at which data transmission happens. $O$ is usability. $U$ is the average upload bandwidth of peers and seeds.
$\lambda$: arrival rate of peers joining the network.
$\mu$: departure rate of peers leaving the network. $\lambda$ must be greater than $\mu$; otherwise, the P2P network will not persist.

Comparison based on Poisson process

The Poisson process is very useful for modelling purposes in many practical applications, and has been empirically found to approximate well many circumstances arising in stochastic processes [1]. We assume that peers arrive and leave approximately according to a Poisson process; this assumption is consistent with the literature [23]. The numbers of peers and seeds existing in the pure P2P network are modeled on the M/G/$\infty$ queue. We assume that two peers constitute the smallest pure P2P network; the smallest P2SP network includes one server and one peer; the smallest cloud includes one master node and one data node; and the smallest P2CP network includes one smallest cloud and one peer. The number of peers and seeds in the pure P2P network as time goes on is:

$$N = (\lambda - \mu)t \quad (3.1)$$

If a peer takes time $T$ to download a file of size $F$ in the P2P network, we get:

$$\int_0^T \sum_{k=1}^{N_s} U_s \, dt = F \quad (3.2)$$

$$\frac{1}{2}\left[(\lambda-\mu)U_s\right] T^2 = F \quad (3.3)$$

If it takes a peer time $T$ to download a file of size $F$ in the P2SP network, we get:

$$\int_0^T \sum_{k=1}^{N_s} U_s \, dt + T\sum_{k=1}^{N_{se}} U_{se} = F \quad (3.4)$$

$$\frac{1}{2}\left[(\lambda-\mu)U_s\right] T^2 + T\sum_{k=1}^{N_{se}} U_{se} = F \quad (3.5)$$

If it takes a peer time $T$ to download a file of size $F$ in the P2CP network, we get:

$$\int_0^T \sum_{k=1}^{N_s} U_s \, dt + T\sum_{k=1}^{N_c} U_c = F \quad (3.6)$$

$$\frac{1}{2}\left[(\lambda-\mu)U_s\right] T^2 + T\sum_{k=1}^{N_c} U_c = F \quad (3.7)$$

If it takes a peer time $T$ to download a file of size $F$ in the cloud system, we get:

$$N_c U_c T = F \quad (3.8)$$

The relationship between the number of cloud host servers and relative throughput is volatile; following a normal cloud storage system configuration, we adopt (3.9) for convenience of computation:

$$N_c = 3N_{se} \quad (3.9)$$

We assume that:

$$A = (\lambda-\mu)U \quad (3.10)$$

$$B = \sum_{k=1}^{N_{se}} U_{se}, \qquad C = \sum_{k=1}^{N_c} U_c = 3\sum_{k=1}^{N_{se}} U_{se} \quad (3.11)$$

According to (3.9) and (3.11), we get:

$$C = 3B \quad (3.12)$$

According to (3.3), (3.5), (3.7), and (3.10)–(3.12), we get:

$$\frac{A}{2}T^2 - F = 0 \quad (3.13)$$

$$\frac{A}{2}T^2 + BT - F = 0 \quad (3.14)$$

$$\frac{A}{2}T^2 + CT - F = \frac{A}{2}T^2 + 3BT - F = 0 \quad (3.15)$$

Then, solving equations (3.8) and (3.13)–(3.15), we get:

$$T_c = \frac{2F}{6U_{se}N_{se}} \quad (3.16)$$

$$T_{P2P} = \sqrt{\frac{2F}{A}} \quad (3.17)$$

$$T_{P2SP} = \frac{2F}{\sqrt{B^2 + 2FA} + B} \quad (3.18)$$

$$T_{P2CP} = \frac{2F}{\sqrt{C^2 + 2FA} + C} = \frac{2F}{\sqrt{9B^2 + 2FA} + 3B} \quad (3.19)$$

For our comparative purposes, we assume that the size of the file is 100,000 KB, the upload bandwidth of the peers and seeds is 20 KB/s, the upload bandwidth of a server is 100 KB/s, the arrival rate of peers is 2 peers/s, and the departure rate is 1 peer/s. When the peer arrival rate is lower than the departure rate, the number of peers and seeds goes to zero, and the P2P network ceases to exist.
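With these parameters, the closed-form download times (3.16)–(3.19) can be evaluated numerically. The following is a small sketch; the variable names are ours, and the single-server P2SP setup is an assumption made for the example.

```python
import math

# Parameters from the comparison above.
F = 100_000          # file size, KB
U = 20               # peer/seed upload bandwidth, KB/s
U_se, N_se = 100, 1  # one server at 100 KB/s (assumed smallest P2SP setup)
lam, mu = 2, 1       # peer arrival and departure rates, peers/s

A = (lam - mu) * U   # eq. (3.10)
B = N_se * U_se      # eq. (3.11): total server upload bandwidth
C = 3 * B            # eq. (3.12): cloud bandwidth, with N_c = 3 N_se

T_p2p = math.sqrt(2 * F / A)                          # eq. (3.17)
T_p2sp = 2 * F / (math.sqrt(B**2 + 2 * F * A) + B)    # eq. (3.18)
T_p2cp = 2 * F / (math.sqrt(C**2 + 2 * F * A) + C)    # eq. (3.19)
```

For these inputs the sketch yields roughly 100 s for P2P, 95 s for P2SP, and 86 s for P2CP, matching the ordering discussed below.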

Figure 3.5 clearly shows that the alternative with the minimal download time is P2CP; the maximum download time is found with P2P, and P2SP falls in the middle. When there are not many peers, the difference in download time is quite obvious. As more peers join the network, the download time decreases.

Figure 3.5: Time for download

With the growth of the peers' upload bandwidth, we run another test. Assume that the size of a file is 100,000 KB, and the upload bandwidths for peers and seeds

are 20 KB/s, 40 KB/s, or 60 KB/s, while the upload bandwidth of the server is 100 KB/s. The arrival rate of peers is 2 peers/s, and the departure rate is 1 peer/s. Figure 3.6 shows that as the peers' upload bandwidth increases, the download time decreases correspondingly. At the same time, the differences in download time between P2P, P2SP, and P2CP are also reduced. The pure cloud storage model's performance is not shown in Figure 3.6, because its result varies significantly: in some instances it outperforms the P2P and P2SP models, depending on the chunk distribution in the cloud storage system, but it never outperforms our P2CP storage model.

Figure 3.6: Comparing download time

Comparison based on Little's law

It is difficult to prove that the peer and seed arrival and departure processes accurately follow a Poisson process. Therefore we use Little's law, which relates L (the average number of peers in the system), λ (the arrival rate), and W (the sojourn time) [1], as in (3.20):

$$L = \lambda W \quad (3.20)$$

Based on Little's law, we can get:

$$N = (\lambda - \mu)t \quad (3.21)$$

According to (3.10) and (3.21), we obtain:

$$A = \frac{\sum_{k=1}^{N} U}{T} = \frac{NU}{T} \quad (3.22)$$

Then, according to these equations, we obtain:

$$T_c = \frac{2F}{6B} \quad (3.23)$$

$$T_{P2P} = \frac{2F}{NU} \quad (3.24)$$

$$T_{P2SP} = \frac{2F}{NU + 2B} \quad (3.25)$$

$$T_{P2CP} = \frac{2F}{NU + 2C} = \frac{2F}{NU + 6B} \quad (3.26)$$

According to (3.24), (3.25), and (3.26), we obtain:

$$T_{P2P} \geq T_{P2SP} \geq T_{P2CP} \quad (3.27)$$

Thus, the minimum download time is achieved with P2CP, then P2SP, and lastly with P2P.

3.5 Summary

This chapter first introduced several existing distributed storage models, including the P2P storage model, the P2SP storage model, and the cloud storage model, and analyzed their respective advantages and disadvantages. Secondly, based on these drawbacks, we proposed a new cloud storage model, the P2CP storage model, to improve data transmission efficiency in a distributed storage system. After that, we compared the downloading efficiency of the P2CP storage model with that of the other storage models by mathematical methods, using MATLAB to calculate the results.

Chapter 4

DeDu: A Deduplication Storage System Over Cloud Computing

4.1 Introduction

In the early 1990s, the write-once, read-many storage concept was established and widely used, typically on optical disks. The disadvantages of this storage concept were the difficulty of sharing data via the Internet and the enormous waste of storage space in keeping replications, so it fell into disuse. However, in this network storage era, based on the write-once, read-many concept, we propose a new network storage system, which we have named DeDu, to store data and save storage space. The idea is that, when a user uploads a file for the first time, the system records this file as source data, and the user receives a link file guiding users to the source data. Once the source data has been stored in the system, if the same data is uploaded by other users, the system does not accept the same data again; rather, the uploading user receives a link file to the original source

data. Users are allowed to read source data but not to write it. Under these conditions, the source data can be reused by many users, with a great saving of storage space. The architecture is shown in Figure 4.1.

Figure 4.1: The architecture of source data and link files

Identifying the duplication

There are two ways to identify duplications in a cloud storage system. One is comparing blocks or files bit by bit; the other is comparing blocks or files by hash values. The advantage of bit-by-bit comparison is that it is accurate, but it is time consuming. The advantage of comparison by hash value is that it is very fast, but there is a chance of accidental collision. The chance of accidental collision depends on the hash algorithm, but it is really very small. Furthermore, combining the MD5 and SHA-1 digests will significantly

reduce the probability. Therefore, it is entirely acceptable to use a hash function to identify duplications [19, 20].

The existing approaches for identifying duplications work at two different levels: the file level and the chunk level. At the chunk level, data streams are divided into chunks, each chunk is hashed, and all these hash values are kept in the index. The advantage of this approach is that it is convenient for a distributed file system to store chunks; the drawback is the increased quantity of hash values, which occupy more RAM and increase the look-up time. At the file level, the hash function is executed once for each file, and all hash values are kept in the index. The advantage of this approach is that it decreases the quantity of hash values significantly; the drawback is that the hash function becomes somewhat slow when dealing with a large file.

In this thesis, our deduplication method works at the file level, based on comparison by hash values. There are several hash algorithms, including MD5, SHA-1, and RIPEMD. We use both the SHA-1 and the MD5 algorithms to identify duplications. Although the probability of accidental collision is extremely small, we still combine MD5 and SHA-1, merging the MD5 hash value and the SHA-1 hash value into the primary value in order to avoid accidental collision. If the MD5 and SHA-1 algorithms turn out to be unsuitable for our system scale, they can be changed at any time. The reason for choosing file-level deduplication is that we want to keep the index as small as possible in order to achieve high look-up efficiency.
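A minimal sketch of such a combined fingerprint is shown below; the function name and chunked-reading strategy are our own assumptions, not DeDu's actual implementation.

```python
import hashlib
import os
import tempfile

def file_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Concatenate the MD5 and SHA-1 digests of a file to form a
    combined primary value for duplicate detection. Reading in chunks
    keeps memory use constant even for large files."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest() + sha1.hexdigest()  # 32 + 40 hex characters

# Demo on a small temporary file (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"example data")
demo_key = file_fingerprint(tf.name)
os.unlink(tf.name)
```

Two files with identical contents always yield the same 72-character key, while a false match would require simultaneous MD5 and SHA-1 collisions on the same pair of inputs.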

57 4.1. Introduction Storage mechanism We need two storage mechanisms to achieve our data access requirements. One is used to store mass data, and the other one is used to keep the index. On the one hand, there are several secondary storage systems, such as CEPH, Petal, Farsite, Sorrento, Panasas, GFS, Ursa Minor, RADOS, FAB, and HDFS, which can be used as mass data storage systems. On the other hand, there are several database systems such as SQL, Oracle, LDAP, BigTable, and HBase that can be used as index systems. All these systems have their own features, but which two systems combined together will yield the best results? With regard to the storage system requirements, in order to store masses of information, the file system must be stable, scalable, and fault-tolerant; for the index, the system must perform nicely in real time queries.

Figure 4.2: Architecture of deduplication cloud storage system

Considering these requirements, we use HDFS and HBase as our storage mechanisms. The advantages of HDFS are that it can be used under high-throughput and large-dataset conditions, and it is stable, scalable, and fault-tolerant. HBase is a Hadoop database that performs well on queries. Both HDFS and HBase were developed by Apache with the aim of storing mass data, modelled on the Google File System and BigTable respectively. Considering these features, in our system we use HDFS as the storage system and HBase as the index system. We introduce how HDFS and HBase collaborate in Section 4.2. Figure 4.2 shows the architecture of the deduplication cloud storage system.

4.2 Design of DeDu

Data organization

In this system, HDFS and HBase must collaborate to guarantee that the system works well. Figure 4.3 shows the collaborative relationship between HDFS and HBase.

Figure 4.3: Collaborative relationship of HDFS and HBase [12]

There are two types of files saved in HDFS: source files and link files. We separate source files and link files into different folders. Figure 4.4 shows the architecture of the data organization.

Figure 4.4: Data organization in DeDu

In the DeDu system, each source file is identified by its primary value and saved in a folder named by date. As for the link file, its filename has the form ABC.ext.lnk, where ABC is the original name of the source file, and ext is the original extension of the source file. Every link file records the hash value of its source file and the logical path to the source file, and it is saved in the folder created by the user. Each link file is 316 bits, and each link file is saved three times in the distributed file system.

HBase records the hash value of each file, the number of links, and the logical path to the source file. There is only one table in HBase, named dedu. The table has three columns, with the headings hash value, count, and path. Hash value is the primary key; count calculates the number of links to each source file; path records the logical path to the source file.
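One logical row of the dedu table can be pictured as follows. This is only an illustration; the digest, count, and path values are hypothetical examples, and the actual HBase schema (column families, qualifiers) is not specified here.

```python
# One logical row of the 'dedu' index table (all values hypothetical).
# The row key is the merged MD5+SHA-1 digest (32 + 40 hex characters);
# 'count' tracks how many link files point at the source file, and
# 'path' points at the single stored copy in HDFS.
dedu_row = {
    "hash value": "9e107d9d372bb6826bd81d3542a419d6"      # example MD5 part
                  "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",  # example SHA-1 part
    "count": 3,
    "path": "/dedu/source/2011-03-30/file-0001",
}

row_key = dedu_row["hash value"]
```

Because the row key is the combined digest, a single index look-up answers both "has this file been seen before?" and "where is its one stored copy?".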

61 4.2. Design of DeDu Storage of the files In this system, there are three main steps to save a file. Firstly, make a hash value at the client; secondly, identify any duplication; thirdly, save the file. Figure 4.5 shows the procedures for storing a file. Firstly, users select the files or folders which are going to be uploaded and saved by using a DeDu application. The application use the MD5 and SHA-1 hash functions to calculate the file s hash value, and then pass the value to HBase. Secondly, the table dedu, in HBase keeps all file hash values. HBase is operated under the HDFS environment. It will compare the new hash value with the existing values. If it does not already exist, a new hash value will be recorded in the table, and then HDFS will ask clients to upload the files and record the logical path; if it does already exist, HDFS will check the number of links, and if the number is not zero, the counter will be incremented by one. In this case, HDFS will tell the clients that the file has been saved; if the number is zero, HDFS will ask the client to upload the file and update the logical path. Thirdly, HDFS will store source files, which are uploaded by users, and corresponding link files, which are automatically made by DeDu and record the source file s hash value and the logical path of the source file.

Figure 4.5: Procedures for storing a file.
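The three-way branch in the storage procedure can be sketched with an in-memory stand-in for the dedu table; the names and the `upload` callback are our own, not DeDu's API.

```python
# In-memory stand-in for the HBase 'dedu' table:
# maps primary hash value -> {"count": number of links, "path": logical path}.
dedu = {}

def store_file(hash_value: str, upload) -> str:
    """Mimic DeDu's decision when a client offers a file.
    `upload` is a callback that transfers the file and returns its path."""
    row = dedu.get(hash_value)
    if row is None:
        # New hash: record it, ask the client to upload, remember the path.
        dedu[hash_value] = {"count": 1, "path": upload()}
        return "uploaded"
    if row["count"] > 0:
        # Duplicate with a live source file: just add one more link.
        row["count"] += 1
        return "deduplicated"
    # Hash known but no live links (count == 0): re-upload, refresh path.
    row["count"] = 1
    row["path"] = upload()
    return "re-uploaded"

first = store_file("abc123", lambda: "/source/2011-03-30/abc123")
second = store_file("abc123", lambda: "/source/2011-03-30/abc123")
```

Only the first offer of a given content triggers a transfer; every later offer of the same content costs one index update instead of one upload.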

63 4.2. Design of DeDu Access to the files In our system, we use a special approach to access a file, which is the link file. Each link file records two things: the hash value and the logical path to the source file. When clients access the file, they first access the link file, and the link file will pass the logical path of the source file to HDFS. HDFS will then ask the master node for the block locations. When the clients get the block locations, they can retrieve the source file from the data nodes. Figure 4.6 shows the procedures to access a file. Figure 4.6: Procedures to access a file Deletion of files In our system, there are two types of models for deletion: in one case, the file is pseudo-deleted, and in the other case, it is fully-deleted. This is because, in our system, different users may have the same right to access and control the same file. We

64 4.2. Design of DeDu 50 don t allow one user to delete a source file which is shared by other users, so we use pseudo-deletion and full-deletion to solve this problem. When a user deletes a file, the system will delete the link file which is owned by the user, and the number of links will be decremented by one. This means that this particular user loses the right to access the file, but the source files are still stored in HDFS. The file is pseudo-deleted. A source file may have many link files pointing to it, so while the user may delete the one link file, this has no impact on the source file. When the last link file has been deleted, however, the source file will be deleted; the file is now fully-deleted. Figure 4.7 shows the procedures for deleting a file. Figure 4.7: Procedures to delete a file.
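The pseudo- vs full-deletion rule reduces to a link counter; a minimal illustrative sketch (names invented, not DeDu's API):

```java
public class DeduDelete {
    // Returns "PSEUDO_DELETED" while other links remain, "FULLY_DELETED"
    // when the last link is removed (DeDu would then delete the HDFS file).
    static String deleteLink(int[] linkCount) {
        linkCount[0]--;                 // remove this user's link file
        return linkCount[0] > 0 ? "PSEUDO_DELETED" : "FULLY_DELETED";
    }

    public static void main(String[] args) {
        int[] links = {3};              // three users share one source file
        System.out.println(deleteLink(links)); // PSEUDO_DELETED (2 links left)
        System.out.println(deleteLink(links)); // PSEUDO_DELETED (1 link left)
        System.out.println(deleteLink(links)); // FULLY_DELETED
    }
}
```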

4.3 Environment of simulation

In our experiment, the cloud storage platform was set up on VMware Workstation 7.10. The host machine has a 3.32 GHz CPU, 4 GB of RAM, and a 320 GB hard disk. Five virtual machines make up the cloud storage platform, each with the same configuration: a 3.32 GHz CPU, 512 MB of RAM, and a 20 GB hard disk. The operating system is Linux Mint. The platform runs Hadoop (for HDFS) with HBase on top of it. Nodes of HDFS and HBase communicate over secure shell (SSH). The detailed information is listed in Table 4.1.

Hosts   Usage               DFS
Mint1   MasterNode, HBase   Hadoop, HBase
Mint2   DataNode            Hadoop, HBase
Mint3   DataNode            Hadoop, HBase
Mint4   DataNode            Hadoop, HBase
Mint5   DataNode            Hadoop, HBase

Table 4.1: Configuration of the virtual machines

4.4 System implementation

In this section, we introduce the implementation of DeDu in terms of packages, classes, and interfaces.

System architecture

There are five packages in the system: the main package, GUI package, command package, task package, and utility package. The relationship between these packages is shown in Figure 4.8. The main package and the GUI package invoke the command package to operate DeDu; the command package then calls the task package, and the task package invokes the utility package to execute the tasks.

Figure 4.8: Overview of the packages

Each package includes several classes that perform its functions. Figure 4.9 gives an overview of the architecture of DeDu. ICommand and ITask are the interfaces of the command package and the task package. Detailed information on the classes and their functions is given in the following section.

Figure 4.9: System Implementation

Class diagram

We describe the system classes package by package. The first is the command package. It contains one interface, ICommand, and one abstract class, AbstractCommand; ICommand serves as the interface implemented by AbstractCommand. Nine concrete classes inherit from AbstractCommand: DownloadCommand, DownloadDirCommand, DeletedCommand, HelpCommand, UpdateCommand, MkDirCommand, RmDirCommand, InitHBaseCommand, and ListCommand. DownloadCommand is the command to download a single file. DownloadDirCommand downloads multiple files in a folder. DeletedCommand deletes files. HelpCommand displays all the help information. UpdateCommand uploads files. MkDirCommand creates a new folder. RmDirCommand deletes a folder and the files in it. InitHBaseCommand initializes HBase. ListCommand lists all the files in the system.
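The structure described above (one interface, one abstract base class, concrete commands inheriting from it) is the classic command pattern; a compilable toy sketch, with placeholder bodies rather than DeDu's real logic:

```java
public class CommandDemo {
    interface ICommand { String execute(String arg); }

    static abstract class AbstractCommand implements ICommand {
        // Shared behaviour (logging, argument checks) would live here.
        protected String name() { return getClass().getSimpleName(); }
    }

    static class HelpCommand extends AbstractCommand {
        public String execute(String arg) { return name() + ": usage info"; }
    }

    static class ListCommand extends AbstractCommand {
        public String execute(String arg) { return name() + ": listing " + arg; }
    }

    public static void main(String[] args) {
        ICommand c = new ListCommand();   // callers program against ICommand
        System.out.println(c.execute("/")); // ListCommand: listing /
    }
}
```

The GUI and command-line front ends only ever see ICommand, which is why new commands can be added without touching the callers.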

Figure 4.10: Commands Package

The task package is shown in Figure 4.11. It contains one interface, ITask, and one abstract class, AbstractTask; ITask serves as the interface implemented by AbstractTask. Nine concrete classes inherit from AbstractTask: DownloadTask, DownloadDirTask, DeletedTask, MkDirTask, RmDirTask, UploadTask, UploadDirTask, ListTask, and ListDirTask.

Figure 4.11: Task Package

DownloadTask performs the function of downloading a single file. DownloadDirTask downloads multiple files in a folder. DeletedTask deletes files. MkDirTask creates a new folder. RmDirTask deletes a folder and the files in it. ListTask lists all the files in the system. ListDirTask lists all the files and folders in the system. UploadTask uploads a single file. UploadDirTask uploads multiple files from the same folder.

The utility package contains one abstract class, named HashUtil, one interface, HashUtilFactory, and five classes: XMLUtil, HBaseManager, HDFSUtil, SHAUtil, and MD5Util. The structure of the utility package is shown in Figure 4.12.

Figure 4.12: Utility Package

In this package, we used the factory method pattern to build up the hash function. HashUtilFactory is the factory interface, and HashUtil decides which hash algorithm should be invoked. The SHAUtil and MD5Util classes implement the two algorithms for calculating hash values. The XMLUtil class is responsible for recording the configuration of HDFS and HBase. The HBaseManager class communicates with HBase and operates it to execute tasks, and the HDFSUtil class does the same for HDFS.
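The factory-method arrangement can be sketched as follows. Class names follow the thesis (HashUtil, MD5Util, SHAUtil), but the bodies and the `create` factory method are illustrative assumptions, not DeDu's actual code:

```java
import java.security.MessageDigest;

public class HashFactoryDemo {
    static abstract class HashUtil {
        abstract String algorithm();

        // Hex-encode the digest; concrete subclasses only name the algorithm.
        String hash(byte[] data) {
            try {
                StringBuilder sb = new StringBuilder();
                for (byte b : MessageDigest.getInstance(algorithm()).digest(data))
                    sb.append(String.format("%02x", b));
                return sb.toString();
            } catch (java.security.NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    static class MD5Util extends HashUtil { String algorithm() { return "MD5"; } }
    static class SHAUtil extends HashUtil { String algorithm() { return "SHA-1"; } }

    // The factory method: callers never name a concrete class directly.
    static HashUtil create(String kind) {
        return kind.equals("MD5") ? new MD5Util() : new SHAUtil();
    }

    public static void main(String[] args) {
        System.out.println(create("MD5").hash("abc".getBytes()));
        // MD5("abc") = 900150983cd24fb0d6963f7d28e17f72
    }
}
```

Swapping MD5 for SHA-1 (or adding a new algorithm) then only touches the factory, not the upload code that consumes hash values.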

Interface

To operate DeDu conveniently, we developed a graphical user interface (GUI). In this section, we briefly describe its features. Figure 4.13 shows DeDu's main interface before connecting to HDFS and HBase.

Figure 4.13: DeDu's Main Interface

To connect, the user clicks Options and then selects the connection configuration. A new configuration window with two tabs, named HDFS and HBase, comes up to set up the connection. Figures 4.14 and 4.15 show the connection configuration interfaces for HDFS and HBase.

Figure 4.14: Configure HDFS

Figure 4.15: Configure HBase

After connecting to HDFS and HBase, DeDu's main interface shows the file status on the right side. Figure 4.16 displays the situation after the connection has been set up.

Figure 4.16: After connecting to HDFS and HBase

Uploading files into DeDu is easy: the user simply drags and drops the selected file or folder from the left side to the right side. Figure 4.17 displays the process of uploading a file.

Figure 4.17: Uploading process.

Downloading files to a local disk works the same way: the user drags and drops a file or folder from the right side to the left side. Figure 4.18 displays the downloading process.

Figure 4.18: Downloading process.

To delete a file or folder, the user right-clicks the file or folder to be deleted and selects the delete option. Figure 4.19 displays the deletion process.

Figure 4.19: Deletion process

4.5 Summary

This chapter first introduced DeDu, the deduplication storage system, and described its method of identifying duplication and its storage mechanisms. Second, in section 4.2, it described the design of DeDu, covering data organization and the procedures for uploading, accessing, and deleting files. Third, section 4.3 briefly described the simulation environment. Section 4.4 detailed the system implementation, covering the system architecture, classes, and interfaces.

Chapter 5

Experiment Results and Evaluation

5.1 Evaluations and summary of P2CP

For a storage service, availability and speed are high-priority considerations. In the previous section, we showed that the speed of P2CP is superior. In this section, we compare the availability of P2CP with that of the other models, from the point of view of the whole network and of shared resources.

According to the work of [7], hardware failures occur frequently in clusters; failures of servers and other hardware are expected in any networked environment. In our comparative evaluation, we assume that the failure rate of each peer is 1% and the failure rate of each server is 0.1%. We assume that two peers constitute the smallest pure P2P network; the smallest P2SP network comprises one server and one peer; the smallest cloud comprises one master node and one data node; and the smallest P2CP network comprises one smallest cloud and one peer.

Evaluation from network model availability

From the point of view of whole-network availability, and based on the assumptions above, we can see that:

In the P2P network, as long as one peer remains, a user who connects to that peer keeps the P2P network alive. Thus the maximum failure rate of the pure P2P network is 1%.

In the P2SP network, the failure of one machine will not bring down the whole network. If the server is shut down, the network degrades to a P2P network; if the peer goes offline, the network degrades to client-server. Only when both the server and the peer break down at the same time will the whole network shut down. Thus the maximum failure rate of the P2SP network is 1% * 0.1% = 0.001%.

For the cloud network, by its design, no single failure of a master node or data node fully disables the network; it fails only when both the master node and the data node are broken at the same time. So the maximum failure rate of the cloud network is 1% * 0.1% = 0.001%.

The P2CP network can keep running without peers, or despite failures of the master node or data nodes; only when all peers are gone and both the master node and all data nodes are broken will the whole P2CP network shut down. So the failure rate of P2CP is 0.1% * 1% * 1% = 0.00001%.

Thus, even in the worst network situation, the most stable network storage model is P2CP.
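The failure-rate arithmetic above can be checked mechanically: a network fails only when every component that could keep it alive fails, so the independent rates multiply. NODE is the 1% rate assumed for peers and data nodes, MASTER the 0.1% rate assumed for servers/master nodes.

```java
public class FailureRates {
    static final double NODE = 0.01, MASTER = 0.001;

    static double p2p()   { return NODE; }                 // last peer fails
    static double p2sp()  { return NODE * MASTER; }        // peer AND server fail
    static double cloud() { return NODE * MASTER; }        // data node AND master fail
    static double p2cp()  { return NODE * NODE * MASTER; } // peer, data node, master

    public static void main(String[] args) {
        System.out.printf("P2P %.3f%%  P2SP %.3f%%  cloud %.3f%%  P2CP %.5f%%%n",
                100 * p2p(), 100 * p2sp(), 100 * cloud(), 100 * p2cp());
        // P2P 1.000%  P2SP 0.001%  cloud 0.001%  P2CP 0.00001%
    }
}
```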

Evaluation from particular resource availability

From the point of view of a particular shared resource, however, storage services follow the long-tail law [2]: a particular resource may be very popular at the beginning and then in low demand for a long time. In the P2P storage model, while the resource is popular it is frequently downloaded and uploaded across the network, so users can access it easily. Once it is no longer popular, however, and the peers who hold it leave, the P2P network still exists but the resource is no longer available.

Both the cloud storage model and the P2SP storage model solve this problem: they use a series of servers, or a single server, to keep the particular resource and guarantee its availability, but with different transmission efficiency. In the P2SP model, transmission efficiency is high while the resource is popular and low once it is not; the cloud storage model shows the opposite behaviour. Only the P2CP storage model achieves the best result: whether or not the particular resource is in fashion, availability and speed remain good.

Summary of P2CP

In summary, the evaluation results clearly show that, in every situation considered, the time cost of P2CP is the lowest and its usability is the highest. P2CP is a new cloud storage model, developed through this research to enhance data transmission performance and provide persistent availability. P2CP not only addresses the bandwidth utilization problem of cloud storage systems, but also the persistent availability problem of the pure P2P model. Using a mathematical model, we showed that, in the same network environment, the bandwidth utilization and persistent availability of the P2CP model are better than those of the pure P2P model, the P2SP model, and the cloud model.

5.2 Performance of DeDu

Deduplication efficiency

In our experiment, we uploaded 110,000 files into DeDu. In a traditional storage system they would occupy their full logical size, as shown by the blue line in Figure 5.1, and in a traditional distributed file system with three-way replication, both the physical storage space and the number of files would be three times larger still, that is, 330,000 files. (We did not show this in the figure, because the scale of the figure would become too large.) In a perfect deduplication distributed file system, the data should take up 37 GB, with 3,200 files; in DeDu, we achieved 38.1 GB, just 1.1 GB more than the perfect situation. The extra 1.1 GB is occupied by link files and the dedu table, which is saved in HBase. Figure 5.1 shows the deduplication efficiency.

Figure 5.1: Deduplication efficiency.
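A back-of-envelope model of how replication interacts with deduplication (all sizes illustrative): with a replication factor of 3, a file saved k times costs 3 physical copies plus k small link files under deduplication, versus 3k copies without it, so the savings appear only for k > 1 and grow with k.

```java
public class DedupSavings {
    static final int REPLICAS = 3;   // HDFS-style three-way replication

    // Physical bytes used for one unique file of `size` bytes saved `k` times.
    static long withDedup(long size, int k, long linkFileSize) {
        return REPLICAS * size + (long) k * linkFileSize; // 3 copies + k links
    }

    static long withoutDedup(long size, int k) {
        return (long) REPLICAS * size * k;                // 3 copies per save
    }

    public static void main(String[] args) {
        long size = 100L << 20;          // a 100 MB file, 1 KB link files
        for (int k : new int[]{1, 3, 10})
            System.out.printf("k=%d: dedup %d MB vs plain %d MB%n", k,
                    withDedup(size, k, 1024) >> 20, withoutDedup(size, k) >> 20);
    }
}
```

For k = 10 the deduplicated cost stays near 300 MB while the plain cost reaches 3000 MB, matching the observation that savings grow with the number of times the data is saved.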

By using the distributed hash index, an exact deduplication result is achieved. In our storage system, each file is kept in three copies on different data nodes, as backup in case some data nodes fail. This means that if a file is saved into the system fewer than three times, the deduplication efficiency is low; when a file is saved more than three times, the efficiency becomes high. Thus the exact deduplication efficiency depends on both the duplication ratio of the original data and how many times the original data is saved: the higher the duplication ratio, and the greater the number of times the data is saved, the greater the deduplication efficiency.

Balance of load

Because each data node keeps a different number of blocks, and clients fetch data directly from the data nodes, we must keep an eye on load balance, lest some data nodes be overloaded while others sit idle.

Static load balance

Figure 5.2 shows the balance across the four data nodes. Without deduplication, DN1 (mint2) stores 116 gigabytes of data; DN3 (mint4) stores 114 gigabytes; DN2 (mint3) and DN4 (mint5) each store 115 gigabytes. With deduplication, DN2 stores 6.95 gigabytes, DN3 stores 6.79 gigabytes, and DN1 and DN4 each store 7 gigabytes. In a perfectly balanced deduplicated situation, each node would store 6.9 gigabytes. Each data node clearly stores a different amount of data, whether at the hundred-gigabyte level or at the level of a few gigabytes. The usage of storage space in each node differs, but the differences between numbers of blocks, and between space occupied, in the same situation are never more than 10%.

Figure 5.2: Static load balance

Dynamic load balance

When we remove a data node or add a new one, DeDu rebalances automatically. The default rebalancing bandwidth is 1 MB/s, so the balancing is slow unless the balance command is entered manually. Figure 5.3 shows the deduplicated load balanced across the 3-data-node environment, as indicated by brackets, and the 4-data-node environment. In the 3-data-node environment, each data node stores 9.24 GB of data. After one more data node is added to the system, DN2 stores 6.95 gigabytes, DN3 stores 6.79 gigabytes, and DN1 and DN4 each store 7 gigabytes.
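The under-10% balance claim can be checked against the per-node figures reported above (7, 6.95, 6.79 and 7 GB after deduplication): the spread between the most- and least-loaded data nodes, relative to the mean load, stays well inside the bound.

```java
public class LoadBalance {
    // (max - min) / mean for a set of per-node loads, in the same units.
    static double spreadRatio(double[] gb) {
        double min = Double.MAX_VALUE, max = 0, sum = 0;
        for (double g : gb) {
            min = Math.min(min, g);
            max = Math.max(max, g);
            sum += g;
        }
        return (max - min) / (sum / gb.length);
    }

    public static void main(String[] args) {
        double[] dedup = {7.0, 6.95, 6.79, 7.0};   // GB per data node
        System.out.printf("spread = %.1f%% of the mean%n", 100 * spreadRatio(dedup));
    }
}
```

For these figures the spread is about 3% of the mean, consistent with the stated 10% ceiling.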

Figure 5.3: Dynamic load balance

Reading efficiency

This section reports the system's reading efficiency in two situations: with two data nodes and with four data nodes. In the two-data-node situation, we tested the system with two data streams. First, we downloaded 295 items amounting to 3.3 GB in 356 seconds, a download speed of 9.49 MB/s. Second, we downloaded 22 items amounting to 9.2 GB in 510 seconds, a download speed of 18.47 MB/s. In the four-data-node situation, we also tested with two data streams. First, we downloaded 295 items amounting to 3.3 GB in 345 seconds, so the speed was 9.79 MB/s. Second, we downloaded 22 items amounting to 9.2 GB in 475 seconds, a download speed of 19.83 MB/s. Details are given in Table 5.1.
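The throughput figures are simply size over time, with 1 GB taken as 1024 MB (the convention that reproduces the 9.49 MB/s figure given in the text):

```java
public class ReadSpeed {
    // Download speed in MB/s for `gb` gigabytes transferred in `seconds`.
    static double mbPerSec(double gb, double seconds) {
        return gb * 1024 / seconds;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f MB/s%n", mbPerSec(3.3, 356)); // 2 nodes, ~9.49
        System.out.printf("%.2f MB/s%n", mbPerSec(9.2, 510)); // 2 nodes, ~18.47
        System.out.printf("%.2f MB/s%n", mbPerSec(3.3, 345)); // 4 nodes, ~9.79
        System.out.printf("%.2f MB/s%n", mbPerSec(9.2, 475)); // 4 nodes, ~19.83
    }
}
```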


Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Patrol: Revealing Zero-day Attack Paths through Network-wide System Object Dependencies Jun Dai, Xiaoyan Sun, and Peng Liu College of Information Sciences and Technology Pennsylvania State University,

More information

Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication

Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication Min Shen, Ajay D. Kshemkalyani, TaYuan Hsu University of Illinois at Chicago Min Shen, Ajay D. Kshemkalyani, TaYuan Causal

More information

Introduction to Portal for ArcGIS. Hao LEE November 12, 2015

Introduction to Portal for ArcGIS. Hao LEE November 12, 2015 Introduction to Portal for ArcGIS Hao LEE November 12, 2015 Agenda Web GIS pattern Product overview Installation and deployment Security and groups Configuration options Portal for ArcGIS + ArcGIS for

More information

Introduction to Portal for ArcGIS

Introduction to Portal for ArcGIS Introduction to Portal for ArcGIS Derek Law Product Management March 10 th, 2015 Esri Developer Summit 2015 Agenda Web GIS pattern Product overview Installation and deployment Security and groups Configuration

More information

CHAPTER 3 RESEARCH METHODOLOGY

CHAPTER 3 RESEARCH METHODOLOGY CHAPTER 3 RESEARCH METHODOLOGY 3.1 INTRODUCTION The research methodology plays an important role in implementing the research and validating the results. Therefore, this research methodology is derived

More information

GENERALIZATION IN THE NEW GENERATION OF GIS. Dan Lee ESRI, Inc. 380 New York Street Redlands, CA USA Fax:

GENERALIZATION IN THE NEW GENERATION OF GIS. Dan Lee ESRI, Inc. 380 New York Street Redlands, CA USA Fax: GENERALIZATION IN THE NEW GENERATION OF GIS Dan Lee ESRI, Inc. 380 New York Street Redlands, CA 92373 USA dlee@esri.com Fax: 909-793-5953 Abstract In the research and development of automated map generalization,

More information

ww.padasalai.net

ww.padasalai.net t w w ADHITHYA TRB- TET COACHING CENTRE KANCHIPURAM SUNDER MATRIC SCHOOL - 9786851468 TEST - 2 COMPUTER SCIENC PG - TRB DATE : 17. 03. 2019 t et t et t t t t UNIT 1 COMPUTER SYSTEM ARCHITECTURE t t t t

More information

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM SPATIAL DATA MINING Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM INTRODUCTION The main difference between data mining in relational DBS and in spatial DBS is that attributes of the neighbors

More information

STRIBOB : Authenticated Encryption

STRIBOB : Authenticated Encryption 1 / 19 STRIBOB : Authenticated Encryption from GOST R 34.11-2012 or Whirlpool Markku-Juhani O. Saarinen mjos@item.ntnu.no Norwegian University of Science and Technology Directions in Authentication Ciphers

More information

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics

Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Department of Computer Science and Engineering University of Texas at Arlington Arlington, TX 76019 Event Operators: Formalization, Algorithms, and Implementation Using Interval- Based Semantics Raman

More information

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures Guangyan Zhang, Zican Huang, Xiaosong Ma SonglinYang, Zhufan Wang, Weimin Zheng Tsinghua University Qatar Computing Research

More information

Portal for ArcGIS: An Introduction. Catherine Hynes and Derek Law

Portal for ArcGIS: An Introduction. Catherine Hynes and Derek Law Portal for ArcGIS: An Introduction Catherine Hynes and Derek Law Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options and groups Portal for ArcGIS

More information

Foundations of Network and Computer Security

Foundations of Network and Computer Security Foundations of Network and Computer Security John Black Lecture #6 Sep 8 th 2005 CSCI 6268/TLEN 5831, Fall 2005 Announcements Quiz #1 later today Still some have not signed up for class mailing list Perhaps

More information

Distributed Architectures

Distributed Architectures Distributed Architectures Software Architecture VO/KU (707023/707024) Roman Kern KTI, TU Graz 2015-01-21 Roman Kern (KTI, TU Graz) Distributed Architectures 2015-01-21 1 / 64 Outline 1 Introduction 2 Independent

More information

Orbital Insight Energy: Oil Storage v5.1 Methodologies & Data Documentation

Orbital Insight Energy: Oil Storage v5.1 Methodologies & Data Documentation Orbital Insight Energy: Oil Storage v5.1 Methodologies & Data Documentation Overview and Summary Orbital Insight Global Oil Storage leverages commercial satellite imagery, proprietary computer vision algorithms,

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra

More information

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74 V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:

More information

THE DESIGN AND IMPLEMENTATION OF A WEB SERVICES-BASED APPLICATION FRAMEWORK FOR SEA SURFACE TEMPERATURE INFORMATION

THE DESIGN AND IMPLEMENTATION OF A WEB SERVICES-BASED APPLICATION FRAMEWORK FOR SEA SURFACE TEMPERATURE INFORMATION THE DESIGN AND IMPLEMENTATION OF A WEB SERVICES-BASED APPLICATION FRAMEWORK FOR SEA SURFACE TEMPERATURE INFORMATION HE Ya-wen a,b,c, SU Fen-zhen a, DU Yun-yan a, Xiao Ru-lin a,c, Sun Xiaodan d a. Institute

More information

The Emerging Role of Enterprise GIS in State Forest Agencies

The Emerging Role of Enterprise GIS in State Forest Agencies The Emerging Role of Enterprise GIS in State Forest Agencies Geographic Information System (GIS) A geographic information system (GIS) is a computer software system designed to capture, store, manipulate,

More information

2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes

2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or

More information

Spatial Data Infrastructure Concepts and Components. Douglas Nebert U.S. Federal Geographic Data Committee Secretariat

Spatial Data Infrastructure Concepts and Components. Douglas Nebert U.S. Federal Geographic Data Committee Secretariat Spatial Data Infrastructure Concepts and Components Douglas Nebert U.S. Federal Geographic Data Committee Secretariat August 2009 What is a Spatial Data Infrastructure (SDI)? The SDI provides a basis for

More information

ARGUS.net IS THREE SOLUTIONS IN ONE

ARGUS.net IS THREE SOLUTIONS IN ONE OVERVIEW H i g h l y c o n f i g u r a b l e s o f t w a r e a c c o m m o d a t e s a w i d e r a n g e o f c o l l e c t i o n s T h r e e s o l u t i o n s c o v e r P o r t a l s, C o l l e c t i o

More information

416 Distributed Systems

416 Distributed Systems 416 Distributed Systems RAID, Feb 26 2018 Thanks to Greg Ganger and Remzi Arapaci-Dusseau for slides Outline Using multiple disks Why have multiple disks? problem and approaches RAID levels and performance

More information

The science behind these computers originates in

The science behind these computers originates in A Methodology for Quantum Risk Assessment Author: Dr. Michele Mosca & John Mulholland DISRUPTIVE TECHNOLOGY INTRODUCTION Until recently, quantum computing was often viewed as a capability that might emerge

More information

Leveraging Web GIS: An Introduction to the ArcGIS portal

Leveraging Web GIS: An Introduction to the ArcGIS portal Leveraging Web GIS: An Introduction to the ArcGIS portal Derek Law Product Management DLaw@esri.com Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options

More information

Mass Asset Additions. Overview. Effective mm/dd/yy Page 1 of 47 Rev 1. Copyright Oracle, All rights reserved.

Mass Asset Additions.  Overview. Effective mm/dd/yy Page 1 of 47 Rev 1. Copyright Oracle, All rights reserved. Overview Effective mm/dd/yy Page 1 of 47 Rev 1 System References None Distribution Oracle Assets Job Title * Ownership The Job Title [list@yourcompany.com?subject=eduxxxxx] is responsible for ensuring

More information

An Algorithm for a Two-Disk Fault-Tolerant Array with (Prime 1) Disks

An Algorithm for a Two-Disk Fault-Tolerant Array with (Prime 1) Disks An Algorithm for a Two-Disk Fault-Tolerant Array with (Prime 1) Disks Sanjeeb Nanda and Narsingh Deo School of Computer Science University of Central Florida Orlando, Florida 32816-2362 sanjeeb@earthlink.net,

More information

How does ArcGIS Server integrate into an Enterprise Environment? Willy Lynch Mining Industry Specialist ESRI, Denver, Colorado USA

How does ArcGIS Server integrate into an Enterprise Environment? Willy Lynch Mining Industry Specialist ESRI, Denver, Colorado USA How does ArcGIS Server integrate into an Enterprise Environment? Willy Lynch Mining Industry Specialist ESRI, Denver, Colorado USA wlynch@esri.com ArcGIS Server Technology Transfer 1 Agenda Who is ESRI?

More information

Introduction to Google Drive Objectives:

Introduction to Google Drive Objectives: Introduction to Google Drive Objectives: Learn how to access your Google Drive account Learn to create new documents using Google Drive Upload files to store on Google Drive Share files and folders with

More information

Bentley Map Advancing GIS for the World s Infrastructure

Bentley Map Advancing GIS for the World s Infrastructure Bentley Map Advancing GIS for the World s Infrastructure Presentation Overview Why would you need Bentley Map? What is Bentley Map? Where is Bentley Map Used? Why would you need Bentley Map? Because your

More information

md5bloom: Forensic Filesystem Hashing Revisited

md5bloom: Forensic Filesystem Hashing Revisited DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS

More information

SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY

SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY SPATIAL INFORMATION GRID AND ITS APPLICATION IN GEOLOGICAL SURVEY K. T. He a, b, Y. Tang a, W. X. Yu a a School of Electronic Science and Engineering, National University of Defense Technology, Changsha,

More information

2: Iterated Cryptographic Hash Functions

2: Iterated Cryptographic Hash Functions 2: Iterated ryptographic Hash Functions we want hash function H : ({0, 1} n ) {0, 1} n of potentially infinite input size instead we have compression function F : {0, 1} m {0, 1} n {0, 1} n and define

More information

You are Building Your Organization s Geographic Knowledge

You are Building Your Organization s Geographic Knowledge You are Building Your Organization s Geographic Knowledge And Increasingly Making it Available Sharing Data Publishing Maps and Geo-Apps Developing Collaborative Approaches Citizens Knowledge Workers Analysts

More information

METHOD OF WCS CLIENT BASED ON PYRAMID MODEL

METHOD OF WCS CLIENT BASED ON PYRAMID MODEL METHOD OF WCS CLIENT BASED ON PYRAMID MODEL Shen Shengyu, Wu Huayi* State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 430079 Wuhan, China - shshy.whu@gmail.com,

More information

GEOGRAPHICAL INFORMATION SYSTEMS. GIS Foundation Capacity Building Course. Introduction

GEOGRAPHICAL INFORMATION SYSTEMS. GIS Foundation Capacity Building Course. Introduction GEOGRAPHICAL INFORMATION SYSTEMS. GIS Foundation Capacity Building Course. Introduction In recent times digital mapping has become part and parcel of our daily lives with experience from Google Maps on

More information

Week 12: Hash Functions and MAC

Week 12: Hash Functions and MAC Week 12: Hash Functions and MAC 1. Introduction Hash Functions vs. MAC 2 Hash Functions Any Message M Hash Function Generate a fixed length Fingerprint for an arbitrary length message. No Key involved.

More information

Canadian Board of Examiners for Professional Surveyors Core Syllabus Item C 5: GEOSPATIAL INFORMATION SYSTEMS

Canadian Board of Examiners for Professional Surveyors Core Syllabus Item C 5: GEOSPATIAL INFORMATION SYSTEMS Study Guide: Canadian Board of Examiners for Professional Surveyors Core Syllabus Item C 5: GEOSPATIAL INFORMATION SYSTEMS This guide presents some study questions with specific referral to the essential

More information

An Indian Journal FULL PAPER. Trade Science Inc. A real-time causal order delivery approach in largescale ABSTRACT KEYWORDS

An Indian Journal FULL PAPER. Trade Science Inc. A real-time causal order delivery approach in largescale ABSTRACT KEYWORDS [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 18 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(18), 2014 [10717-10723] A real-time causal order delivery approach in largescale

More information

Design and implementation of a new meteorology geographic information system

Design and implementation of a new meteorology geographic information system Design and implementation of a new meteorology geographic information system WeiJiang Zheng, Bing. Luo, Zhengguang. Hu, Zhongliang. Lv National Meteorological Center, China Meteorological Administration,

More information

Query Analyzer for Apache Pig

Query Analyzer for Apache Pig Imperial College London Department of Computing Individual Project: Final Report Query Analyzer for Apache Pig Author: Robert Yau Zhou 00734205 (robert.zhou12@imperial.ac.uk) Supervisor: Dr Peter McBrien

More information

Quantum Computing: it s the end of the world as we know it? Giesecke+Devrient Munich, June 2018

Quantum Computing: it s the end of the world as we know it? Giesecke+Devrient Munich, June 2018 Quantum Computing: it s the end of the world as we know it? Giesecke+Devrient Munich, June 2018 What drives a company s digital strategy in 2020 and beyond? Quantum Computing it s the end of the world

More information

Maryland State Geographic Information Committee

Maryland State Geographic Information Committee MD IMAP 3.0 Next generation of MD imap platform running on Esri s ArcGIS Enterprise Promoting authoritative data sharing among State agencies via Portal for ArcGIS Centralization of GIS infrastructure

More information

Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application areas:

Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application areas: UPR 6905 Internet GIS Homework 1 Yong Hong Guo September 9, 2008 Write a report (6-7 pages, double space) on some examples of Internet Applications. You can choose only ONE of the following application

More information

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN

International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February ISSN International Journal of Scientific & Engineering Research, Volume 7, Issue 2, February-2016 9 Automated Methodology for Context Based Semantic Anomaly Identification in Big Data Hema.R 1, Vidhya.V 2,

More information

Analysis of Software Artifacts

Analysis of Software Artifacts Analysis of Software Artifacts System Performance I Shu-Ngai Yeung (with edits by Jeannette Wing) Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213 2001 by Carnegie Mellon University

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

A Tale of Two Erasure Codes in HDFS

A Tale of Two Erasure Codes in HDFS A Tale of Two Erasure Codes in HDFS Dynamo Mingyuan Xia *, Mohit Saxena +, Mario Blaum +, and David A. Pease + * McGill University, + IBM Research Almaden FAST 15 何军权 2015-04-30 1 Outline Introduction

More information

PI SERVER 2012 Do. More. Faster. Now! Copyr i g h t 2012 O S Is o f t, L L C. 1

PI SERVER 2012 Do. More. Faster. Now! Copyr i g h t 2012 O S Is o f t, L L C. 1 PI SERVER 2012 Do. More. Faster. Now! Copyr i g h t 2012 O S Is o f t, L L C. 1 AUGUST 7, 2007 APRIL 14, 2010 APRIL 24, 2012 Copyr i g h t 2012 O S Is o f t, L L C. 2 PI Data Archive Security PI Asset

More information

The Geo Web: Enabling GIS on the Internet IT4GIS Keith T. Weber, GISP GIS Director ISU GIS Training and Research Center.

The Geo Web: Enabling GIS on the Internet IT4GIS Keith T. Weber, GISP GIS Director ISU GIS Training and Research Center. The Geo Web: Enabling GIS on the Internet IT4GIS Keith T. Weber, GISP GIS Director ISU GIS Training and Research Center In the Beginning GIS was independent The GIS analyst or manager was typically a oneperson

More information

Why GIS & Why Internet GIS?

Why GIS & Why Internet GIS? Why GIS & Why Internet GIS? The Internet bandwagon Internet mapping (e.g., MapQuest) Location-based services Real-time navigation (e.g., traffic) Real-time service dispatch Business Intelligence Spatial

More information

Multi-key Hierarchical Identity-Based Signatures

Multi-key Hierarchical Identity-Based Signatures Multi-key Hierarchical Identity-Based Signatures Hoon Wei Lim Nanyang Technological University 9 June 2010 Outline 1 Introduction 2 Preliminaries 3 Multi-key HIBS 4 Security Analysis 5 Discussion 6 Open

More information

Volume Editor. Hans Weghorn Faculty of Mechatronics BA-University of Cooperative Education, Stuttgart Germany

Volume Editor. Hans Weghorn Faculty of Mechatronics BA-University of Cooperative Education, Stuttgart Germany Volume Editor Hans Weghorn Faculty of Mechatronics BA-University of Cooperative Education, Stuttgart Germany Proceedings of the 4 th Annual Meeting on Information Technology and Computer Science ITCS,

More information

Compensation Planning Application

Compensation Planning Application Compensation Planning Application Why Physician Compensation? More and more organizations are formally aligning with physicians. These organizations require large support structures to effectively manage

More information