One Optimized I/O Configuration per HPC Application: Leveraging I/O Configurability of Amazon EC2 Cloud
Mingliang Liu, Jidong Zhai, Yan Zhai (Tsinghua University)
Xiaosong Ma (North Carolina State University / Oak Ridge National Laboratory)
Wenguang Chen (Tsinghua University)
APSys 2011, July 12
Outline
1 Introduction
2 Storage System Options in the Amazon EC2 CCI
3 Preliminary Application Results
4 Conclusion
Background
I/O has become the bottleneck for many HPC applications
  Intensive I/O operations and high concurrency
  One-size-fits-all I/O configurations
There is a trend to migrate HPC applications from traditional platforms to the cloud
The cloud provides tremendous flexibility in configuring the I/O system
  Fully controlled virtual machines
  Easily deployed user scripts
  Multiple types of low-level devices
  Online device acquisition and migration
Motivation
The Problem: Can we exploit the I/O configurability of the cloud for HPC applications?
Configurability lies in:
  Setting up a specific file system at startup
  Exploring different types of low-level devices
  Tuning the file system's internal parameters
Challenges:
  The best choice is highly workload-dependent
  Trade-off between efficiency and cost-effectiveness
Amazon EC2 CCI
CCI: Cluster Compute Instance, Amazon's solution to HPC in the cloud
  Quad-core Intel Xeon X5570 CPUs with 23 GB memory
  Interconnected by 10 Gigabit Ethernet
  Amazon Linux AMI (RedHat-family OS) with Intel MPI
Storage options:
  Local block storage (ephemeral), 2 x 800 GB
  Elastic Block Store (EBS), attached as block storage devices
  Simple Storage Service (S3), key-value based object storage
Outline
1 Introduction
2 Storage System Options in the Amazon EC2 CCI
3 Preliminary Application Results
4 Conclusion
File System Selection
Different I/O access patterns and concurrency levels call for different kinds of file system: shared vs. parallel
  NFS is sufficient for low I/O demands and is simple to deploy
  A parallel file system (e.g., PVFS or Lustre) supports large, shared file writes and scales up well
Easy to choose and set up: 118 lines of bash for NFS, 173 lines of bash for PVFS (a minimal NFS sketch follows)
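As a rough illustration of how little glue is needed, here is a minimal sketch of the NFS portion, assuming a RedHat-family instance; the device name, export path, and server address are placeholders, and the real deployment scripts add error handling and option tuning.

```bash
# On the I/O server instance: format an ephemeral disk and export it over NFS.
mkfs.ext3 /dev/sdb                       # ephemeral device name is an assumption
mkdir -p /export/nfs
mount /dev/sdb /export/nfs
echo "/export/nfs *(rw,async,no_root_squash)" >> /etc/exports
exportfs -a
service nfs start                        # RedHat-family init script

# On each compute instance: mount the shared directory.
mkdir -p /mnt/nfs
mount -t nfs SERVER_IP:/export/nfs /mnt/nfs
```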
Device Selection Considerations
Devices differ in their levels of abstraction and access interfaces:
  S3: off-the-shelf, designed for Internet and database applications; no POSIX interface
  Ephemeral: free; non-persistent, only two disks per instance
  EBS: persistent beyond the instance lifetime, more disks available; charged per use
The choice depends on the needs of individual applications
File System Internal Parameters
  Sync mode: NFS sync vs. async write modes
  Device number: combining multiple disks into a software RAID0
  I/O server number: NFS has a single-server bottleneck; PVFS can employ many I/O servers
  Metadata distribution: PVFS can distribute metadata across multiple servers
  I/O server placement: part-time vs. dedicated I/O servers
  Data striping: striping factor and unit size
Again, the choice depends on the needs of individual applications (the sync/async toggle is sketched below)
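For example, the sync-mode knob is a single word in the NFS export options; a hedged sketch, reusing the placeholder export path from the earlier NFS example:

```bash
# Switch the NFS write mode between sync (safe, slower) and async (faster,
# data may be lost if the server crashes). Path and client spec are placeholders.
echo "/export/nfs *(rw,sync,no_root_squash)" > /etc/exports
# echo "/export/nfs *(rw,async,no_root_squash)" > /etc/exports
exportfs -ra    # re-export with the new options
```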
Dedicated NFS, Ephemeral Disks
[Diagram: compute instances connected over 10 Gigabit Ethernet to an NFS server with ephemeral disks]
One NFS I/O server deployed on a dedicated instance, mounting two ephemeral disks.
Part-time NFS Mounting Ephemeral Disks
[Diagram: compute instances with ephemeral disks, one of them also acting as the NFS server, connected over 10 Gigabit Ethernet]
One NFS I/O server deployed on a part-time (compute) instance, mounting two ephemeral disks.
Dedicated NFS Mounting EBS Disks
[Diagram: compute instances connected over 10 Gigabit Ethernet to an NFS server with EBS disks]
One NFS I/O server deployed on a dedicated instance, mounting 8 EBS disks combined into a RAID0.
1 Dedicated PVFS I/O Server
[Diagram: compute instances connected over 10 Gigabit Ethernet to one PVFS server with ephemeral disks]
One PVFS I/O server deployed on a dedicated instance.
2 Dedicated PVFS I/O Servers
[Diagram: compute instances connected over 10 Gigabit Ethernet to two PVFS servers]
Two PVFS I/O servers deployed on two dedicated instances.
4 Dedicated PVFS I/O Servers
[Diagram: compute instances connected over 10 Gigabit Ethernet to four PVFS servers]
Four PVFS I/O servers deployed on dedicated instances, leaving the server instances under-utilized.
4 Part-time PVFS I/O Servers
[Diagram: four compute instances, each also running a PVFS server on its ephemeral disks, connected over 10 Gigabit Ethernet]
Four PVFS I/O servers deployed on the compute instances, working part-time.
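A hedged sketch of what bringing up such a part-time PVFS server involves, following the standard PVFS2 quick-start steps; the hostname, default port 3334, file-system name, and kernel-module path are assumptions:

```bash
# On each compute instance that doubles as a PVFS I/O server:
pvfs2-genconfig /etc/pvfs2-fs.conf     # answer prompts: protocol, port, I/O and metadata servers
pvfs2-server /etc/pvfs2-fs.conf -f     # create the storage space (first run only)
pvfs2-server /etc/pvfs2-fs.conf        # start the server daemon

# On every compute instance, mount the file system through the kernel client:
insmod /path/to/pvfs2.ko               # module path is a placeholder
pvfs2-client
mount -t pvfs2 tcp://node0:3334/pvfs2-fs /mnt/pvfs
```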
Options: Sync Mode and Device
[Figure: write bandwidth (MB/s) vs. block size (16 MB to 4 GB) for EBS-sync, Ephemeral-sync, EBS-async, and Ephemeral-async]
Measured with the IOR benchmark, 16 processes on 2 nodes
Observations:
  No obvious difference between EBS and ephemeral disks
  Async mode prefers small block sizes
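For reference, an IOR invocation of roughly this shape produces the measurement above; this is a sketch, and the transfer size, output path, and MPI launch details are assumptions:

```bash
# 16 MPI processes writing with the POSIX API to the shared mount.
# -b (block size per process) was varied from 16 MB to 4 GB across runs.
mpirun -np 16 ./IOR -a POSIX -w -b 512m -t 4m -o /mnt/nfs/ior.dat
```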
NFS Write Bandwidth
[Figure: write bandwidth (MB/s) vs. number of processes (8 to 256) for 1, 2, 4, and 8 disks]
Combining 1, 2, 4, or 8 EBS disks into a software RAID0
The RAID0 does not scale well, possibly because of the virtualization layer
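A software RAID0 of this kind can be built with mdadm; a minimal sketch, assuming four EBS volumes attached as /dev/sdf through /dev/sdi (device names are placeholders):

```bash
# Combine attached EBS volumes into a single software RAID0 array,
# then format and mount it as the NFS export directory.
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
mkfs.ext3 /dev/md0
mkdir -p /export/nfs
mount /dev/md0 /export/nfs
```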
Outline
1 Introduction
2 Storage System Options in the Amazon EC2 CCI
3 Preliminary Application Results
4 Conclusion
BTIO: CLASS=C, SUBTYPE=full
[Figure: total execution time (s) vs. number of processes (16 to 121) for NFS-Dedicated, NFS-Parttime, PVFS-1-Server, PVFS-2-Server, PVFS-4-Server, and PVFS-4-Server-part]
PVFS outperforms the NFS configurations in all cases
PVFS scales up as more I/O servers are added
Part-time I/O servers provide quite good performance
BTIO: Total Cost Analysis
Cost calculation: Cost($) = Num computing instances × Run time (s) × 1.6 / 3600
[Figure: total cost ($) vs. number of processes (16 to 121) for the same six configurations]
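A tiny sketch of that arithmetic, where the instance count and run time are placeholder values and $1.60 is the per-instance-hour price used in the formula:

```bash
# Cost($) = instances * runtime_seconds * 1.6 / 3600
instances=8       # placeholder: number of compute instances
runtime=120       # placeholder: measured run time in seconds
echo "$instances * $runtime * 1.6 / 3600" | bc -l    # -> ~0.43 dollars
```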
POP: Parallel Ocean Program
Process 0 carries out all I/O via the POSIX interface, which is very different from BTIO
POP does not scale well on EC2, due to its heavy communication with small messages
[Figure: total execution time (s) vs. number of processes (16 to 196) for NFS-parttime, PVFS-2-Server, PVFS-4-Server, and PVFS-4-Server-part]
Outline
1 Introduction
2 Storage System Options in the Amazon EC2 CCI
3 Preliminary Application Results
4 Conclusion
Future Work
The problem: configure the I/O system for an HPC application automatically
[Diagram: HPC applications → analyzing I/O patterns → I/O demands → mapping model → optimized configurations (file system, devices, write mode, number and placement of I/O servers, other I/O options)]
Conclusion
The cloud enables users to build per-application I/O systems
Our preliminary results hint that an HPC application behaves differently under different I/O system configurations in the cloud
The best configuration for an application depends on its I/O access pattern and concurrency
Trade-off: cost vs. efficiency
Thank you!