When we last spoke about GPUs on our blog, it was during the Supercomputing 2010 conference, when AWS announced its new cg1.4xlarge instance type. The response to our benchmarks of the Amazon CG1 instances at SC 2010 was phenomenal. As a quick review, cg1.4xlarge is the typical AWS “Cluster Compute” instance extended with a pair of Nvidia Tesla M2050 GPUs, 22 GB of memory, and a 10 Gbps Ethernet interconnect.

Since we first published our Amazon GPU on CycleCloud benchmarks, the phone has been ringing off the hook at Cycle with interest in automatically creating clusters with shared file systems using CG1, high-memory, and high-CPU instance types. As an example, we’ve created a 32-node / 64-GPU cluster that ran molecular dynamics applications in 1 month instead of 5 months thanks to the Tesla GPUs. When combined with the 8 TB filer, this particular cluster costs less than $100 per hour to operate, and it took about 10-15 minutes to spin up initially.

Given all this experience automating clusters, we thought it was high time we shared some of what we've found. First, we'll cover the whats and whys of GPU clusters on the cloud, then get into some data about our experience, and finally cover our costs in detail.

Overview

As a quick background, Amazon’s EC2 offerings now include the cg1.4xlarge instance type: the typical Cluster Compute Instance (CCI) extended with a pair of Nvidia M2050 GPUs. Access to the GPUs is through the standard CUDA toolkit installed on top of a CentOS 5 release. From an application development perspective, nothing is different: you write your applications to access the GPU through the CUDA libraries, or use one of the many applications that already support GPUs, and they run on the EC2 instances the same as on a bare-metal OS installation. Our benchmarks show that, from a performance perspective as well, everything runs at speeds similar to bare metal.
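To make that concrete, here's a minimal sketch of the kind of CUDA code that runs unchanged on a cg1.4xlarge instance. The kernel and names below are our own illustration, not taken from any customer workload:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Trivial kernel: each thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);  // expect 3.0

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Compile it with nvcc (for example, nvcc vecadd.cu -o vecadd) on the instance, and it behaves exactly as it would on an in-house Tesla machine.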

Why Clusters-as-a-Service for GPUs

A survey of our top GPU instance users has led us to some common reasons clients are using CycleCloud’s GPU clusters for their CUDA workloads:

  • The data center capabilities required to run Nvidia Tesla machines are demanding; power and cooling requirements for these servers are high, though well worth it for the speed-ups.
  • Months-long procurement processes for buying hardware make it cost- and time-prohibitive to grow large, internal deployments of these machines.
  • Some clients prefer to offload the complexity and costs of deploying and maintaining these machines onto cloud providers, while others simply don’t have the physical facilities to deploy them at the desired scale.

With our CycleCloud clusters, the cloud providers absorb these deployment limits, while we take care of creating, securing, and updating the HPC environments so users can just run their GPU-accelerated applications.

We automate OS image maintenance, security patching, instance management, and spinning GPU nodes up and down on demand, so clients can run critical applications easily. CycleCloud dynamically expands and contracts these clusters to handle peak demands at a fraction of the cost of owning the hardware and deploying it in a traditional, on-site approach to supercomputing.

This on-demand approach to accessing GPU-enabled servers cuts deployment time from months to minutes and lowers the risk associated with migrating computation-heavy tasks to GPUs. Unlike general-purpose CPU hardware, idle GPU hardware cannot easily be re-purposed for other computing tasks within an organization, so there is real risk in buying it outright.

With Amazon EC2’s GPU nodes, no large, upfront investment needs to be made to procure hardware in quantities large enough to entice your developers to target this approach when writing code. There’s no long-term commitment to a single GPU model or architecture either: just like CPUs, as new ones become available, you can take advantage of them.

So, with GPU runs increasing on CycleCloud, we have some observations about cloud-based GPU computing that we think are worth sharing.

Stability

With one client's 64-GPU cluster, the hardware, drivers, and code have converged and are now used in production. Early on, new applications using the Tesla instances required involvement from CycleCloud engineers to reach a production-ready state. We directly investigated a range of issues: kernel drivers, getting the ECC flag recognized on these nodes, integration with AWS to handle nodes that wouldn't run a particular application, customers’ application code, and workflows being ported to the cloud.
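As a small illustration of the kind of sanity check involved, a short CUDA runtime query (our own sketch, not Cycle's internal tooling) can confirm that the ECC flag is visible on a node's GPUs:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Query every visible GPU and report whether ECC memory is enabled.
    int main() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            fprintf(stderr, "No CUDA devices found\n");
            return 1;
        }
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("GPU %d: %s, ECC %s\n",
                   d, prop.name, prop.ECCEnabled ? "enabled" : "disabled");
        }
        return 0;
    }

On a healthy cg1.4xlarge node this should report two Tesla M2050s; nvidia-smi gives the same information from the shell.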

This is now a piece of cake. At this point, all the major issues with the GPU driver/CUDA stack have been ironed out, and moving GPU-accelerated code into the cloud is straightforward.

Data

Because the CG1 environments can achieve significant speedups in job run time (50x-60x in some cases), much more computational work is being done in comparable time frames, consuming and producing far more data. CycleCloud’s shared filesystems have been scaled up to meet these demands.

As a result, we now offer shared filesystems in the 10s of TB range with a variety of different options for the ingress and egress of data. This scale is critical for workloads that keep these GPUs busy processing inputs and generating results for days on end.

Performance

The performance of GPU-enabled EC2 instances has been nothing short of superb. In all cases, GPU-accelerated applications running on the cloud-based instances have performed at or near the speeds we see on bare-metal machines in in-house data centers. Of course, this is no surprise, but it’s always nice to have real-world application confirmation of what you’ve benchmarked. We reported on this excellent performance when the AWS EC2 GPU instances launched.
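If you want to reproduce this kind of comparison yourself, the standard CUDA event-timing pattern is all it takes. Here's a sketch (our own example, with an arbitrary busy-work kernel), which you can run both on a cg1.4xlarge instance and on an in-house Tesla box and compare:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Arbitrary compute-bound kernel, just to have something to time.
    __global__ void busyKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            for (int k = 0; k < 1000; ++k) v = v * 1.000001f + 0.5f;
            x[i] = v;
        }
    }

    int main() {
        const int n = 1 << 22;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));

        // Bracket the launch with CUDA events to get GPU-side timing.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busyKernel<<<(n + 255) / 256, 256>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }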

Application Portability

We currently have a 100% success rate porting GPU-accelerated code from internal environments into Amazon EC2. With several conversions completed, our methods for migrating applications to the cloud have become quite streamlined and efficient. Considering the complexity of the drivers, toolkits, libraries, and compilers involved, we’re very excited about this! End users can focus their time developing, compiling, and optimizing their code on internal workstations, and Cycle can efficiently migrate their efforts to the cloud.

Costs Defined

Below are the costs for the 64-GPU Tesla cluster with an 8 TB filer:

    Item             Price (AWS + CycleCloud)    Quantity                       Cost/hr
    CG1 instances    $2.52/instance-hr           32 instances / 64 Tesla GPUs   $80.64
    Filer            $0.138/TB-hr (approx.)      8 TB                           $1.11
    TOTAL per hour                                                              $81.75

Based upon the benchmarks from earlier, this yields a ~62 teraflop cluster for under $100/hr. Not too shabby, if we do say so ourselves.
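For the curious, the arithmetic behind the table is straightforward: 32 instances × $2.52/instance-hr = $80.64/hr for compute, plus roughly $1.11/hr for the 8 TB filer, giving a total of about $81.75/hr.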

Conclusion

GPU-based instances are readily available in the cloud and through CycleCloud cluster provisioning. Amazon and Nvidia have made great progress stabilizing the GPU-accelerated compute environments in EC2, yielding drivers and systems that just plain work. We can now offer these powerful GPU environments with a very consistent user experience, matching our more traditional CPU-based clusters.

Clusters spin up in 10-15 minutes instead of the months a hardware purchase takes. If you have an existing GPU workload, we can make it run within EC2. If you have a CPU-bound workload that you would like to migrate to GPU-accelerated methods, we can help. Or if you are just curious about how we do what we do with full clusters, please contact us with any questions.
