A Couple More Nails in the Coffin of the Private Compute Cluster

Update: We're getting an overwhelming response to this entry. If you have questions, come see us at booth #4638 at Supercomputing 2010.
======
Cycle Computing has been in the business of provisioning large-scale computing environments within clouds such as Amazon EC2 for quite some time. In parallel, we have built, supported, and integrated internal computing environments for Fortune 100s, universities, government labs, and SMBs, with clusters of all shapes and sizes. Through work with clients such as JPMorgan Chase, Pfizer, Lockheed Martin, and Purdue University, we have developed a keen sense for which use cases are best suited to internal or external computing. More and more, we see the lines between internal and cloud performance blurring. This is good news for end users who want the flexibility to consume resources both internally and externally.

During the past few years it has been no secret that EC2 has been the best cloud provider for massive-scale but loosely coupled scientific computing environments. Thankfully, many of the workflows we have encountered have performed well within the EC2 boundaries, specifically those built around pleasantly parallel, high-throughput computing.

[Figure: high-throughput computing on EC2]

Still, the AWS approach to virtualization and the available hardware made it difficult to run workloads that require high-bandwidth or low-latency communication among a collection of distinct worker nodes. Many of the AWS machines used CPU technology that, while respectable, was not up to par with the current generation of chip architectures. The result? Certain use cases simply were not a good fit for EC2 and were easily beaten by in-house clusters in the benchmarking we conducted during our research.

All of that changed when Amazon released their Cluster Compute offering. This took the massive flexibility and scale of “classic” EC2 and married it with current-generation HPC sensibilities.

The CPU cores? Blazingly fast Nehalems.

The virtualization technology? Near bare-metal speed HVM architecture.

The interconnect? 10 Gigabit Ethernet, with the added bonus of “placement groups” that automatically situate your compute farm in one tidy spot.

As supercomputing nerds, we were in heaven. Workloads that previously bottlenecked within “classic” EC2 were now running as fast as, or faster than, they did on in-house systems. Now, Amazon has dropped another bombshell on the scientific computing community:

We can now provision Cluster Compute instances with a pair of Nvidia M2050 GPUs.

GPU-accelerated computing, yet another bastion of in-house supercomputing and perhaps the last excuse many users have for continuing to maintain a private cluster in their data center, has fallen.

Do you want the ability to rent tens or even hundreds of these GPU-enabled nodes for as long as you need them and then toss them back? Want GPU-accelerated supercomputing without the floorspace, cooling, electricity, and staffing costs? Yes, please!

SHOC and Awe

So, is it really as good as in-house hardware? In a word, yes. But you don’t have to take our word for it.

When we began early access to the GPU-enabled instances, our primary concern was that the virtualization technology used in EC2 would interfere with the acceleration, hampering throughput through bus speeds, memory bandwidth, hypervisor interactions, and other virtualization bottlenecks. We ran a well-known benchmark suite, SHOC (the Scalable HeterOgeneous Computing benchmark suite), created by Jeremy Meredith at Oak Ridge National Laboratory’s Future Technologies group, both on non-virtualized in-house hardware owned by a well-known client and on EC2 Cluster Compute hardware with GPU acceleration. Both systems used dual Nvidia M2050 GPUs.

So SHOC is our benchmark, but we still need an Awe-some HPC cluster to run the tests across several machines. At Cycle, we built CycleCloud, the first Cluster as a Service on EC2, back in 2007, so you can now create fully secured clusters in the cloud, with GPUs, in 10 minutes instead of weeks or months.

As the live video below demonstrates, Cycle has created Condor/Torque/SGE clusters on Amazon EC2 using the new GPU instances, with shared file-systems of arbitrary size, encryption, and images that, through CycleCloud, we patch and maintain so you don’t have to!

Demo of starting GPU Clusters on Amazon EC2:


With a GPU cluster in hand, we can run tests on many machines concurrently to get statistics on the performance of these new nodes! So we gave the SHOC tests and our Awe-some CycleCloud GPU clusters a spin, with the following results:

PCIe Bandwidth

Repeated transfers of data of different sizes from the host (H) to the GPU device (D) and back, to measure the overall PCIe bandwidth available between the GPU and the motherboard. Note that the capped PCIe speeds on the client hardware are likely due to motherboard and chipset differences.
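
To make the measurement concrete, here is a minimal sketch of the kind of host-to-device copy timing a bus-speed test performs, using CUDA events. This is not SHOC’s actual code, and the transfer size is illustrative only.

```cuda
// Minimal sketch: time a pinned host-to-device copy with CUDA events to
// estimate effective PCIe bandwidth. Illustrative only, not SHOC's code.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;                   // 256 MB transfer (illustrative)
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);            // pinned host memory for full PCIe speed
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);   // H -> D
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));   // bytes moved per second, in GB/s

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```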

Max FLOPS

Determines the peak FLOPS by executing a kernel composed of a large number of multiply and multiply-add operations.
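
For readers unfamiliar with this style of test, here is a hedged sketch of what such a kernel looks like: a long chain of dependent multiply-adds whose result is written out so the compiler cannot remove the work. The loop count and constants are illustrative, not SHOC’s.

```cuda
// Sketch of a peak-FLOPS style kernel: a long chain of multiply-add operations
// per thread. Illustrative only, not SHOC's actual MaxFlops kernel.
__global__ void madd_peak(float *out, float seed, int iters) {
    float a = seed + threadIdx.x;
    float b = 0.9999f;
    for (int i = 0; i < iters; ++i) {
        a = a * b + 0.5f;   // each line is one multiply plus one add
        a = a * b + 0.5f;
        a = a * b + 0.5f;
        a = a * b + 0.5f;
    }
    // Write the result so the compiler cannot optimize the arithmetic away.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;
}
// FLOPS estimate: total_threads * iters * 8 floating point ops / elapsed kernel time.
```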

Memory Bandwidth

Memory bandwidth measurements for global, constant and shared memory.
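
As an illustration of the global-memory case, a bandwidth kernel is essentially a streaming, coalesced read over a large array. The sketch below is a simplified stand-in, not SHOC’s actual kernels.

```cuda
// Sketch of a global-memory read bandwidth kernel: a grid-stride streaming read
// with coalesced access. Simplified stand-in, not SHOC's kernels.
__global__ void read_global(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = idx; i < n; i += stride)   // consecutive threads touch consecutive words
        sum += in[i];
    out[idx] = sum;                          // keep the reads live
}
// Bandwidth estimate: n * sizeof(float) bytes read / kernel time from cudaEvent timing.
```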

FFT

Performance measurements on a 1-D fast Fourier transform and inverse transform. FFT size was 256 MBytes.
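
For context, the forward/inverse round trip being measured looks roughly like the cuFFT-based sketch below; this is just an illustration of the operation, not SHOC’s implementation.

```cuda
// Sketch of a 1-D complex FFT followed by its inverse using cuFFT.
// Illustrative of the operation being timed, not SHOC's implementation.
#include <cufft.h>
#include <cuda_runtime.h>

void fft_roundtrip(cufftComplex *d_data, int n) {
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                  // one 1-D transform of length n
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);    // forward transform, in place
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);    // inverse transform, in place
    cudaDeviceSynchronize();                              // make sure work finishes before timing stops
    cufftDestroy(plan);
}
```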

MD

Performance measurements for a simple n-body pairwise computation. Dataset size was 73,728 atoms.
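
A pairwise n-body computation assigns one atom per thread and accumulates interactions with every other atom. The sketch below shows the shape of such a kernel with a placeholder force term; it is not SHOC’s MD code.

```cuda
// Sketch of an n-body pairwise kernel: one thread per atom, accumulating a
// simplified interaction with every other atom. Placeholder force term;
// not SHOC's MD code.
__global__ void pairwise(const float4 *pos, float4 *force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pi.x;
        float dy = pos[j].y - pi.y;
        float dz = pos[j].z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;   // softened to avoid divide-by-zero
        float inv_r6 = 1.0f / (r2 * r2 * r2);
        float s = inv_r6 * (inv_r6 - 0.5f) / r2;          // simplified Lennard-Jones-style term
        fx += s * dx; fy += s * dy; fz += s * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}
```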

Reduction

Performance measurements for a large sum reduction operation using single precision floating point data. Data size was 64 MBytes.
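
The core of a GPU sum reduction is a shared-memory tree within each block, followed by a second pass over the per-block partial sums. A minimal sketch, not SHOC’s tuned kernel, follows.

```cuda
// Minimal block-level sum reduction in shared memory. Each block reduces its
// slice of the input; a second pass combines the per-block partial sums.
// Not SHOC's tuned reduction kernel.
__global__ void block_sum(const float *in, float *block_out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tid < offset) s[tid] += s[tid + offset];   // pairwise tree reduction
        __syncthreads();
    }
    if (tid == 0) block_out[blockIdx.x] = s[0];
}
// Launch: block_sum<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);
```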

Scan

Performance measurements for an exclusive parallel prefix sum of single precision floating point data. Data size was 64 MBytes.
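
For readers who want to try the same operation without writing kernels by hand, the Thrust library provides an exclusive scan directly. The sketch below is an illustration only, not SHOC’s scan implementation.

```cuda
// Sketch of an exclusive prefix sum using Thrust rather than hand-written
// kernels. Illustrative only; SHOC ships its own scan implementation.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

void exclusive_prefix_sum(thrust::device_vector<float> &data) {
    // In place: output[i] = sum of data[0..i-1], with output[0] = 0.
    thrust::exclusive_scan(data.begin(), data.end(), data.begin());
}
```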

SGEMM

Performance measurements for device versions of the SGEMM BLAS routine, on a square matrix. Data set size was 16 kBytes.
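
One common way to run a device SGEMM is a single call into cuBLAS. The sketch below shows that call for square, column-major matrices already resident on the GPU (handle creation and data transfer omitted); it is not necessarily how SHOC implements the test.

```cuda
// Sketch of a device SGEMM via cuBLAS: C = alpha * A * B + beta * C for
// square, column-major matrices already on the GPU. Handle creation and
// data transfers omitted for brevity.
#include <cublas_v2.h>

void sgemm_square(cublasHandle_t handle, int n,
                  const float *d_A, const float *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n,
                d_B, n,
                &beta, d_C, n);
}
```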

Sort

Performance measurements for a radix sort on unsigned integer key-value pairs. Data set size was 96 MBytes.
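
A hedged sketch of an equivalent key-value sort using Thrust, which uses a radix sort for primitive integer keys; this is not SHOC’s implementation.

```cuda
// Sketch of a key-value sort on the GPU using Thrust, which applies a radix
// sort for primitive integer keys. Illustrative only, not SHOC's implementation.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sort_pairs(thrust::device_vector<unsigned int> &keys,
                thrust::device_vector<unsigned int> &values) {
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
}
```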

Stencil 2D

Performance measurements for a 2D, 9-point single precision stencil computation. Reported times include PCIe transfer.
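
A 9-point stencil recomputes each interior grid point from its 3x3 neighborhood. The sketch below uses arbitrary placeholder weights and omits the PCIe transfer that the reported times include; it is not SHOC’s kernel.

```cuda
// Sketch of a 9-point 2D stencil: each interior point is recomputed from its
// 3x3 neighborhood. Weights are arbitrary placeholders; not SHOC's kernel.
__global__ void stencil9(const float *in, float *out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= nx - 1 || y >= ny - 1) return;   // leave the border untouched
    float center  = in[y * nx + x];
    float edges   = in[y * nx + x - 1] + in[y * nx + x + 1]
                  + in[(y - 1) * nx + x] + in[(y + 1) * nx + x];
    float corners = in[(y - 1) * nx + x - 1] + in[(y - 1) * nx + x + 1]
                  + in[(y + 1) * nx + x - 1] + in[(y + 1) * nx + x + 1];
    out[y * nx + x] = 0.5f * center + 0.1f * edges + 0.025f * corners;
}
```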

Triad

Performance measurements for a large vector multiply-add operation, a[i] = b[i] + s * c[i]. Reported times include PCIe transfer. This is an OpenCL version of the STREAM triad benchmark.
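
The STREAM triad itself is a one-line kernel. SHOC times an OpenCL version, but the CUDA equivalent below (shown in CUDA to match the other sketches) conveys the operation.

```cuda
// Sketch of the STREAM triad kernel, a[i] = b[i] + s * c[i]. SHOC times an
// OpenCL version with PCIe transfers included; this CUDA equivalent just
// illustrates the operation itself.
__global__ void triad(float *a, const float *b, const float *c, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + s * c[i];
}
```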

Real World Application Benchmarking

Based on the positive results of our SHOC benchmarking, we approached a Fortune 500 Life Science client and a Finance/Insurance client, both of whom develop and use their own GPU-accelerated software, about running their applications on the GPU-enabled Cluster Compute nodes.

Both applications perform a large number of Monte Carlo simulations over a given set of initial data, and both are pleasantly parallel.
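
To show why this style of workload maps so well to GPUs, here is a hypothetical Monte Carlo kernel using cuRAND: every thread runs its own independent stream of trials (here, estimating pi) with no inter-thread communication. This is an illustration only, not either client’s code.

```cuda
// Hypothetical sketch of a pleasantly parallel Monte Carlo kernel using cuRAND:
// each thread runs independent random trials (estimating pi here) with no
// communication between threads. Not client code.
#include <curand_kernel.h>

__global__ void monte_carlo_pi(unsigned long long seed, int trials_per_thread,
                               unsigned int *hits) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);           // independent random sequence per thread
    unsigned int inside = 0;
    for (int t = 0; t < trials_per_thread; ++t) {
        float x = curand_uniform(&state);
        float y = curand_uniform(&state);
        if (x * x + y * y <= 1.0f) ++inside;    // point landed inside the quarter circle
    }
    hits[id] = inside;                           // sum per-thread counts on the host or in a reduction
}
```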

The results, similar to the SHOC results, were that the EC2 GPU-enabled Cluster Compute nodes performed as well as, or better than, the in-house hardware maintained by our clients.

GPU Clusters in 10 minutes

With this new AWS offering, the line between internal hardware and virtualized, cloud-based hardware for high performance computing using GPUs has indeed been blurred!

CycleCloud provides the lowest-cost, fastest entry point to the new GPU-enabled Cluster Compute instances in EC2. Let us help you benchmark your GPU-accelerated HPC workloads in this new environment within EC2. You’re likely to be pleased.

If you’ve wanted to experiment with GPU-accelerated supercomputing, here is your chance to access state-of-the-art hardware at prices anyone can afford. Risk-free assessment of GPU-accelerated computing is now a reality. Talk to Cycle Computing about our ultra-fast entry point for using these new machines in EC2. A few hours of research could be worth thousands of hours of saved computation time for your workloads.

Can you tell we’re excited about this?

If you are too, please contact us.
