Update: Since publishing this blog entry, our 50,000 core CycleCloud utility supercomputer has gotten great coverage by BusinessWeek, The Register, the NY Times, the Wall Street Journal's CIO Report, Ars Technica, The Verge, among many others. And now it would run for $750/hr with the AWS spot pricing as of 6/22/2012! Click here to contact us for more information.

By now, we've shown that our software is capable of spinning up cloud computing environments that run at massive scale and produce real scientific results. After some of our previous efforts, we realized we were onto something with the CycleCloud Cloud HPC and Utility Supercomputing concept. However, even we underestimated the scale at which researchers would want to run and the scope of the research this would impact.

Among the requests were some from a leader in computational chemistry research, Schrodinger. In collaboration with Nimbus Discovery, they needed to virtually screen 21 million molecule conformations, more than ever before, against one possible cancer target using their leading docking application, Glide. And they wanted to do it using a higher-accuracy mode early in the process, which hadn't been feasible before because it is so compute-intensive!

This is exactly what we did with our latest 50,000 core utility supercomputer that CycleCloud provisioned on Amazon Web Services, code-named Naga. And Schrodinger/Nimbus got useful results they wouldn't have seen without utility supercomputing. We will describe how we accomplished this below and in future articles and blog posts.

From a scale perspective, the most revolutionary concept implemented for Naga was scaling out all the components of an HPC environment. In our previous megaclusters, we performed a great deal of optimization to scale one scheduler to handle as many cores (slots) as possible. With Naga, it was clear that we needed to scale out multiple aspects of the system so we could use capacity wherever we found it.

We ran Naga across each of the 7 regions that AWS currently supports, scaling-out all supporting systems of a cluster (scheduling, software configuration, etc.) so we could use idle capacity wherever we found it. However, the magic really happened when we layered CycleServer on top of Naga and allowed our revolutionary new job submission algorithm to intelligently dole out work to each region based on real-time measurements from that region. Using this architecture, we had built ourselves a secured, automated 50,000 core supercomputer in under two hours using AWS infrastructure.
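To make the idea concrete, here is a minimal sketch of region-aware dispatch under stated assumptions: the per-region measurement fields (idle slots, local queue depth) and the plan_dispatch function are hypothetical illustrations of the concept, not CycleServer's actual algorithm or API.

```python
# A minimal sketch of region-aware dispatch: weight new work toward regions
# with more idle slots and shallower local queues. The measurement fields and
# this function are hypothetical illustrations, not CycleServer's real algorithm.

def plan_dispatch(measurements, jobs):
    """Split a batch of jobs across regions in proportion to a simple
    weight: idle slots discounted by the depth of the local queue."""
    weights = {region: m["idle_slots"] / (1.0 + m["queue_depth"])
               for region, m in measurements.items()}
    total = sum(weights.values()) or 1.0

    plan, cursor = {}, 0
    for region, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        share = int(round(len(jobs) * weight / total))   # this region's slice
        plan[region] = jobs[cursor:cursor + share]
        cursor += share
    if cursor < len(jobs):                               # rounding remainder
        best = max(weights, key=weights.get)
        plan[best] = plan[best] + jobs[cursor:]
    return plan

# Example with made-up measurements for three of the seven regions in the run:
measurements = {
    "us-east":      {"idle_slots": 12000, "queue_depth": 300},
    "eu-west":      {"idle_slots": 6000,  "queue_depth": 50},
    "ap-northeast": {"idle_slots": 2000,  "queue_depth": 10},
}
batch = ["conformer-%d" % i for i in range(10000)]
print({region: len(jobs) for region, jobs in plan_dispatch(measurements, batch).items()})
```

In the real system the measurements were refreshed continuously, so a region that filled up or slowed down would automatically receive a smaller share of the next batch.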

The amazing thing about this cluster was that it was launched via CycleCloud by the end user, and the jobs were submitted to the cluster through CycleServer by the end user as well. All of the automation work that went into the Naga megacluster meant that Cycle engineers largely got to sit back and monitor the workload as it progressed.

Here are the cold hard stats:

Resource                     Count
Total instances              6,742
- c1.xlarge instances        5,435
- c1.medium instances        779
- cc1.4xlarge instances      297
- cc2.8xlarge instances      231
Cores                        51,132
RAM                          58.78 TB
AWS regions                  All 7 (us-east, us-west1, us-west2, eu-west, sa-east, ap-northeast, ap-southeast)
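As a trivial sanity check, the per-instance-type counts in the table do sum to the instance total:

```python
# Per-type instance counts from the table above; they sum to the 6,742 total.
instances = {"c1.xlarge": 5435, "c1.medium": 779, "cc1.4xlarge": 297, "cc2.8xlarge": 231}
print(sum(instances.values()))  # 6742
```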

Here is a map of the compute resources harnessed during the Naga run:

Here is an analysis of the staggering amount of compute power that Naga provided:

Metric                   Count
Compute hours of work    109,927 hours
Compute days of work     4,580 days
Compute years of work    12.55 years
Ligand count             ~21 million ligands
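The day and year figures are straight unit conversions of the compute-hours number; a quick check, assuming 24-hour days and 365-day years:

```python
compute_hours = 109927
print(compute_hours / 24.0)          # ~4,580 compute days
print(compute_hours / 24.0 / 365.0)  # ~12.55 compute years
```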
  

 

THE REALLY IMPORTANT NUMBERS!

The most important things about this run are these two numbers:
Time to run the impossible science: ~3 hours
Cost at peak: $4,828.85/hour!

You read that right: more than $20 million worth of infrastructure was used for $4,828 per hour. Wow!
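For a rough sense of scale, dividing the peak hourly cost by the core count (both numbers from this post; this back-of-the-envelope figure ignores ramp-up and assumes every core was busy) works out to roughly 9.4 cents per core-hour:

```python
peak_cost_per_hour = 4828.85        # USD/hour at peak, from this run
cores = 51132
print(peak_cost_per_hour / cores)   # ~0.094 USD per core-hour at peak
```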

As always, we used a large suite of CycleServer visualizations to monitor the run. These include CycleCloud compute provisioning, Condor job scheduling visualizations, OS-level performance statistics, CycleServer-Grill to monitor Opscode Chef, and others that we'll be talking about in the coming weeks. Here is a sampling of some of the dashboards we were monitoring in real time:


As you can see, CycleCloud was able to launch an outrageous number of machines in AWS, our Chef environment converged all of them into useful workers, CycleServer was able to dole out jobs to the Condor schedulers, the Condor schedulers were able to quickly fill the compute resources with jobs, and we were able to monitor the overall utilization of all 50,000 cores working in unison.
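As a rough illustration of the kind of aggregation behind that utilization view, one could poll each regional HTCondor collector with the standard condor_status tool and tally claimed versus total slots. The collector hostnames below are placeholders, and this sketch is not CycleServer's actual monitoring code.

```python
# Sketch: tally claimed vs. total slots across several regional HTCondor
# pools using condor_status. Hostnames are placeholders; this is an
# illustration, not CycleServer's actual monitoring implementation.
import subprocess

COLLECTORS = ["collector.us-east.example.com", "collector.eu-west.example.com"]

def slot_states(collector):
    """Return the State attribute (e.g. 'Claimed', 'Unclaimed') of every
    slot reported by the given collector."""
    out = subprocess.run(
        ["condor_status", "-pool", collector, "-af", "State"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

claimed = total = 0
for collector in COLLECTORS:
    states = slot_states(collector)
    total += len(states)
    claimed += sum(1 for state in states if state == "Claimed")

print("%d/%d slots claimed (%.1f%% utilization)"
      % (claimed, total, 100.0 * claimed / max(total, 1)))
```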

 

THE RESULTS

 

As part of this run, Schrodinger, in collaboration with Nimbus Discovery, has already identified compounds they would not have seen if it weren't for CycleCloud Utility Supercomputing! Now they're acquiring the compounds and will be testing them in the lab.

And all of this was possible for under $5,000 an hour in AWS fees. We're talking about a supercomputer that would cost tens of millions of dollars to purchase, power, cool, and maintain. If that isn't a transformational capability, we don't know what is!

 

There is no industry that won't be impacted by this technology, from insurance to life science, manufacturing to computer-aided product design, aerospace to energy. So, if you have a big problem that you'd like to solve in a few hours instead of a few years, give us a call. If you don't currently have a big problem, sit back and think about the big problems facing your organization that could really move it forward, but that you might never tackle because they would take too much computation. It's time to think about research differently. We'd love to introduce you to Utility Supercomputing using CycleCloud.
