The Challenge: 4096-core Cluster

Back in December 2010, we discussed running a 2048-core cluster using CycleCloud, which was in effect renting a circa-2005 Top 20 supercomputer for two hours. After that run, a client gave us a use case that required us to push the boundary even further with CycleCloud. The challenge at hand was running a large workflow on a 4096-core cluster: could our software start such a cluster and resolve the issues involved in getting it up and running?

Cycle engineers accepted the challenge and built a new cluster we’ll call “Oni”. The mission of CycleCloud is to make running large computational clusters in the cloud as easy as possible. There is a lot of work that must happen behind the scenes to provision clusters both at this scale and on-demand.

What kinds of issues did we run into as we prepared to scale the CycleCloud service from building a 2048-core cluster up to the whopping 4096-core Oni cluster? This post covers four questions:

  1. Can we get 4096 cores from EC2 reliably?
  2. Can the configuration management software keep up?
  3. Can the scheduler scale?
  4. How much does a 4096-core cluster cost on CycleCloud?

 

    Question 1: Can We Get 4096 Cores from EC2 Reliably?

    We needed 512 c1.xlarge instances (each with 8 virtual cores) in EC2’s us-east region for this workload. This is a lot of instances! First, we requested that our client’s EC2 instance limit be increased. This is a manual process, but Cycle Computing has a great relationship with AWS and we secured the limit increase without issue.

    However, an increased instance limit doesn't guarantee availability.  How many instances will we actually get and how will they be distributed over the region’s Availability Zones?

    Spot Instance Pricing is No Indicator of Availability!

    Hoping to predict instance availability, we analyzed the historical Spot Price for c1.xlarge instances during our previous run, using cloudexchange to collect the data. Our request for 256 instances had caused no significant change in Spot Price, so we assumed we’d have no problem. You’ll see from the data below that we were wrong: Spot Price is no indicator of current availability! The Spot Price was unchanged over the course of our run, even though we received Out of Capacity errors in three of the four Availability Zones.
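    For reference, here is a minimal sketch of pulling the same kind of Spot Price history directly from the EC2 API. It assumes the modern boto3 Python library, not the cloudexchange tooling we used for the data above:

    ```python
    # Sketch: fetch recent c1.xlarge Spot Price history per Availability Zone.
    # Assumes boto3 and AWS credentials are configured.
    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=12)

    history = ec2.describe_spot_price_history(
        InstanceTypes=["c1.xlarge"],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start,
        EndTime=end,
    )
    for point in history["SpotPriceHistory"]:
        print(point["Timestamp"], point["AvailabilityZone"], point["SpotPrice"])
    ```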

    Availability Zone Distribution and Failure Rates

    We staged our instance requests in batches of 64 to see if we could spread our instances evenly over the four Availability Zones in the us-east-1 region. The table below shows where our instances actually landed. We quickly realized that us-east-1a, us-east-1c, and us-east-1d were full after receiving Out of Capacity errors from our requests. CycleCloud’s fail-over algorithms switched requests to Availability Zones with capacity on the fly. By the end, all of our instances were landing in us-east-1b.

    Time (EST)   Instances Requested   AZ Requested   Actual AZ    Notes
    13:08:00     64                    us-east-1a     us-east-1d   1 DOA; 1 disk failure
    13:15:00     65                    us-east-1b     us-east-1b   1 DOA
    13:25:00     64                    us-east-1c     us-east-1d   3 not reachable
    13:30:00     64                    us-east-1b     us-east-1b
    13:36:00     64                    us-east-1a     us-east-1b   1 DOA
    13:42:00     64                    us-east-1d     us-east-1b
    14:05:00     64                    us-east-1c     us-east-1b
    14:13:00     81                    us-east-1b     us-east-1b   1 DOA
    14:18:00     1                     us-east-1b     us-east-1b
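    The fail-over logic itself is conceptually simple. Below is a rough sketch of the idea, written against the boto3 Python library with placeholder parameters; it illustrates the approach, not CycleCloud’s actual implementation:

    ```python
    # Sketch: request a batch of instances in a preferred AZ and fail over to
    # the other AZs when EC2 reports it is out of capacity.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

    def launch_batch(preferred_zone, count=64):
        zones = [preferred_zone] + [z for z in ZONES if z != preferred_zone]
        for zone in zones:
            try:
                resp = ec2.run_instances(
                    ImageId="ami-00000000",            # placeholder AMI
                    InstanceType="c1.xlarge",
                    MinCount=count,
                    MaxCount=count,
                    Placement={"AvailabilityZone": zone},
                )
                return zone, [i["InstanceId"] for i in resp["Instances"]]
            except ClientError as err:
                if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                    continue                           # this AZ is full; try the next one
                raise
        raise RuntimeError("no Availability Zone could satisfy the request")
    ```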

    You can also see from the table that a few instances (roughly 1.5%) either failed to launch or were unusable after launch and had to be terminated and replaced. We saw three AWS failure modes:

    • AWS-Terminated: The instance goes from the starting state right into the terminated state (DOA).
    • Unreachable: The instance does not register with CycleCloud nor can we access it with SSH. It’s as though the instance has no network connectivity.
    • Disk failure: One instance failed to create its ephemeral disk.

    On recent large clusters we’ve seen a consistent 0.75-1.00% instance failure rate due to the first two modes.  We haven’t seen a repeat of the disk failure. These observations have led to new autoscaling features in CycleCloud that replace instances lost to these failure modes, ensuring that you get all the instances your workload requires.  As a result of our testing we are confident that we can reliably request and get 4096 cores from EC2.
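
    A simplified sketch of that detect-and-replace loop (again using boto3, with a crude TCP probe standing in for a real health check; an illustration of the idea rather than CycleCloud’s implementation):

    ```python
    # Sketch: flag instances that died on arrival or never became reachable,
    # so they can be terminated and re-requested.
    import socket
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def ssh_reachable(host, timeout=5):
        """Crude health check: can we open a TCP connection to port 22?"""
        if not host:
            return False
        try:
            with socket.create_connection((host, 22), timeout=timeout):
                return True
        except OSError:
            return False

    def find_failed(instance_ids):
        failed = []
        resp = ec2.describe_instances(InstanceIds=instance_ids)
        for reservation in resp["Reservations"]:
            for inst in reservation["Instances"]:
                state = inst["State"]["Name"]
                if state == "terminated":
                    failed.append(inst["InstanceId"])    # DOA: starting -> terminated
                elif state == "running" and not ssh_reachable(inst.get("PublicDnsName")):
                    failed.append(inst["InstanceId"])    # unreachable over the network
        return failed

    # A replacement pass terminates everything in find_failed() and requests the
    # same number of instances again, e.g. via the launch_batch() sketch above.
    ```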

    Question 2: Can the Configuration Management Software Keep Up?

    Now that we can get 512 of the 8-core c1.xlarge instances, can we configure them reliably and quickly? CycleCloud uses Chef for configuration management, which is a great solution for automating software installation and configuration on large clusters. However, we weren’t certain the Chef infrastructure could keep up with the demand from 512 c1.xlarge instances.

    Prior Peak: 64 nodes

    Previous tests had shown that some new instances would time out while registering with our Chef back end if we started more than 64 instances simultaneously. These timeouts exposed performance issues in our configuration scripts, which we could streamline. Thus, from a peak perspective, we knew to keep individual requests to about 64 nodes.

    Prior Throughput: 64 nodes every 10 minutes

    As with many things in HPC, there is a concern about both instantaneous peaks and overall throughput. CycleCloud was able to get the cluster up and configured, but we learned a few things we could improve for throughput.

    Based on our experience testing with this cluster, we beefed up our Chef infrastructure and streamlined our configuration scripts to help with scalability. We also benchmarked an increasing converge-timing scheme that we invented and found it let us increase both the peak number of nodes and the number of nodes configured per unit time. When we first started testing Oni, we could handle 64 instances every 10 minutes.
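
    We won’t detail the exact scheme here, but the general idea can be sketched in a few lines: give each node a first-converge delay that grows with its batch, so the Chef server sees a steady trickle of registrations instead of hundreds at once. (The numbers below are illustrative only, not the values we actually use.)

    ```python
    # Sketch of an increasing converge-timing scheme: stagger each node's first
    # chef-client run by batch so registrations arrive in waves rather than all
    # at once. Delay values are illustrative only.
    def first_converge_delay(node_index, batch_size=64, base_delay=60, step=120):
        """Seconds a node should wait before its first chef-client run."""
        batch = node_index // batch_size
        return base_delay + batch * step

    if __name__ == "__main__":
        for i in (0, 63, 64, 128, 511):
            print("node %3d waits %4ds" % (i, first_converge_delay(i)))
    ```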

    Post-improvements: 256 nodes every 5 minutes

    After these optimizations and infrastructure changes, we can now support simultaneous launches of up to 256 instances, or 2048 cores (4x the previous limit of 64 instances at a time), every 5 minutes. And we have ideas on how to increase that number dramatically. Thanks to our experience and the software we’ve written to automate these processes, we haven’t lost a single instance to Chef registration timeouts since this run. Now on to the scheduler…

     

    Question 3: Can the Scheduler Scale?

    For this specific workload, we had an end-user requirement to use the open-source resource manager Torque. We ran a quick test with our new 4096-core cluster to see if we could schedule jobs quickly and reliably. We found that with the default settings, Torque could not handle 4096 jobs.

    During our tests on the 4096-core cluster, Torque’s scheduler failed several times and had to be tuned. The primary parameter that required tuning was the “alarm” parameter for pbs_sched. Its default is quite short; we raised it to 3600 seconds (1 hour), which kept the pbs_sched process from dying prematurely when large workloads were submitted.

    We then submitted 4096 test jobs, but once around 3200 jobs were scheduled, the scheduler slowed to a crawl (see the graph below, around 17:40). Modifying Torque’s job_stat_rate solves this problem: increasing job_stat_rate to five minutes or more causes the pbs_server daemon to query the pbs_mom daemons for job data far less often. We can now run these Torque clusters at 4000 cores, but after this experience we recommend Condor for workloads of this size, as it seems to scale much more gracefully.

    With this experience, we are now confident we can tune Torque to consistently handle over 4000 cores. It’s important to keep in mind that as you scale the number of jobs in your environment, you need to make sure your scheduler is configured to handle the increased workload. Schedulers have many knobs that can be turned to maximize scalability and performance, so you can’t just assume one will scale with its default configuration. In this case, for 4000 cores using Torque, set ‘alarm’ to 3600 seconds and ‘job_stat_rate’ to five minutes.
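
    Those two changes boil down to one qmgr command and one scheduler startup flag. A rough sketch of applying them from Python follows; the attribute and flag names come from the Torque documentation, and the values are the ones from our tuning above:

    ```python
    # Sketch: apply the Torque tuning discussed above. Assumes qmgr is on the
    # PATH of the pbs_server host and is run with sufficient privileges.
    import subprocess

    # pbs_server: poll the pbs_mom daemons for job status every 5 minutes.
    subprocess.run(["qmgr", "-c", "set server job_stat_rate = 300"], check=True)

    # pbs_sched: the "alarm" is a startup option, so it belongs in the
    # scheduler's init script rather than here, e.g.:
    #   pbs_sched -a 3600
    ```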

     

    Question 4: How Much Does a 4096-Core Cluster Cost?

    Below are the costs for the 512-node cluster, with a 1 TB filer:

    Item                  Price (AWS + CycleCloud)   Quantity                  Cost/hr
    c1.xlarge instances   $0.816/instance-hr         512 instances x 8 cores   $417.79
    Filer                 $0.138/TB-hr (approx.)     1 TB                      $0.14
    TOTAL per hour                                                             $417.93

    So the 4096-core Oni cluster costs about $418/hr to operate.
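
    The total is just the sum of the two line items; a quick sanity check:

    ```python
    # Quick check of the hourly total in the table above.
    instance_rate = 0.816   # $/instance-hour, c1.xlarge (AWS + CycleCloud)
    filer_rate = 0.138      # approx. $/TB-hour
    instances, filer_tb = 512, 1

    total = instances * instance_rate + filer_tb * filer_rate
    print("$%.2f/hr" % total)   # -> $417.93/hr
    ```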

     

    Conclusion

    We learned many things from building and scaling Oni, and we’re excited about the new software we’ve written to handle more error cases and to scale configuration management and scheduling for wide clusters. In the end, we discovered:

    • We can get access to 512 instances with 4096 cores in total
    • Over 80% of the instances landed in us-east-1b, with the remainder in us-east-1d
    • We improved our configuration management to increase scale, so we can now create hundreds of execute nodes at once without errors
    • For Torque, a couple of settings stretched it to scale to 4000 cores, but we now appreciate how easily Condor scales to these sizes
    • The availability of 512 c1.xlarge nodes underlined the fact that cloud computing can really change the game for many types of calculations by letting you run wide and get results back faster for little extra cost

    Stay tuned for much larger CycleCloud clusters in the future! Meanwhile, if you have a cluster-bound workload that you would like to migrate to larger cluster sizes and core counts, we can help. Please contact us with any questions about this cluster or about running HPC in the cloud.
