Almost a year ago, the Monash HPC team embarked on a journey to extend the Monash Campus Cluster (MCC), the university’s internal heterogeneous HPC workhorse, onto R@CMon and the wider NeCTAR Australian Research Cloud. This is an ongoing collaborative effort between the R@CMon architects and tech-crew, and the MCC team, which has long-standing and strong engagements with the Monash research community. Recently, this journey has been further enriched by the close coordination with the MASSIVE team, which will enhance the sharing of technical artefacts and learnings between the two teams.
By September 2014, the MCC-on-the-Cloud has grown to over 600 cores, spanning across three nodes on the Australian Research Cloud. Its size was only limited because the Research Cloud was full and awaiting a wave of new infrastructure to be put in place. Nevertheless, Monash researchers from Engineering, Science, and FIT have collectively used over 850,000 CPU-core hours. Preferring the “MCC service”, they have offered their NeCTAR allocations to be managed by the MCC team, rather than building a cluster and installing the software stack by themselves. From the researchers’ perspective, this has the twofold benefit of providing a consistent user experience to that of the dedicated MCC and freeing them from the burden of managing cloud instances, software deployment, queue management, etc.
Deploying a usable high-performance/high-throughput computing (HPC/HTC) service on the cloud poses many challenges. Users expect a certain robustness and guaranteed service availability typical of traditional clusters. All this must be achieved despite the fluidity and heterogeneity of the cloud infrastructure and nuances in service offerings across the Research Cloud nodes. For example, one user reported that jobs were cancelled by the scheduler because they exceeded the specified wall time limits, and we subsequently discovered that some MCC “cloud” compute nodes were running on oversubscribed hosts (contrary to NeCTAR architecture guidelines). Nevertheless, we can declare that our efforts have paid off – MCC-on-the-cloud is now operating and delivering the reliable HPC/HTC computing service wrapped in the classic MCC look-and-feel that Monash researchers have come to depend on. Despite the many challenges, we are convinced that this is a good way to drive the federation forward.
Now with R@CMon Phase 2 coming online, we have taken a step closer towards realising this aim of “high-performance” computing on the cloud. Equipped with Intel Ivy Bridge Xeon processors, R@CMon Phase 2 hardware stands out amidst the cloud of commodity hardware on most other NeCTAR nodes. These specialist servers are already proving invaluable for floating-point intensive MPI applications. In production runs of a three-dimensional Spectral-Element method code, we observed performance of nearly double on these Xeons as compared to the AMD Opteron nodes across most of the rest of the cloud, even when hyper-threading is enabled. By pinning the guest vCPUs to a range of hyper-threaded cores on the host, we achieved a further 50% performance improvement; this is effectively over 2.6x improvement to the “commodity” AMD nodes. We look forward to implement this vCPU pinning feature once it is natively supported in OpenStack Juno, the RC’s next version.
Thus, our journey continues… Once RDMA (Remote Direct Memory Access) is enabled on Phase 2, accelerated networking will make it feasible to run large-scale, multi-host MPI workloads. Achieving this will take us even closer to a truly high-performance computing environment on the cloud. Look out for MCC science stories and infrastructure updates soon!