Tag Archives: Infrastructure Stories

Australia’s Largest University Selects Mellanox CloudX Platform and Open Ethernet Switch Systems for Nationwide Research Initiative

Yesterday Mellanox issued the following press release – “Australia’s Largest University Selects Mellanox CloudX Platform and Open Ethernet Switch Systems for Nationwide Research Initiative”. Through Monash University’s own co-investment into R@CMon, the Mellanox CloudX products were chosen as the networking technology for Phase 2, providing RDMA-capable networking within and between the R@CMon Research Cloud and Data (RDSI) facilities. This means a single fabric can run multi-host MPI workloads and leverage fast I/O storage, while remaining near the cost-point of commodity networking for the resources that are generic and commodity.

This is a key ingredient to the “21st Century Microscope”, where researchers orchestrate the instruments, compute, storage, analysis and visualisation themselves, looking down and tuning this 21st century lens, using big data and big computing to make new discoveries. R@CMon has been designed to be the platform where Australian researchers can lead the way at establishing their own 21st century microscope – for themselves and for their communities.

Once again Monash is leading platform technology innovation and accessibility by example. Through 2015 we look forward to optimising this technology and to encouraging increased self-service access to these sorts of capabilities.

 


The CVL on R@CMon Phase 2

Monash is home to the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE), a national facility for the imaging and characterisation community. An important and rather novel feature of the MASSIVE compute cluster is the interactive desktop visualisation environment available to assist users in the characterisation process. The MASSIVE desktop environment provided part of the inspiration for the Characterisation Virtual Laboratory (CVL), a NeCTAR VL project combining specialist software visualisation and rendering tools from a variety of disciplines and making them available on and through the NeCTAR research cloud.

The recently released monash-02 zone of the NeCTAR cloud provides enhanced capability to the CVL, bringing a critical mass of GPU-accelerated cloud instances. monash-02 includes ten GPU-capable hypervisors, currently able to provide up to thirty GPU-accelerated instances via direct PCI passthrough. Most of these are NVIDIA GRID K2 GPUs (CUDA compute capability 3.0), though we also have one GRID K1. Special thanks to NVIDIA for providing us with a couple of seed units to get this going and supplement our capacity! After consultation with various users we created the following set of flavors/instance-types for these GPUs:

Flavor name         #vcores   RAM (MB)   /dev/vda (GB)   /dev/vdb (GB)
mon.r2.5.gpu-k2     1         5400       30              N/A
mon.r2.10.gpu-k2    2         10800      30              40
mon.r2.21.gpu-k2    4         21700      30              160
mon.r2.63.gpu-k2    12        65000      30              320
mon.r2.5.gpu-k1     1         5400       30              N/A
mon.r2.10.gpu-k1    2         10800      30              40
mon.r2.21.gpu-k1    4         21700      30              160
mon.r2.63.gpu-k1    12        65000      30              320
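For context, a GPU passthrough flavor like those above can be defined programmatically. The following is a minimal sketch using python-novaclient, assuming an admin-scoped session and a Nova PCI alias named “gridk2” (both the endpoint and the alias name are hypothetical here, not our actual configuration) that maps to the GRID K2 devices on the hypervisors.

```python
# Minimal sketch: create a GPU passthrough flavor similar to mon.r2.10.gpu-k2.
# The Keystone endpoint, credentials and the PCI alias "gridk2" are hypothetical.
from keystoneauth1 import session
from keystoneauth1.identity import v3
from novaclient import client

auth = v3.Password(auth_url="https://keystone.example.org:5000/v3",
                   username="admin", password="secret",
                   project_name="admin",
                   user_domain_id="default", project_domain_id="default")
nova = client.Client("2", session=session.Session(auth=auth))

# RAM in MB, root disk (/dev/vda) and ephemeral disk (/dev/vdb) in GB.
flavor = nova.flavors.create(name="mon.r2.10.gpu-k2",
                             ram=10800, vcpus=2, disk=30, ephemeral=40)

# Request one GRID K2 device per instance via the PCI alias extra spec.
flavor.set_keys({"pci_passthrough:alias": "gridk2:1"})
```

Each instance booted with such a flavor is then scheduled onto a GPU-capable hypervisor and receives a whole GPU via direct PCI passthrough, matching the one-GPU-per-instance model described above.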

R@CMon has so far dedicated two of these GPU nodes to the CVL, and this is our preferred method for use of this equipment, as the CVL provides a managed environment and queuing system for access (regular plain IaaS usage is available where needed). There were some initial hiccups getting the CVL’s base CentOS 6.6 image working with NVIDIA drivers on these nodes, solved by moving to a newer kernel, and some performance tuning tasks still remain. However, the CVL has now been updated to make use of the new GPU flavors on monash-02, as demonstrated in the following video…

GPU-accelerated Chimera application running on the CVL, showing the structure of human follicle-stimulating hormone (FSH) and its receptor.

If you’re interested in using GPGPUs on the cloud please contact the R@CMon team or Monash eResearch Centre.

MCC-on-R@CMon Phase 2 – HPC on the cloud

Almost a year ago, the Monash HPC team embarked on a journey to extend the Monash Campus Cluster (MCC), the university’s internal heterogeneous HPC workhorse, onto R@CMon and the wider NeCTAR Australian Research Cloud. This is an ongoing collaborative effort between the R@CMon architects and tech-crew, and the MCC team, which has long-standing and strong engagements with the Monash research community. Recently, this journey has been further enriched by the close coordination with the MASSIVE team, which will enhance the sharing of technical artefacts and learnings between the two teams.

By September 2014, MCC-on-the-Cloud had grown to over 600 cores, spanning three nodes of the Australian Research Cloud. Its growth was limited only because the Research Cloud was full, awaiting a wave of new infrastructure to be put in place. Nevertheless, Monash researchers from Engineering, Science and FIT have collectively used over 850,000 CPU-core hours. Preferring the “MCC service”, they have offered their NeCTAR allocations to be managed by the MCC team, rather than building a cluster and installing the software stack themselves. From the researchers’ perspective, this has the twofold benefit of providing a user experience consistent with that of the dedicated MCC, and freeing them from the burden of managing cloud instances, software deployment, queue management, and so on.

Deploying a usable high-performance/high-throughput computing (HPC/HTC) service on the cloud poses many challenges. Users expect the robustness and guaranteed service availability typical of traditional clusters, and this must be achieved despite the fluidity and heterogeneity of the cloud infrastructure and the nuances in service offerings across the Research Cloud nodes. For example, one user reported that jobs were cancelled by the scheduler because they exceeded the specified wall-time limits; we subsequently discovered that some MCC “cloud” compute nodes were running on oversubscribed hosts (contrary to NeCTAR architecture guidelines). Nevertheless, our efforts have paid off – MCC-on-the-Cloud is now operating and delivering the reliable HPC/HTC service, wrapped in the classic MCC look-and-feel, that Monash researchers have come to depend on. We are convinced that this is a good way to drive the federation forward.

Now with R@CMon Phase 2 coming online, we have taken a step closer towards realising this aim of “high-performance” computing on the cloud. Equipped with Intel Ivy Bridge Xeon processors, the R@CMon Phase 2 hardware stands out amidst the commodity hardware at most other NeCTAR nodes. These specialist servers are already proving invaluable for floating-point-intensive MPI applications. In production runs of a three-dimensional Spectral-Element method code, we observed nearly double the performance on these Xeons compared to the AMD Opteron nodes that make up most of the rest of the cloud, even with hyper-threading enabled. By pinning the guest vCPUs to a range of hyper-threaded cores on the host, we achieved a further 50% performance improvement – effectively more than a 2.6x improvement over the “commodity” AMD nodes. We look forward to implementing vCPU pinning once it is natively supported in OpenStack Juno, the Research Cloud’s next release.
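Until the scheduler supports pinning natively, it can be experimented with directly against libvirt on a hypervisor. The snippet below is a minimal sketch using the libvirt Python bindings; the domain name, host core count and core numbers are purely illustrative, not our production configuration.

```python
# Minimal sketch: pin a running guest's vCPUs to specific host cores via libvirt.
# Domain name, host core count and core numbers are illustrative only.
import libvirt

HOST_CORES = 32  # hypothetical number of logical cores on the hypervisor

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-0000abcd")  # hypothetical Nova instance domain

# Pin guest vCPU 0 -> host core 8 and vCPU 1 -> host core 9 (a hyper-thread pair).
for vcpu, host_core in [(0, 8), (1, 9)]:
    cpumap = tuple(i == host_core for i in range(HOST_CORES))
    dom.pinVcpu(vcpu, cpumap)

conn.close()
```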

Measured performance improvement with a production 3D Spectral Element code
R@CMon Phase 1: AMD Opteron 6276 @ 2.3 GHz
Phase 2: Intel Xeon E5-4620v2 @ 2.6 GHz

Thus, our journey continues… Once RDMA (Remote Direct Memory Access) is enabled on Phase 2, accelerated networking will make it feasible to run large-scale, multi-host MPI workloads. Achieving this will take us even closer to a truly high-performance computing environment on the cloud. Look out for MCC science stories and infrastructure updates soon!

R@CMon Phase 2 is here!

Back in 2012 our submission to NeCTAR planned R@CMon as being delivered in two phases: first a commodity phase, letting the ideals of en masse computing dominate technical choices. We have been operating phase 1 since May 2013. Our new specialist second phase went live in October! R@CMon phase 2 (the R@CMon RDC cell) scales out high-performance and accelerator hardware as driven by the demands of the precinct. Often ‘big data’ is just not possible without ‘big memory’ to hold the problem space without going to disk (roughly 100x slower). Often ‘more memory’ is the barrier, not ‘more cores’. Often it is ‘I need to interact with a 3D model’. And so on. R@CMon is now truly a scalable and critical mass of self-service, on-demand computing infrastructure. It is also the play-pit where research leaders can build their own 21st century microscopes.

NeCTAR monash-02 rack-rear

One of the four racks of NeCTAR monash-02. From top to bottom: Mellanox 56G switches, management switch, R820 compute nodes, R720 Ceph storage nodes

In addition to phase 1, phase 2 has –

  • 2064 new Intel virtual cores
  • 3 nodes with 1TB of RAM
  • 10 nodes with GPUs for 3D desktops
  • 3 nodes (the large memory ones) with high-performance PCIe SSD
  • All standard compute nodes mix SAS & SSD for low-latency local ephemeral storage
  • All nodes with RDMA (Remote Direct Memory Access – the stuff that makes fast, large-scale, multi-node HPC jobs possible) capable networking

As with phase 1, the entire infrastructure is orchestrated through OpenStack and presented on the Australian Research Cloud. R@CMon is once again pioneering research cloud infrastructure, virtualising all these specialist resources.

Over the next week we’ll blog about emerging examples of GPUs, SSDs and 1TB-memory machines…

R820 1TB RAM compute node

One of the specialist nodes – a quad-socket R820 with 1TB RAM and high-performance PCIe-attached flash

MyTardis + Salt on R@CMon

The Australian Synchrotron generates terabytes of data daily from a range of scientific instruments. In the past, this crucially important data often ended up on old disk drives, unlabelled and offline, making sharing and referencing of the data very difficult, if not impossible. MyTardis was developed as a web application for automatically receiving data from scientific instruments such as those at the Australian Synchrotron, helping researchers organise data, archive it for the long term, and share it with collaborators. MyTardis also allows researchers to cite their data, as has been done for publications in high-impact journals such as Science and Nature. Refer to this previous post highlighting the Australian Synchrotron’s new data service using MyTardis, R@CMon and VicNode.

A data publication for Astronomy

Deploying MyTardis for development use or in a multi-node cloud setup is made easier with automated configuration management tools. MyTardis now uses Salt for its configuration management, a tool gaining popularity for its accessibility and simplicity. More information about deploying MyTardis on various platforms, and how it uses Salt, is available in this story. The following diagram provides an overview of the MyTardis deployment architecture on the NeCTAR Research Cloud, and a minimal Salt sketch follows it.

MyTardis Deployment Architecture
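For illustration only (these are not the project’s actual state or minion names), driving such a deployment from the Salt master can be as simple as targeting the relevant minions and applying a state. The sketch below uses Salt’s Python LocalClient with a hypothetical “mytardis” state and a hypothetical “mytardis-web*” minion naming scheme.

```python
# Minimal sketch: apply a (hypothetical) "mytardis" Salt state to web-node minions
# from the Salt master, using Salt's Python API.
import salt.client

local = salt.client.LocalClient()

# Target minions whose IDs match "mytardis-web*" and apply the "mytardis" state.
results = local.cmd("mytardis-web*", "state.sls", ["mytardis"])

# Print the raw state results per minion (each value is a dict of state outcomes).
for minion, ret in results.items():
    print(minion, ret)
```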

Since its creation, development and deployment of MyTardis have expanded to other universities, scientific facilities and institutions to fulfil the data management requirements across research areas such as microscopy, microanalysis, particle physics, next-generation sequencing and medical imaging. An up-to-date list of scientific instruments integrated with MyTardis is included in this story on the MyTardis site.