
R@CMon hosted Australia’s first Ceph Day

Ceph Days are a series of regular events held in support of the Ceph open-source community, and they now take place at locations all around the world. In November, R@CMon hosted Australia’s first Ceph Day. The day drew 70-odd guests, many of whom were from interstate and a few from overseas, with participants from the research sector, private industry and ICT providers. It was a fantastic celebration of Australia’s growing Ceph community.

If you don’t already know, Ceph is an open-source technology for software-defined, cluster-based storage. It means our storage backend can scale out essentially without limit, so our focus can shift to the access mechanisms for data.
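To give a sense of how thin the layer between an application and the cluster can be, here is a minimal sketch using Ceph’s librados Python bindings (the config path and pool name are illustrative only, and assume a reachable cluster with a suitable client keyring):

    import rados

    # Connect using the standard Ceph config; the paths and pool name
    # below are placeholders for illustration only.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    print("Cluster FSID:", cluster.get_fsid())

    # Write and read back a small object in a test pool.
    ioctx = cluster.open_ioctx('rcmon-test-pool')
    ioctx.write_full('hello-object', b'Hello from R@CMon')
    print(ioctx.read('hello-object'))

    ioctx.close()
    cluster.shutdown()

The same object store underpins the Ceph-based volume storage discussed below.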

Check out the promo:

R@CMon has pioneered the adoption of Ceph for accessible research data storage and, in mid-2013, was the first NeCTAR Research Cloud node to provide un-throttled volume storage. R@CMon has also worked closely with what was InkTank and is now Red Hat to develop the support model for such an enterprise (see Ceph Enterprise – a disruptive period in the storage marketplace).

The day began with the Ceph Community Director, Patrick McGarry. His presentation covered the upcoming expanded Ceph metrics platform, what the Ceph User Committee has been up to, new community infrastructure for a better contributor experience, and revised open-source governance.

Undoubtedly the highlight of the day was the joint talk given by R@CMon’s very own director, Steve Quenette, and technical lead, Blair Bethwaite. In it we explain Ceph in the context of the 21st century microscope – the tool each researcher creates to do modern-day research – and how we technically approached creating our fabric.

R@CMon announced as a Mellanox “HPC Center of Excellence”

At SuperComputing 2015 in Austin, our network/fabric partner Mellanox announced R@CMon (Monash University) as an “HPC Centre of Excellence”. A core goal of the HPC CoE is to drive the technological innovations required for the next generation (exascale) of supercomputing, whilst also ensuring that such an exascale computer is relevant to modern research. R@CMon is a stand-out pioneer in converging cloud, HPC and data, all of which are key to the “next generation”.

“We see Monash as a leader in Cloud and HPC on the Cloud with Openstack, Ceph and Lustre on our Ethernet CloudX platform.” Sudarshan Ramachandran, Regional Sales Manager, Australia & New Zealand

From a fabric innovation point of view, it has been a very productive and exciting 24 months for R@CMon. By early 2014 the internal Monash University HPC system “MCC” was burst onto the Research Cloud, allowing a researcher’s own merit to be leveraged alongside institutional investment. It also represents a shift towards soft HPC, where the size of an HPC system changes regularly over time. Earlier this year we announced our early adoption of RoCE (RDMA over Converged Ethernet) using Mellanox technologies. This meant the same fabric used for cloud networking could also be used for HPC and data storage backplanes. In turn, MCC on R@CMon also enabled RDMA communications – that is, real HPC performance on an otherwise orchestrated cloud.
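As a quick sanity check that a guest really does see an RDMA-capable device, the pyverbs bindings that ship with rdma-core can enumerate the verbs devices visible in an instance. A minimal sketch, assuming the rdma-core Python bindings are installed:

    import pyverbs.device as d

    # List the RDMA (verbs) devices visible inside this instance.
    # On a RoCE-enabled guest this should show the ConnectX port(s);
    # an empty list means no RDMA-capable device is present.
    for dev in d.get_device_list():
        print(dev.name.decode())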

 

Finally, at the OpenStack Summit Tokyo 2015, Mellanox announced R@CMon as debuting the world’s first 100G end-to-end cloud. This technology eases scaling and accommodates heterogeneous performance requirements. In particular, it sets the basis for processor and storage performance for peak and converged cloud/HPC needs. Watch this space!

 

 

R@CMon Storage

Our journey towards R@CMon Storage (Storage-as-a-Service)…

In May 2013 R@CMon went live with an OpenStack cell within the NeCTAR (Australian) Research Cloud confederation. It was an innovation in its own right, targeting the commodity end of both the fundamental and translational research needs of Australia (see R@CMon IDC Spotlight – AMD & DELL). Our technical partner, Dell, has successfully applied the design pattern to many subsequent Research Cloud nodes and to many other OpenStack-based private cloud deployments, both nationally and internationally. Shortly after the launch of this initial IaaS compute cell, we introduced Ceph-based volume storage, becoming the first volume storage service on the Research Cloud, and in doing so instigated a collaboration with InkTank (now Red Hat). By November 2014 R@CMon had launched the “Phase 2” specialist IaaS cell, an “e”-resource motivated by research that pushes boundaries. Within this cell R@CMon added an RDMA-capable interconnect to our storage and compute fabric, instigating an innovative technical collaboration with Mellanox.

Thus R@CMon is an environment to build what we call “21st Century Microscopes” – where researchers orchestrate the instruments, compute, storage, analysis and visualisation themselves, looking down and tuning this 21st century lens, using big data and big computing to make new discoveries.

And accordingly, R@CMon is an environment for innovative data services for the long tail (if you like, the more ICT-like end). Unashamedly, our instances of Ceph are what we call “enterprise”, whilst each user or tenant has their own needs around file protocol, capacity and latency.

R@CMon Storage is a collection of storage access methods and underlying storage infrastructure products. Why do we present storage as both front-ends and infrastructure? Because most users want access methods (it should just work), while most microscope builders want infrastructure (it should be a building block). R@CMon Storage is also the Monash operating centre for VicNode, where we explain some of these products.

We now have a series of R@CMon Storage products and services available, ranging from infrastructure products to access methods and data management.
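For the “building block” end of that spectrum, volume storage backed by our Ceph infrastructure can be requested directly through the OpenStack APIs. A minimal sketch using openstacksdk, where the cloud name, volume name and size are placeholders:

    import openstack

    # Authenticate via a clouds.yaml entry; 'nectar' is a placeholder name.
    conn = openstack.connect(cloud='nectar')

    # Request a 100 GB volume from the block storage (volume) service.
    volume = conn.block_storage.create_volume(name='my-research-data', size=100)
    volume = conn.block_storage.wait_for_status(volume, status='available')
    print(volume.id, volume.status)

Once available, the volume can be attached to an instance through the dashboard or the compute API and formatted with whatever file system the workload needs.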

 

Australia’s Largest University Selects Mellanox CloudX Platform and Open Ethernet Switch Systems for Nationwide Research Initiative

Yesterday Mellanox issued the following press release – “Australia’s Largest University Selects Mellanox CloudX Platform and Open Ethernet Switch Systems for Nationwide Research Initiative“. Through Monash University’s own co-investment into R@CMon, the Mellanox CloudX products were chosen as the networking technology for Phase 2, providing RDMA-capable networking within and between the R@CMon Research Cloud and Data (RDSI) facilities. This means our one fabric can run multi-host MPI workloads and leverage fast I/O storage, while remaining near the cost point of commodity networking for the resources that are generic and commodity.

This is a key ingredient to the “21st Century Microscope”, where researchers orchestrate the instruments, compute, storage, analysis and visualisation themselves, looking down and tuning this 21st century lens, using big data and big computing to make new discoveries. R@CMon has been designed to be the platform where Australian researchers can lead the way at establishing their own 21st century microscope – for themselves and for their communities.

Once again Monash is leading platform technology innovation and accessibility by example. Through 2015 we look forward to optimising this technology and to encouraging increased self-service access to these sorts of technologies.

 


The CVL on R@CMon Phase 2

Monash is home to the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE), a national facility for the imaging and characterisation community. An important and rather novel feature of the MASSIVE compute cluster is the interactive desktop visualisation environment available to assist users in the characterisation process. The MASSIVE desktop environment provided part of the inspiration for the Characterisation Virtual Laboratory (CVL), a NeCTAR VL project combining specialist software visualisation and rendering tools from a variety of disciplines and making them available on and through the NeCTAR research cloud.

The recently released monash-02 zone of the NeCTAR cloud provides enhanced capability to the CVL, bringing a critical mass of GPU-accelerated cloud instances. monash-02 includes ten GPU-capable hypervisors, currently able to provide up to thirty GPU-accelerated instances via direct PCI passthrough. Most of these are NVIDIA GRID K2 GPUs (CUDA compute capability 3.0), though we also have one K1. Special thanks to NVIDIA for providing us with a couple of seed units to get this going and supplement our capacity! After consultation with various users we created the following set of flavors/instance-types for these GPUs:

Flavor name         #vcores   RAM (MB)   /dev/vda (GB)   /dev/vdb (GB)
mon.r2.5.gpu-k2     1         5400       30              N/A
mon.r2.10.gpu-k2    2         10800      30              40
mon.r2.21.gpu-k2    4         21700      30              160
mon.r2.63.gpu-k2    12        65000      30              320
mon.r2.5.gpu-k1     1         5400       30              N/A
mon.r2.10.gpu-k1    2         10800      30              40
mon.r2.21.gpu-k1    4         21700      30              160
mon.r2.63.gpu-k1    12        65000      30              320

R@CMon has so far dedicated two of these GPU nodes to the CVL, and this is our preferred way for this equipment to be used, as the CVL provides a managed environment and queuing system for access (regular plain IaaS usage is available where needed). There were some initial hiccups getting the CVL’s base CentOS 6.6 image working with the NVIDIA drivers on these nodes, solved by moving to a newer kernel, and some performance-tuning tasks still remain. However, the CVL has now been updated to make use of the new GPU flavors on monash-02, as demonstrated in the following video…

GPU-accelerated Chimera application running on the CVL, showing the structure of human follicle-stimulating hormone (FSH) and its receptor.

If you’re interested in using GPGPUs on the cloud please contact the R@CMon team or Monash eResearch Centre.
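If you do go down the plain IaaS route, booting an instance on one of the GPU flavors in the table above looks like any other server launch. A minimal openstacksdk sketch; the cloud, image, network and keypair names are placeholders:

    import openstack

    conn = openstack.connect(cloud='nectar')  # placeholder clouds.yaml entry

    # Boot an instance on one of the GPU flavors listed above.
    server = conn.create_server(
        name='gpu-test',
        image='my-gpu-image',           # placeholder image name
        flavor='mon.r2.10.gpu-k2',      # GPU flavor from the table above
        network='my-project-network',   # placeholder network name
        key_name='my-keypair',          # placeholder keypair name
        wait=True,
    )
    print(server.status, server.addresses)

Inside the instance the passed-through GPU appears as a regular PCI device, ready for the NVIDIA driver and CUDA toolkit.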

MCC-on-R@CMon Phase 2 – HPC on the cloud

Almost a year ago, the Monash HPC team embarked on a journey to extend the Monash Campus Cluster (MCC), the university’s internal heterogeneous HPC workhorse, onto R@CMon and the wider NeCTAR Australian Research Cloud. This is an ongoing collaborative effort between the R@CMon architects and tech-crew, and the MCC team, which has long-standing and strong engagements with the Monash research community. Recently, this journey has been further enriched by the close coordination with the MASSIVE team, which will enhance the sharing of technical artefacts and learnings between the two teams.

By September 2014, MCC-on-the-Cloud had grown to over 600 cores, spanning three nodes of the Australian Research Cloud. Its size was limited only because the Research Cloud was full and awaiting a wave of new infrastructure to be put in place. Nevertheless, Monash researchers from Engineering, Science and FIT have collectively used over 850,000 CPU-core hours. Preferring the “MCC service”, they have offered their NeCTAR allocations to be managed by the MCC team, rather than building a cluster and installing the software stack themselves. From the researchers’ perspective, this has the twofold benefit of providing a user experience consistent with the dedicated MCC and freeing them from the burden of managing cloud instances, software deployment, queue management, and so on.

Deploying a usable high-performance/high-throughput computing (HPC/HTC) service on the cloud poses many challenges. Users expect a certain robustness and guaranteed service availability typical of traditional clusters. All this must be achieved despite the fluidity and heterogeneity of the cloud infrastructure and nuances in service offerings across the Research Cloud nodes. For example, one user reported that jobs were cancelled by the scheduler because they exceeded the specified wall time limits, and we subsequently discovered that some MCC “cloud” compute nodes were running on oversubscribed hosts (contrary to NeCTAR architecture guidelines). Nevertheless, we can declare that our efforts have paid off – MCC-on-the-cloud is now operating and delivering the reliable HPC/HTC computing service wrapped in the classic MCC look-and-feel that Monash researchers have come to depend on. Despite the many challenges, we are convinced that this is a good way to drive the federation forward.

Now, with R@CMon Phase 2 coming online, we have taken a step closer towards realising this aim of “high-performance” computing on the cloud. Equipped with Intel Ivy Bridge Xeon processors, R@CMon Phase 2 hardware stands out amidst the cloud of commodity hardware on most other NeCTAR nodes. These specialist servers are already proving invaluable for floating-point-intensive MPI applications. In production runs of a three-dimensional spectral-element method code, we observed nearly double the performance on these Xeons compared to the AMD Opteron nodes across most of the rest of the cloud, even with hyper-threading enabled. By pinning the guest vCPUs to a range of hyper-threaded cores on the host, we achieved a further 50% performance improvement; this is effectively more than a 2.6x improvement over the “commodity” AMD nodes. We look forward to implementing this vCPU pinning feature once it is natively supported in OpenStack Juno, the Research Cloud’s next version.


Measured performance improvement with a production 3D Spectral Element code
R@CMon Phase 1: AMD Opteron 6276 @ 2.3 GHz
Phase 2: Intel Xeon E5-4620v2 @ 2.6 GHz
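To make the arithmetic behind that combined figure explicit, here is a back-of-the-envelope sketch (the ratios are indicative only and depend on the code and problem size):

    # Rough combination of the two effects described above.
    xeon_vs_opteron = 1.75   # "nearly double" on Phase 2 Xeons vs the AMD Opterons
    pinning_gain = 1.5       # a further ~50% from pinning guest vCPUs to host cores

    combined = xeon_vs_opteron * pinning_gain
    print(f"Combined speedup over commodity AMD nodes: ~{combined:.1f}x")  # ~2.6x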

Thus, our journey continues… Once RDMA (Remote Direct Memory Access) is enabled on Phase 2, accelerated networking will make it feasible to run large-scale, multi-host MPI workloads. Achieving this will take us even closer to a truly high-performance computing environment on the cloud. Look out for MCC science stories and infrastructure updates soon!
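For a sense of what those multi-host MPI workloads look like from the user’s side, here is a minimal mpi4py sketch of the kind of tightly coupled, latency-sensitive communication that benefits from an RDMA fabric (the launcher, process count and host file are site-specific):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Simple ring exchange between neighbouring ranks - the sort of
    # traffic pattern that an RDMA-capable fabric accelerates.
    right = (rank + 1) % size
    left = (rank - 1) % size
    token = comm.sendrecv(rank, dest=right, source=left)
    print(f"rank {rank}/{size} received a token from rank {token}")

Launched with, for example, mpirun across several cloud instances, the same script scales from a single node to a multi-host job without modification.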

R@CMon Phase 2 is here!

Back in 2012 our submission to NeCTAR planned R@CMon as being delivered in two phases: first a commodity phase, letting the ideals of en-masse computing dominate technical choices. We have been operating Phase 1 since May 2013. Our new specialist second phase went live in October! R@CMon Phase 2 (the R@CMon RDC cell) scales out high-performance and accelerator hardware as driven by the demands of the precinct. Often ‘big data’ is just not possible without ‘big memory’ to hold the problem space without going to disk (roughly 100x slower). Often ‘more memory’ is the barrier, not ‘more cores’. Often the need is simply ‘I need to interact with a 3D model’. And so on. R@CMon is now truly a scalable, critical mass of self-service, on-demand computing infrastructure. It is also the play-pit where research leaders can build their own 21st century microscopes.


One of the four racks of NeCTAR monash-02. From top to bottom: Mellanox 56G switches, management switch, R820 compute nodes, R720 Ceph storage nodes

In addition to phase 1, phase 2 has –

  • 2064 new Intel virtual cores
  • 3 nodes with 1TB of RAM
  • 10 nodes with GPUs for 3D desktops
  • 3 nodes (the large memory ones) with high-performance PCIe SSD
  • All standard compute nodes mix SAS & SSD for low-latency local ephemeral storage
  • All nodes with RDMA-capable networking (Remote Direct Memory Access – the stuff that makes fast, large-scale, multi-node HPC jobs possible)

As with phase 1, the entire infrastructure is orchestrated through OpenStack and presented on the Australian Research Cloud. R@CMon is once again pioneering research cloud infrastructure, virtualising all these specialist resources.

Over the next week we’ll blog with emerging examples of GPUs, SSDs and 1TB memory machines…


One of the specialist nodes – a quad-socket R820 with 1TB RAM and high-performance PCIe-attached flash