Interferome on R@CMon

Interferons (IFNs) were identified as antiviral proteins more than 50 years ago. However, their involvement in immunomodulation, cell proliferation, inflammation and other homeostatic processes has since been established. These cytokines are used as therapeutics in many diseases, such as chronic viral infections, cancer and multiple sclerosis. IFNs regulate the transcription of approximately 2,000 genes in an IFN subtype-, dose-, cell type- and stimulus-dependent manner.

Interferome Wordle

Interferome is an online database of IFN regulated genes. The database is a valuable resource for biomedical researchers and is regularly used by scientists from across the world. It integrates information from high-throughput experiments to build a detailed understanding of IFN biology. Interferome enables reliable identification of individual Interferon Regulated Genes (IRGs) or IRG signatures from high-throughput data sets (e.g. microarray or proteomic data). It also assists in identifying regulatory elements, chromosomal location and tissue expression of IRGs in humans and mice.
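
As a rough conceptual sketch only (this is not Interferome's code or API, and the experimental gene list below is made up), identifying an IRG signature essentially means intersecting a differentially expressed gene list from an experiment with a curated set of known IFN regulated genes:

```python
# Conceptual sketch only -- not Interferome's code or API.
# Identifying an IRG signature amounts to intersecting an experimental
# gene list with a curated set of known interferon regulated genes.

known_irgs = {"MX1", "OAS1", "IRF7", "ISG15", "IFIT1"}   # toy curated set
experiment_hits = ["ISG15", "GAPDH", "MX1", "ACTB"]      # e.g. from a microarray

irg_signature = [gene for gene in experiment_hits if gene in known_irgs]
print(irg_signature)   # ['ISG15', 'MX1']
```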

Interferome Database Statistics

The R@CMon team assisted Prof. Paul Hertzog and the Centre for Innate Immunity and Infectious Diseases at MIMR-PHI in migrating versions 1.0 and 2.0 of the Interferome online database onto the NeCTAR Research Cloud. Interferome version 2.0 adds quantitative data, more detailed annotation and richer search capabilities, and can be queried for a single gene or for thousands at once, such as a gene list from a microarray experiment. To ensure availability of the data and to assist researchers with hypothesis generation and novel biological discoveries, the Interferome database is backed by VicNode Collection 2014R9.06. More information about Interferome is available on its help page.

Bioplatforms Australia – CSIRO NGS Workshop (July 1-3, 2014)

On July 1-3, 2014, the latest Bioplatforms Australia – CSIRO joint Next Generation Sequencing hands-on workshop was held at the University of New South Wales, Sydney. The workshop was delivered using the established Bioinformatics Training Platform running on the NeCTAR Research Cloud and provided bench biologists and PhD students with NGS training on the following topics:

      • Introduction to the command-line interface – Software Carpentry
      • Introduction to Next Generation Sequencing
      • Illumina Next Generation Sequencing Data Quality
      • Sequence Alignment Algorithms
      • ChIP-Seq Analysis
      • RNA-Seq Analysis
      • de novo Genome Assembly
Sequence data quality analysis and visualisation using FastQC and FASTX-Toolkit.
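
For readers who want to try that QC step themselves, here is a minimal sketch (not the workshop's actual exercise material) that drives FastQC and the FASTX-Toolkit quality filter from Python; it assumes both tools are installed and on the PATH, and the file names are placeholders:

```python
# Minimal QC sketch (not the workshop material): run FastQC and a
# FASTX-Toolkit quality filter on a FASTQ file. Assumes both tools are
# installed and on the PATH; file names are placeholders.
import subprocess
from pathlib import Path

reads = "sample.fastq"          # placeholder input file
qc_dir = Path("qc_reports")
qc_dir.mkdir(exist_ok=True)

# FastQC: per-base quality, GC content, adapter content, etc.
subprocess.run(["fastqc", reads, "--outdir", str(qc_dir)], check=True)

# fastq_quality_filter: keep reads where >= 80% of bases have quality >= 20
# (-Q 33 selects the Sanger/Illumina 1.8+ quality encoding).
subprocess.run(
    ["fastq_quality_filter", "-Q", "33", "-q", "20", "-p", "80",
     "-i", reads, "-o", "sample.filtered.fastq"],
    check=True,
)
```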

The R@CMon team helped the workshop organisers update the training environment with the latest tools, datasets and other materials, and ensured resource stability throughout the three-day workshop. Future Bioplatforms Australia and CSIRO joint workshops will be announced on the Bioplatforms Australia Training page.

Alignment visualisation using IGV.

The trainees had the following to say about the workshop:

“The practical component made it 1000 times easier to get my head around the course and I feel like I can be confident in actually applying what I’ve learned (instead of just in lecture format).”

“The beginning with introduction to Unix environment and explanation of the de novo assembly was the best part of the course as the commands were described in more detail so I could understand what the different commands were executing. There was more practical work with the de novo assembly which was good.”

“Hands on experience is good, and the first part on command lines is good for the beginners.”

Spreadsheet of death

R@CMon, thanks to the Monash eResearch Centre’s long history of establishing “the right hardware for research”, prides itself on providing effective computing, orchestration and storage for research. In this post we highlight an engagement that didn’t yield an “effectiveness” to our liking, and how it helped shape elements of the imminent R@CMon phase 2.

In the latter part of 2013 the R@CMon team was approached by a visiting student working at the Water Sensitive Cities CRC. His research project involved parameter estimation for an ill-posed problem in groundwater dynamics. He had set up (and perhaps partially inherited) an Excel spreadsheet-based Monte Carlo engine for this, with a front-end sheet providing input and output to a built-in VBA macro for the grunt work – an, erm… interesting approach! This had been working acceptably at small scale, as he could get an evaluation done within 24 hours on his desktop machine (quad-core i7). But now he needed to scale up and run 11 different models, probably a few times each to tweak the inputs. Been there yourself? This is a very common pattern!

Nash-Sutcliffe model efficiency (figure courtesy of Eng. Antonello Mancuso, PhD, University of Calabria, Italy)
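
The scoring behind that figure is simple enough to sketch: the Nash-Sutcliffe efficiency compares model predictions against observations, and a Monte Carlo search simply keeps the parameter sets that score best. The snippet below is an illustrative rewrite in Python/NumPy, not the student’s spreadsheet; the model and data are placeholders.

```python
# Illustrative sketch in NumPy -- not the student's VBA spreadsheet.
# Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
import numpy as np

def nash_sutcliffe(observed, simulated):
    observed, simulated = np.asarray(observed), np.asarray(simulated)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum(
        (observed - observed.mean()) ** 2
    )

def model(params, t):
    # Placeholder "groundwater" model: exponential recession with two parameters.
    a, k = params
    return a * np.exp(-k * t)

rng = np.random.default_rng(42)
t = np.arange(100.0)
observed = model((5.0, 0.03), t) + rng.normal(0, 0.1, t.size)  # synthetic observations

# Toy Monte Carlo parameter search (10,000 samples): keep the best NSE.
best = max(
    ((nash_sutcliffe(observed, model(p, t)), p)
     for p in zip(rng.uniform(1, 10, 10_000), rng.uniform(0.001, 0.1, 10_000))),
    key=lambda x: x[0],
)
print("best NSE %.3f with params %s" % best)
```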

MCC (the Monash Campus Cluster), our first destination for ‘compute’, doesn’t have any Windows capability, and even if it did, attempting to run Excel in batch mode would have been something new for us. No problem, we thought: we’d use the RC, give him a few big Windows instances and let him spread the calculations across them. Not an elegant or automated solution for sure, but this was a one-off with tight time constraints, so it was more important to start the calculations than to get bogged down building a nicer solution.

It took a few attempts to get Windows working properly. We eventually found the handy Cloudbase Solutions trial image and its guidance documentation. But we also ran into issues activating Windows against the Monash KMS; it turned out we had to explicitly select our local network time source instead of the default time.windows.com. We also found problems with the CPU topology that Nova was giving our guests: Windows was seeing multiple sockets rather than multiple cores, which ruled out desktop variants as they would ignore most of the cores.

Soon enough we had a Server 2012 instance ready for testing. The user RDP’d in and set the cogs turning. Based on the first few Monte Carlo iterations (out of the million he needed for each scenario) he estimated it would take about two days to complete a scenario – quite a lot slower than his desktop, but still acceptable given the overall scale-out speed-up. However, on the third day, after about 60 hours of compute time, he reported it was only 55% complete. That was an unsustainable pace – he needed results within a fortnight – so he and his supervisor resolved to code a different statistical approach (using PEST) that would be more amenable to cluster computing.

We did some rudimentary performance investigation during the engagement and didn’t find any obvious bottlenecks – the guest and host were always very CPU-busy – so the slowness seemed largely attributable to the lesser floating-point capabilities of our AMD Bulldozer CPUs. We didn’t investigate deeply in this case, and no doubt other elements could be at play (maybe Windows is much slower for compute on KVM than Linux), but this is now a pattern we’ve seen with floating-point-heavy workloads across operating systems and on bare metal. Perhaps code optimisations for the shared FPU in the Bulldozer architecture could improve things, but that’s hardly a realistic option for a spreadsheet.

The AMDs are great (especially thanks to their price) for general-purpose cloud usage; that’s why the RC makeup is dominated by them and why commercial clouds like Azure use them. But for R@CMon’s phase 2 we want to cater to performance-sensitive as well as throughput-oriented workloads, which is why we’ve deployed Intel CPUs for this expansion. Monash joins the eRSA and NCI Nodes in offering this high-end capability. More on the composition of R@CMon phase 2 in the coming weeks!

Ceph Enterprise – a disruptive period in the storage marketplace

Recently R@CMon signed an agreement for Inktank Ceph Enterprise (aka ICE). Inktank is an open-source development and professional services company that spun out of DreamHost two years ago. And boy, have they been busy since then – not only building an international open-source community around Ceph, but also solidifying the product through several major releases while exploring enterprise support business models. It’s both the innovation rate and the professional maturity that drew us to Ceph. The next release of ICE (due this month) includes support for erasure coding (think distributed RAID) and cache tiering (think SSD performance at near spinning-disk cost)!
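
To make the “distributed RAID” comparison concrete, here is a back-of-the-envelope sketch (the k/m profiles are illustrative only, not our actual pool settings): a k+m erasure-coded pool stores (k+m)/k times the raw data and tolerates m lost OSDs, versus 3x overhead for triple replication.

```python
# Back-of-the-envelope comparison of erasure coding vs replication.
# The k/m values are illustrative only, not our actual Ceph pool profile.
def ec_overhead(k, m):
    """Raw-to-usable ratio for a k+m erasure-coded pool."""
    return (k + m) / k

for k, m in [(4, 2), (8, 3), (10, 4)]:
    print(f"k={k} m={m}: {ec_overhead(k, m):.2f}x raw overhead, "
          f"survives {m} OSD failures")

print("3-way replication: 3.00x raw overhead, survives 2 OSD failures")
```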

The R@CMon–Ceph journey began in September last year, when we introduced Ceph-based block storage to the NeCTAR Research Cloud. Others have been paying attention too, and several other NeCTAR Nodes (NCI, TPAC and QCIF so far) are now using Ceph to meet their block storage needs. Monash now has just shy of 400TB of raw capacity co-located with the monash-01 compute zone/cell, and another 500TB coming with the monash-02 deployment. With other developments in the pipeline we expect to have well over 1PB of storage managed by Ceph by the end of the year!

One of the things that came with the release of ICE was a closed-source management tool, Calamari, which provides the sort of graphical status, monitoring and configuration dashboard you might expect from “enterprise” software (how that term makes us cringe!). Calamari is starting to look pretty slick, and it’s certainly one piece of the puzzle in transitioning a storage solution adopted and built in the eResearch healthy hot-house environment into a robust service delivery practice – getting woken up at 2am if a storage node dies is not in my contract!

Calamari OSD Workbench (Ceph Calamari GUI)

Not long after we’d signed up for ICE, Inktank was acquired by Red Hat for a cool $175 million. The announcement took us by surprise – Red Hat already has an iron in the software-defined storage fire with Red Hat Storage Server (GlusterFS wearing a fedora). But once digested, this looks like a very astute move by Red Hat: Ceph has been the sweetheart of storage for OpenStack deployers, thanks largely to its ability to converge object, block and filesystem storage, something that took the NAS-oriented Gluster a while to catch up on. It seems Red Hat now has a firm grip on software-defined storage!

Red Hat’s acquisition of Inktank is good news for us as it will ultimately mean a supported version of Ceph on an enterprise Linux distribution we have a broad skills base in, with local technical support personnel to back it all up, and in a much shorter timeframe than Inktank could have delivered on its own. And true to their upstream-first philosophy, Red Hat has already open-sourced Calamari.

It’s an exciting and disruptive period in the storage market!

VISIONET on R@CMon

VISIONET (Visualizing Transcriptomic Profiles Integrated with Overlapping Transcription Factor Networks) is a visualisation web service for cellular regulatory network studies. It was developed as a tool for creating human-readable visualisations of transcription factor networks from users’ microarray and ChIP-seq input data. VISIONET’s node-filtering feature makes large networks far more readable than the visualisations produced by tools such as CellDesigner and Cytoscape.
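
As a rough illustration of the node-filtering idea (this is not VISIONET’s code, which is .NET-based; the edges and threshold below are made up), dropping weakly connected genes is what turns a hairball into something readable:

```python
# Rough illustration of node filtering for readability -- not VISIONET's
# .NET implementation. Keep only genes connected to at least `min_degree`
# other nodes; the example edges are made up.
import networkx as nx

edges = [("Gata4", "Nppa"), ("Gata4", "Myh6"), ("Tbx20", "Nppa"),
         ("Tbx20", "Myh6"), ("Gata4", "GeneX"), ("Tbx20", "GeneY")]
G = nx.Graph(edges)

min_degree = 2
keep = [node for node, degree in G.degree() if degree >= min_degree]
filtered = G.subgraph(keep)

print(filtered.nodes())  # the densely connected core of the network
```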

Gata4-Tbx20 transcription factor network.

R@CMon helped SBI Australia port the VISIONET web service onto the NeCTAR Research Cloud, enabling rapid development and customisation. VISIONET’s .NET-based framework now runs on a Windows Server 2012 instance inside R@CMon and uses persistent storage (Volumes) to hold the large generated network visualisations. VISIONET is now publicly available to biologists, and user traffic is expected to grow in the near future.