Tag Archives: Bioinformatics

iLearn on R@CMon

An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data: the impact of making machine learning good practice readily available to the community.

Associate Professor Jiangning Song is a long-standing user of the Monash Research Cloud (R@CMon). He is the lead of the Song Lab within the Monash Biomedicine Discovery Institute. Jiangning’s journey began with the deployment of the Protease Specificity Prediction Server – PROSPER app in 2014. Since then the lab has launched more than 30 bioinformatics web services, all of which are made available to research communities worldwide.

Their latest contribution, iLearn, addresses key obstacles to the adoption of machine learning applied to sequencing data. Well-annotated DNA, RNA and protein sequence data is increasingly accessible to all biological researchers. However, at the scale of this data it is challenging, if not impossible, for an individual to investigate it manually. Another obstacle to broad-scale access is that investigation and validation through wet laboratory experiments are time consuming and expensive. Hence, when presented appropriately, machine learning can play an important role in making higher-level biological data accessible to many researchers in the biosciences.

Many of the previous works and tools only focus on a specific step within a data-processing pipeline. The user is then responsible for chaining these tools together, which in most cases is challenging due to incompatibilities between tools and data formats. iLearn has been designed to address these limitations, using common patterns informed by the lab and its collaborators.

An emerging breakdown of the pipeline steps is:

  • Feature extraction
  • Clustering
  • Normalization
  • Selection
  • Dimensionality reduction
  • Predictor extraction
  • Performance evaluation
  • Ensemble training
  • Results visualisation

iLearn packages these steps for use in two ways. Users can use iLearn through an online environment (web server) or as a stand-alone Python toolkit. Whether your interest is in DNA, RNA or protein analysis, iLearn provides a common workflow pattern for all three cases. Users input their sequence data (normally in FASTA format) and then enter various descriptors and parameters for the analysis.
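To make the first steps of such a pipeline concrete, here is a minimal sketch in plain Python of reading FASTA input, extracting one simple descriptor (amino acid composition) and normalising the resulting features. It is not iLearn's own code or API: the function names, the choice of descriptor and the example sequences are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature columns line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid among the standard residues."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def min_max_normalise(rows):
    """Rescale each feature column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical input, made up for this example.
EXAMPLE_FASTA = """>seq1 hypothetical example
ACDA
>seq2 hypothetical example
AAAA
"""

records = parse_fasta(EXAMPLE_FASTA)
features = [amino_acid_composition(seq) for _, seq in records]
normalised = min_max_normalise(features)
```

The normalised rows could then be passed to any clustering, feature selection or classification step; keeping every stage as a function over plain lists is what makes the steps easy to chain.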

The results page shows the various outputs, once again informed by the lab’s good practices. They can be downloaded from the web server in various formats (e.g. CSV, TSV). High-quality diagrams and visualisations are also generated by iLearn within the web server.

Since iLearn’s release, more than 5,000 unique users have used the web server worldwide. The user community and resultant impact continue to grow, with 60 citations since the tool’s seminal publication.

iLearn has been used as an efficient and powerful complementary tool for orchestrating machine-learning-based modelling, which in turn speeds up biomedical discoveries through genomics and data analysis. As new descriptors are developed and optimised, iLearn aims to incorporate them into future releases to further improve its performance, with the R@CMon team providing support to tackle the potential increase in computational and storage complexity.

This article can also be found, published under Creative Commons, here 1.

GlycoMine on R@CMon

Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes such as cellular communication, ligand recognition and subcellular recognition. It is estimated that more than 50% of the entire human proteome is glycosylated. However, identifying glycosylation sites remains a significant challenge, as it requires expensive and laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilising this very important PTM.
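For readers unfamiliar with the term, the sketch below shows what a sequon scan looks like. It finds candidate N-linked sites using the well-known consensus Asn-X-Ser/Thr, where X is any residue except proline. This only lists candidate positions; it does not predict whether a site is actually occupied, and it is not the method used by any particular prediction tool. The function name and the test sequence are made up for illustration.

```python
import re

# A lookahead is used so that overlapping sequons (e.g. in "NNSS") are all found.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_linked_sequons(sequence):
    """Return 1-based positions of Asn residues in N-X-S/T sequons (X != Pro)."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


# Hypothetical sequence: N-G-S matches, N-P-T is excluded, N-N-S and N-S-S overlap.
candidate_sites = find_n_linked_sequons("MNGSANPTNNSS")
```

O-linked glycosylation has no comparably simple sequence consensus, which is one reason why learned predictors are of interest.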

Predicted N-linked glycosylation sites from two case-study proteins using GlycoMine-Struct

Dr. Jiangning Song from the Department of Biochemistry and Molecular Biology at Monash University and his collaborators have designed and developed a bioinformatics tool, GlycoMine-Struct, for predicting glycosylation sites. GlycoMine-Struct is a comprehensive tool for the systematic in-silico identification of N-linked and O-linked glycosylation sites in the human proteome. Through R@CMon, a dedicated cloud project with computational and storage resources has been provisioned to develop and host the GlycoMine-Struct tool. The flexible and scalable R@CMon-powered development environment enabled rapid prototyping, testing and re-deployment of the tool.

GlycoMine-Struct Main Page (http://glycomine.erc.monash.edu/Lab/GlycoMine_Struct/index.jsp#Introduction)

GlycoMine-Struct is now a publicly accessible web service, available to the wider research community. Users can easily submit protein structure input files in PDB (Protein Data Bank) format to perform site prediction on GlycoMine-Struct. Since it went public, GlycoMine-Struct has been accessed and used by thousands of local and international users, and that number is still growing. A Scientific Reports paper has been published, highlighting the collaborative work done to develop GlycoMine-Struct as an essential bioinformatics tool for improving the prediction of human glycosylation sites. The R@CMon team is actively supporting the GlycoMine-Struct project as it continues to serve the research community and develop performance improvements.

XCMSPlus Metabolomics Analysis on R@CMon

At the start of 2017, the R@CMon team had its first user consultation with Dr. Sri Ramarathinam, a research fellow from the Immunoproteomics Laboratory (Purcell Laboratory) at the School of Biomedical Sciences at Monash University. Sri and his group study metabolomic compounds in various samples by conducting a “search” and “identification” process using a pipeline of analysis and visualisation tools. The lab has acquired a licence to use the commercial XCMSPlus metabolomics platform from SCIEX in their workflow. XCMSPlus provides a powerful solution for analysis of untargeted metabolomics data in a stand-alone configuration, which will greatly increase the lab’s capacity to analyse more samples, with faster and easier results generation and interpretation.

XCMSPlus main login Page, entry point of the complete metabolomics platform

During the first engagement meeting with Sri and the lab, it was highlighted that a specialised hosting platform (with appropriate storage and computational capacity) would be required for XCMSPlus. XCMSPlus is distributed by the vendor as a stand-alone appliance (personal cloud). As an appliance, XCMSPlus has been optimised and packaged to be deployed on a single, multi-core, high-memory machine. An added minor complication was that this appliance was distributed in VMware’s appliance format, which needed to be converted into an OpenStack-friendly format. The R@CMon team provided the hosting platform required for XCMSPlus through the Monash node of the Nectar Research Cloud.
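The post does not say how the conversion was done. One common approach, shown here purely as an assumption, is to convert the appliance's VMDK disk to QCOW2 with `qemu-img` and then register it with the OpenStack image service. The sketch below only builds and prints the commands (a dry run); the file and image names are hypothetical.

```python
import shlex
from pathlib import Path


def conversion_commands(vmdk_path, image_name):
    """Build (but do not run) commands to convert a VMDK disk and upload it."""
    vmdk = Path(vmdk_path)
    qcow2 = vmdk.with_suffix(".qcow2")
    # Convert the VMware disk image to QCOW2.
    convert = ["qemu-img", "convert", "-f", "vmdk", "-O", "qcow2",
               str(vmdk), str(qcow2)]
    # Register the converted disk as an image with the OpenStack CLI.
    upload = ["openstack", "image", "create",
              "--disk-format", "qcow2", "--container-format", "bare",
              "--file", str(qcow2), image_name]
    return [convert, upload]


# Hypothetical file and image names, used for illustration only.
for command in conversion_commands("appliance-disk1.vmdk", "example-appliance"):
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before running anything against a real cloud project.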

Analysis results and visualisation in XCMSPlus

A dedicated Nectar project has been provisioned for the lab, which is now being used for hosting XCMSPlus. This project also has enough capacity for future expansion and new analysis platform deployments. The now R@CMon-hosted (and supported) XCMSPlus platform for the Immunoproteomics Laboratory is the first custom XCMSPlus deployment in Australia. As the first such deployment, it ran into some minor issues during its early test runs. These technical issues were eventually resolved through collaborative troubleshooting by the R@CMon team, the lab and the vendor. After several months of usage, with hundreds of jobs submitted and processed by XCMSPlus and counting, the lab is continuing to integrate it fully into their analysis workflow. The R@CMon team is actively engaging with the lab to support its adoption of XCMSPlus and to plan for future analysis workflow expansions.

Disruptive change in the clinical treatment of pancreatic cancer

Professor Jenkins’ research focuses on pancreatic cancer, an inflammation-associated cancer and the fourth most common cause of cancer death worldwide, with an extremely low 5% five-year survival rate. Typically, studies compare gene expression patterns between normal and cancerous pancreas in order to identify unique signatures that can indicate sensitivity or resistance to specific chemotherapeutic treatments.

“Using next generation gene sequencing, involving big instruments, big data and big computing – allows near-term disruptive change in the clinical treatment of pancreatic cancer.” Prof. Jenkins, Monash Health.

To date, gene expression studies have largely focused on samples taken from open surgical biopsy; a procedure known to be very invasive and only possible in 20% of pancreatic cancers. Prof Jenkins’ group, in collaboration with Dr Daniel Croagh from the Department of Upper Gastrointestinal and Hepatobiliary Surgery at Monash Medical Centre, recently trialled an alternative less invasive process available to nearly all pancreatic cancer patients known as endoscopic ultrasound-guided fine-needle aspirate (EUS-FNA) which uses a thin, hollow needle to collect the samples of cells from which genetic material can be extracted and analysed. The challenge then becomes to ensure gene sequencing from EUS-FNA samples is comparable to open surgical biopsy such that established analysis and treatment can be used.

Twenty-four EUS-FNA-derived genetic samples from normal and cancerous pancreas were sequenced at the MHTP Medical Genomics Facility producing a total amount of 40Gb of raw data. Those data were securely transferred onto R@CMon by the Monash Bioinformatics Platform for processing, statistical analysis and computational exploration using state-of-the-art Bioinformatics methods.


Results thus far from this study show that data from EUS-FNA-derived samples were of high quality and also allowed the identification of gene expression signatures between normal and cancerous pancreas. Professor Jenkins’ group is now confident that EUS-FNA-derived material not only has the potential to capture nearly all pancreatic cancer patients (compared to ~20% by surgery), but also to improve patient management and treatment in the clinic.

“The current clinical genomics research space requires specialized high performance computational and storage infrastructure to support the processing and long term storage of those so-called “big data”. Thus R@CMon plays a major role in the discovery and development of new therapies and the improvement of Human health care in general.” Roxane Legaie, Senior Bioinformatician, Monash Bioinformatics Platform


The Ramialison Group Analysis Workflow on R@CMon

The Ramialison Group at the Australian Regenerative Medicine Institute (ARMI), located in the biomedical research precinct of Monash University, Clayton, specialises in systems biology both on the bench and through computational analysis. Their work is driven by the in vivo and in silico dissection of regulatory mechanisms involved in heart development. Deregulation of such mechanisms causes congenital heart disease, which results in 1 in every 100 babies in Australia being born with heart defects.

Heatmap generated from transcriptomic data from heart samples (Nathalia Tan)

Their research focuses on identifying DNA elements that play a crucial role in the development of the heart and that could be impaired in disease. To identify these sequences, several genome-wide interrogation technologies (genomics and transcriptomics) are employed on different model organisms such as mouse or zebrafish. Downstream analysis of the data generated from these experiments involves high performance computing and requires large storage, which can be up to hundreds of gigabytes in size for a single project.

To optimise their investigation into heart development, the R@CMon team has deployed a dedicated Decoding Heart Development and Disease (DHDD) server on the Monash node of the NeCTAR Research Cloud infrastructure, which has now been running for over a year. This has not only provided the group with faster processing speeds in comparison to running jobs on a local desktop, but also an appropriate file storage infrastructure with persistent storage for files that are regularly accessed during analysis. Through VicNode, the group has been given vault storage for archiving completed results for their various research projects. With the assistance of R@CMon, the group has been able to easily add users to the server as it continues to grow with new members and local collaborators.

Web interface for the Trawler web service.

In addition to the DHDD server, the R@CMon team also assisted the Ramialison Group in deploying a dedicated cloud server that has been used to develop the Trawler motif discovery tool web service. The implementation of this tool allows the group to quickly and easily analyse next-generation sequencing data and identify overrepresented motifs, which has led to a manuscript that is currently in preparation. The Ramialison Group envisages future developments of similarly simple, easy-to-use bioinformatics analysis tools through R@CMon.
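To illustrate what "overrepresented" means in this context, the sketch below ranks fixed-length k-mers by how much more frequent they are in a set of sample sequences than in a background set. It is a deliberately naive illustration and not Trawler's algorithm; the function names, the pseudocount and the toy sequences are assumptions.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(sample, background, k, pseudocount=1.0):
    """Rank sample k-mers by the ratio of sample to background frequency."""
    s_counts = kmer_counts(sample, k)
    b_counts = kmer_counts(background, k)
    s_total = sum(s_counts.values())
    b_total = sum(b_counts.values())
    vocabulary = len(set(s_counts) | set(b_counts))
    scores = {}
    for kmer, count in s_counts.items():
        # The pseudocount avoids division by zero for k-mers absent from the background.
        s_freq = (count + pseudocount) / (s_total + pseudocount * vocabulary)
        b_freq = (b_counts[kmer] + pseudocount) / (b_total + pseudocount * vocabulary)
        scores[kmer] = s_freq / b_freq
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sequences made up for this example: "GATAAG" is shared by both sample sequences.
sample = ["TTGATAAGTT", "CCGATAAGCC"]
background = ["TTTTCCCCTT", "CCCCTTTTCC"]
ranked = enriched_kmers(sample, background, k=6)
```

Real motif discovery tools go further, for example by allowing degenerate positions and by assessing statistical significance, but the underlying question of sample frequency versus background frequency is the same.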

Histone H3.3 Analysis on R@CMon

The Epigenetics and Chromatin (EpiC) Lab at Monash University is working on understanding how mutations in certain chromatin factors promote the formation of brain tumours. This project involves the generation and analysis of high-throughput sequencing data of chromatin modifications and remodellers in normal and mutated cells. The sequencing is carried out at the MHTP Medical Genomics Facility and the resulting datasets are then imported into the analysis workflow running on the Monash node (R@CMon) of the NeCTAR Research Cloud. The sequencing reads are first aligned to the repetitive fraction of the genome using a script developed by Day et al. (Genome Biology 2010) to determine enrichment at repeats. Sequencing reads are then aligned to the genome using Bowtie. The resulting files are filtered for quality, poor matches and PCR duplicates using customised Perl scripts. The filtered files are then imported into SeqMonk for further analysis.
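The lab's filtering is done with its own Perl scripts, which are not shown here. As a rough idea of what such a filtering step involves, the following Python sketch drops unmapped reads, reads below a mapping-quality threshold, and reads that share a reference, position and strand with one already seen. It assumes the alignments are in SAM format, and the threshold, the duplicate rule and the example records are assumptions for illustration; dedicated tools treat duplicates more carefully.

```python
def filter_sam_lines(lines, min_mapq=20):
    """Yield SAM header lines plus reads that are mapped, confident and unique."""
    seen = set()
    for line in lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        pos, mapq = int(fields[3]), int(fields[4])
        if flag & 0x4:  # SAM flag bit 0x4: read is unmapped
            continue
        if mapq < min_mapq:  # low-confidence alignment
            continue
        # Naive duplicate rule: same reference, start position and strand.
        key = (rname, pos, bool(flag & 0x10))
        if key in seen:
            continue
        seen.add(key)
        yield line


# Made-up records: r2 duplicates r1, r4 has low MAPQ, r5 is unmapped.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t100\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t200\t5\t4M\t*\t0\t0\tACGT\tIIII",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_sam_lines(example))
```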

Overlap analysis using SeqMonk

This allows for rapid visualisation of individual aligned reads across the entire genome. The inbuilt MACS peak caller is used for first-pass peak calling. A selection of peaks is then validated in the lab by ChIP-qPCR experiments and peak-calling parameters can be adjusted based on these results. Overlap analysis with regions of interest can be performed in SeqMonk. Aligned sequence files are converted to BigWig format using customised Perl scripts and uploaded onto the NeCTAR Object Storage (Swift), from which they can be loaded seamlessly into the UCSC Genome Browser for visualisation and further investigation. Once the sequence files are uploaded to the object storage, they can be easily compared against public ENCODE datasets and UCSC genomic annotations to identify any potentially interesting correlations.

Aligned sequence visualisation using the UCSC Genome Browser.

The R@CMon team and the Monash Bioinformatics Platform supported the EpiC Lab by deploying a dedicated analysis instance on the NeCTAR Research Cloud based on the training environment first developed for the BPA-CSIRO Bioinformatics Training Platform. The open access and reusability of the training platform means it can be easily readapted to various analysis workflows. The R@CMon team and the Monash Bioinformatics Platform will continue to engage with the EpiC Lab as they grow and scale their analysis workflow on the NeCTAR Research Cloud.

Interferome on R@CMon

Interferons (IFNs) were identified as antiviral proteins more than 50 years ago. However, their involvement in immunomodulation, cell proliferation, inflammation and other homeostatic processes has since been identified. These cytokines are used as therapeutics in many diseases such as chronic viral infections, cancer and multiple sclerosis. These IFNs regulate the transcription of approximately 2000 genes in an IFN subtype, dose, cell type and stimulus-dependent manner.

Interferome Wordle

Interferome is an online database of IFN-regulated genes. The database is a valuable resource for biomedical researchers, being regularly used by scientists from across the world. This database of IFN-regulated genes is an attempt at integrating information from high-throughput experiments to gain a detailed understanding of IFN biology. Interferome enables reliable identification of an individual Interferon Regulated Gene (IRG) or IRG signatures from high-throughput data sets (e.g. microarray and proteomic data). It also assists in identifying regulatory elements, chromosomal location and tissue expression of IRGs in humans and mice.

Interferome Database Statistics

The R@CMon team assisted Prof. Paul Hertzog and the Centre of Innate Immunity & Infectious Diseases at MIMR-PHI in migrating versions 1.0 and 2.0 of the Interferome online database to the NeCTAR Research Cloud. Interferome Version 2.0 has quantitative data, more detailed annotation and improved search capabilities, and can be queried for one gene or for thousands, as in a gene list from a microarray experiment. To ensure availability of data and assist researchers with hypothesis generation and novel biological discoveries, the Interferome database is backed by VicNode Collection 2014R9.06. More information about Interferome is available on the help page.