Author Archives: Jerico Revote

Secure safehaven for the ASPREE clinical trial – The need

The year 2017 began with the ASPREE Data Management Team seeking advice on an emerging need for a collaborative sensitive analysis environment. Back then HPC and clouds were very much the realm of categorically non-sensitive data, and secure (“red zoned”) systems were very much the realm of categorically non-collaborative data. This bifurcation was rife in health and social sciences. This Research Cloud engagement with ASPREE was seminal work to transition the Monash research environment towards a continuum (rather than bifurcation) between collaboration and sensitive expectations.

The ASPREE team had a single commodity physical PC located at the ASPREE office in the School of Public Health and Preventive Medicine. Despite the ASPREE team streamlining processes to appropriately allow project collaborations, an innovation in its own right, collaborators could only perform analysis by being physically in the office. The protocol required data custodians to copy ASPREE phenotypic datasets (via USB sticks) onto the PC, whilst also physically disconnecting the ethernet cable to ensure no unintended access. Collaborators would fly into Melbourne just to run their analysis. This logistically-taxing workflow made collaboration hard and significantly delayed research outcomes. As new project requests emerged from ASPREE sub studies, it became apparent that data management, data governance and the analysis ecosystem would need to be revamped to support the growing demand. A scalable and secure “safe haven” was required.

The Research Cloud team approached the situation from a pragmatic point of view. The team first critiqued the scalability of the analysis environment. We discovered the environment would require security-hardening to protect against intentional and unintentional data leakage. Furthermore, the interface needed improvement to become intuitive for non-academic, external and international collaborators.

Figure 1. A typical ASPREE Analysis Environment

Fortunately Leostream, a remote desktop (VDI) scheduling platform, was already being used by virtual laboratories on the Monash zone of the Research Cloud. Leostream provides a high-level interface for allocating remote desktops to users. It also allows access to these remote desktops through a web-based (HTML5) viewer. To scale out the analysis environment, the team deployed a number of Windows-based instances on the Monash zone of the Research Cloud. These instances have been pre-configured with analytical tools chosen by the ASPREE community (e.g. R/RStudio, SAS, SPSS) and connected to the Monash license servers. A typical analysis environment is shown in Figure 1 above. Access is managed through the Monash Active Directory (Domain) and each analysis instance has been configured with Group Policy Objects (GPOs). These GPOs enforce a number of rules or security controls inside the instances, e.g. preventing users from changing desktop settings or accessing registry tools, and much more. A reserved set of hypervisors hosts these secure instances, which also reside on a segregated private network. Hyper-threading has been turned off on the hypervisors to minimise the risk of Spectre/Meltdown-type vulnerabilities.

Figure 2. ASPREE safehaven architecture

A high-level architecture diagram for the ASPREE safehaven is shown in Figure 2. Monash eResearch Centre’s Research Data Storage (RDS) provides a scalable storage backend to the safehaven. The team augments the storage pool with further controls to appropriately segregate the data. A dedicated user share is created for each approved ASPREE user. This user share is automatically mounted into the analysis environment upon user login. ASPREE data custodians (managers) have elevated rights to the safe haven storage. They can review (approve or deny) what data goes in (ingress) and what data goes out (egress). Thus the technology and workflow automate the overall data governance of the ASPREE clinical trial by incorporating it into their own access management system (AMS).
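The custodian review step described above can be pictured as a small approval queue. The following is a minimal sketch only, assuming a simple in-memory model; the class names (`SafeHaven`, `TransferRequest`), the user names and the file name are hypothetical and are not taken from the ASPREE access management system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Direction(Enum):
    INGRESS = "ingress"  # data moving into the safe haven
    EGRESS = "egress"    # data leaving the safe haven


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class TransferRequest:
    """A hypothetical request to move one file in or out."""
    user: str
    filename: str
    direction: Direction
    status: Status = Status.PENDING
    reviewed_by: Optional[str] = None


@dataclass
class SafeHaven:
    """Illustrative model: only custodians may approve or deny transfers."""
    custodians: Set[str]
    requests: List[TransferRequest] = field(default_factory=list)

    def submit(self, user: str, filename: str, direction: Direction) -> TransferRequest:
        # Every transfer starts as PENDING until a custodian reviews it.
        request = TransferRequest(user, filename, direction)
        self.requests.append(request)
        return request

    def review(self, reviewer: str, request: TransferRequest, approve: bool) -> TransferRequest:
        # Users without custodian rights cannot change a request's status.
        if reviewer not in self.custodians:
            raise PermissionError(f"{reviewer} is not a data custodian")
        request.status = Status.APPROVED if approve else Status.DENIED
        request.reviewed_by = reviewer
        return request

    def released(self) -> List[str]:
        # Only approved transfers are ever acted on.
        return [r.filename for r in self.requests if r.status is Status.APPROVED]
```

In this sketch nothing crosses the boundary by default: a request stays pending until someone in `custodians` approves it, which mirrors the review of ingress and egress described in the text.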

Now operational for more than 3 years, the Research Cloud at Monash and Helix teams cooperate to provide user support for ASPREE safe havens. Several other registries and clinical trials have leveraged this ASPREE solution as their own safe haven. To date, more than 100 internal and international collaborators have used the ASPREE safe haven. This work has been foundational to Monash eResearch, the Research Cloud and Helix’s initiatives towards the next-generation safe havens (e.g. SeRP, which further automates and audits generalised governance workflows).

“The ASPREE data is an NIH-supported clinical trial, and the NIH rightly demands full accountability for data handling. The team has been understanding, professional, flexible and fast. They gave extra consideration for ASPREE’s urgent need (in 2016-17) to share our large and unique dataset to collaborators, whilst also supporting confidentiality in an active clinical trial. The co-design approach took into consideration our Data Manager’s detailed requirements and produced an excellent environment for effective use and international collaboration centred on ASPREE data. The successfully funded extension study ASPREE-XT depended on getting this right.”

Dr Carlene Britt, ASPREE Senior Research Manager and ASPREE Data Custodian

This article can also be found, published under Creative Commons, here 1.

Kaptive – How novel searches within bacterial genomic data are presented and hosted on R@CMon

Dr. Kelly Wyres is a research fellow in the Holt Lab. Kelly first approached the Research Cloud at Monash team in 2019. She sought assistance to migrate their bioinformatics web application, Kaptive, to Monash infrastructure. Kaptive is a user-friendly tool for finding known loci within one or more pre-assembled genomes, specifically for the identification of Klebsiella surface polysaccharide loci. It presents these results in a novel and intuitive web interface, helping the user to rapidly gain confidence in locus matches. Kaptive was developed and is currently maintained by Kelly Wyres, Ryan Wick and Kathryn Holt at Monash University. It also uses bacterial reference databases that are carefully curated by Kelly Wyres and Johanna Kenyon from Queensland University of Technology.

Wick RR, Heinz E, Holt KE and Wyres KL 2018. Kaptive Web: user-friendly capsule and lipopolysaccharide serotype prediction for Klebsiella genomes. Journal of Clinical Microbiology: 56(6). e00197-18

The R@CMon team provided its standard LAMP platform to host Kaptive on the Research Cloud. This included helping Kelly transition Kaptive from its original web2py mechanism (used to quickly create web applications) to a production-grade LAMP stack including a dedicated web server and storage backend. Now transitioned, the team can efficiently and effectively operate and support Kaptive alongside a critical mass of other domain-specific LAMP-based applications across all disciplines of research. The team also assisted in applying additional security controls (e.g. HTTPS/SSL, reCAPTCHA) on the server to improve its security posture. As a measure of impact, more than 3000 searches (and associated computing jobs) have been submitted to Kaptive to date. As new reference databases are curated and become ready, they will be incorporated into Kaptive and made available to the research community.

This article can also be found, published under Creative Commons, here 1.

Co-designing clouds for the data future of fintech: the next generation of StockPrice infrastructure

We first discussed the emergence of “big data”, and its impact on computing and storage needs, with Associate Professor Paul Lajbcygier and his team in 2014. The Research Cloud at Monash team’s initial engagement enabled the “Stock Price Impact Models Study” to get off the ground with immediate high-impact research output. A few months later, in 2015, we showcased their incremental update to the study, “Stock Price Impact Models Study on R@CMon Phase 2 (Update)”, which produced another high-impact publication. Then in 2018, Associate Professor Paul Lajbcygier and Senior Lecturer Huu Nhan Duong held the “Monash workshop on financial markets” at Monash University, attracting highly prominent Australian and international researchers to talk about topics such as “market design and quality”; “high frequency trading”; “volatility and liquidity modelling”; and many more.

Pham, Manh Cuong and Duong, Huu Nhan and Lajbcygier, Paul, A Comparison of the Forecasting Ability of Immediate Price Impact Models (September 18, 2015). Available at SSRN: https://ssrn.com/abstract=2515667 or http://dx.doi.org/10.2139/ssrn.2515667

Fast forward to 2020 and, despite the current world and local circumstances, Paul and his team continue to excel in producing more high-impact research outcomes. Their recent successes include a “Journal of Economic Dynamics and Control” publication entitled “The effects of trade size and market depth on immediate price impact in a limit order book market” and an Interfaculty Seeding Grant with the Monash Business School and Faculty of Information Technology to study high frequency trading using machine learning methodologies. There are also numerous research outputs to be submitted towards the end of 2020 and many more towards Q1 of 2021. This surge in high-impact outputs correlates with a recent optimisation in the way big queries are executed on the memory engine of the underlying R@CMon-hosted database.

The speed up compared to previous data runs is around four times. This means we can now use more of the memory in the big memory machine effectively.

Paul Lajbcygier, Associate Professor, Banking & Finance, Monash Data Futures Institute

The R@CMon team is currently preparing for the next round of cloud resources uplift in 2021, where “persistent memory” (e.g. Intel Optane DC) components are being considered for inclusion in the resource pool (flavours) available to research cloud users. This could provide even more substantial speedups to big queries on stock price big data. Once ready, the R@CMon team will engage Paul’s team again to utilise these resources.

This article can also be found, published under Creative Commons, here 1.

iLearn on R@CMon

An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data: the impact of making machine learning good practice readily available to the community.

Associate Professor Jiangning Song is a long-standing user of the Monash Research Cloud (R@CMon). He is the lead of the Song Lab within the Monash Biomedicine Discovery Institute. Jiangning’s journey began with the deployment of the Protease Specificity Prediction Server – PROSPER app in 2014. Since then the lab has launched more than 30 bioinformatics web services, all of which are made available to research communities worldwide.

Their latest contribution, iLearn, addresses key obstacles to the adoption of machine learning applied to sequencing data. Well-annotated DNA, RNA and protein sequence data is increasingly accessible to all biological researchers. However, at the scale of this data it is challenging, if not impossible, for an individual to investigate it manually. Similarly, another obstacle to broad-scale access is that investigation and validation through wet laboratory experiments is time consuming and expensive. Hence, when presented appropriately, machine learning can play an important role in making higher-level biological data accessible to many researchers in the biosciences.

Many of the previous works and tools only focus on a specific step within a data-processing pipeline. The user is then responsible for chaining these tools together, which in most cases is challenging due to incompatibilities between tools and data formats. iLearn has been designed to address these limitations, using common patterns informed by the lab and its collaborators.

An emerging breakdown of the pipeline steps is:

  • Feature extraction
  • Clustering
  • Normalization
  • Selection
  • Dimensionality reduction
  • Predictor extraction
  • Performance evaluation
  • Ensemble training
  • Results visualisation

iLearn packages these steps for use in two ways. Users can use iLearn through an online environment (web server) or as a stand-alone Python toolkit. Whether your interest is in DNA, RNA or protein analysis, iLearn provides a common workflow pattern for all three cases. Users input their sequence data (normally in FASTA format), and then enter various descriptors and parameters for the analysis.
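To make the first steps of such a pipeline concrete, here is a minimal sketch in plain Python of reading FASTA input, extracting one simple descriptor (amino acid composition) and normalising the resulting features. It is an illustration only and does not use or reproduce iLearn's own API; the function names and the choice of descriptor are assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature columns line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return {record_name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            header = line[1:].split()
            name = header[0] if header else "seq{}".format(len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def amino_acid_composition(sequence):
    """Feature extraction: fraction of each standard amino acid."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def min_max_normalise(rows):
    """Normalization: rescale each feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

A typical use would be `features = [amino_acid_composition(s) for s in parse_fasta(text).values()]` followed by `min_max_normalise(features)`; the later steps in the list (selection, dimensionality reduction, training, evaluation) would then operate on that feature matrix.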

The results page shows the various outputs, once again informed by the Lab’s good practices. They can be downloaded from the web server in various formats (e.g. CSV, TSV). High quality diagrams and visualisations are also generated by iLearn within the web server.

Since iLearn’s release, more than 5,000 unique users have used the web server worldwide. The user community and resultant impact continue to grow, with 60 citations since the tool’s seminal publication.

iLearn has been used as an efficient and powerful complementary tool for orchestrating machine-learning-based modelling, which in turn speeds up biomedical discoveries through genomics and data analysis. As new descriptors are developed and optimised, iLearn aims to incorporate these into future releases to further improve its performance, with the R@CMon team providing support to tackle the potential increase in computational and storage complexities.

This article can also be found, published under Creative Commons, here 1.

Monash Business School Financial Markets Workshop

Last April 30 to May 1, Associate Professor Paul Lajbcygier and Senior Lecturer Huu Nhan Duong from the Monash Business School organised a Financial Markets Workshop at Monash Caulfield Campus, bringing in a number of prominent Australian and international market microstructure researchers as well as high-profile high frequency traders and regulators from the US. The workshop covered several research topics such as “market design and quality”; “high frequency trading”; “volatility and liquidity modelling”; “short selling”; “stock market crashes”; “cryptocurrencies”; and the real effect of financial markets on corporate decisions. The R@CMon team has worked with Paul’s group for several years now, supporting their “big data analysis” workflows on the research cloud and enabling them to crunch more data, which contributed to several high-impact publications, ARC grant submissions and the attainment of major SEED funding. The international financial workshop marks the culmination of Paul’s group’s accomplishments in high frequency trading research over the years and serves as a foundation for a future critical mass of research in financial markets. The R@CMon team will continue to support Paul’s group and the Department of Banking and Finance as they work on more high-impact research and tackle the various computational challenges that they may encounter along the journey.

GlycoMine on R@CMon

Glycosylation is a ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes such as cellular communication, ligand recognition and subcellular recognition. It is estimated that greater than 50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive and laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilising this very important PTM.
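For readers unfamiliar with the term, the N-linked sequon is the short motif N-X-S/T, where X is any residue except proline. The sketch below, a plain pattern scan written for illustration, lists candidate sequons in a protein sequence. It only locates the motif; it does not predict whether a site is actually occupied, and it is not the method used by the tool described in this article. The function name and the example sequence are made up.

```python
import re

# N-linked sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping candidates (e.g. "NNSS") all be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_n_linked_sequons(sequence):
    """Return (1-based position of the Asn, motif) for each candidate sequon."""
    sequence = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence)]


# Example with a made-up sequence: "NPS" is skipped because X is proline.
print(find_n_linked_sequons("MNGTNPSANNSS"))
# → [(2, 'NGT'), (9, 'NNS'), (10, 'NSS')]
```

Many candidate sequons are never glycosylated in practice, which is why occupancy prediction needs the richer sequence and structural information that dedicated tools draw on.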

Predicted N-linked glycosylation sites from two case-study proteins using GlycoMine-Struct

Dr. Jiangning Song from the Department of Biochemistry and Molecular Biology at Monash University and his collaborators have designed and developed a bioinformatics tool – GlycoMine-Struct for predicting glycosylation sites. GlycoMine-Struct is a comprehensive tool for the systematic in-silico identification of N-linked and O-linked glycosylation sites in the human proteome. Through R@CMon, a dedicated cloud project with computational and storage resources has been provisioned to develop and host the GlycoMine-Struct tool. The flexible and scalable R@CMon-powered development environment enabled rapid prototyping, testing and re-deployment of the tool.

GlycoMine-Struct Main Page (http://glycomine.erc.monash.edu/Lab/GlycoMine_Struct/index.jsp#Introduction)

GlycoMine-Struct is now a publicly accessible web service, available to the wider research community. Users can now easily submit protein structure input files in PDB (Protein Data Bank) format to perform site prediction on GlycoMine-Struct. Since it went public, GlycoMine-Struct has been accessed and used by thousands of local and international users, and this number is still growing. A Scientific Reports paper has been published, highlighting the collaborative work done to develop GlycoMine-Struct as an essential bioinformatics tool for improving the prediction of human glycosylation sites. The R@CMon team is actively supporting the GlycoMine-Struct project as it continues to serve the research community and develop performance improvements.

XCMSPlus Metabolomics Analysis on R@CMon

At the start of 2017, the R@CMon team had its first user consultation with Dr. Sri Ramarathinam, a research fellow from the Immunoproteomics Laboratory (Purcell Laboratory) at the School of Biomedical Sciences at Monash University. Sri and his group at the lab study metabolomics compounds in various samples by conducting a “search” and “identification” process using a pipeline of analysis and visualisation tools. The lab has acquired the license to use the commercial XCMSPlus metabolomics platform from SCIEX in their workflow. XCMSPlus provides a powerful solution for analysis of untargeted metabolomics data in a stand-alone configuration, which will greatly increase the lab’s capacity to analyse more samples, with faster and easier results generation and interpretation.

XCMSPlus main login Page, entry point of the complete metabolomics platform

During the first engagement meeting with Sri and the lab, it was highlighted that a specialised hosting platform (with appropriate storage and computational capacity) would be required for XCMSPlus. XCMSPlus is distributed as a stand-alone appliance (personal cloud) from the vendor. As an appliance, XCMSPlus has been optimised and packaged to be deployed on a single, multi-core and high-memory machine. An added minor complication was that this appliance was distributed in VMware’s appliance format, which needed to be translated into an OpenStack-friendly format. The R@CMon team provided the hosting platform required for XCMSPlus through the Monash node of the Nectar Research Cloud.

Analysis results and visualisation in XCMSPlus

A dedicated Nectar project has been provisioned for the lab, which is now being used for hosting XCMSPlus. This project also has enough capacity for future expansion and new analysis platform deployments. The now R@CMon-hosted (and supported) XCMSPlus platform for the Immunoproteomics Laboratory is the first custom XCMSPlus deployment in Australia. As it was the first in Australia, some minor issues were encountered during its first test runs. These technical issues were eventually sorted out through collaborative troubleshooting efforts from the R@CMon team, the lab and the vendor. After several months of usage, with hundreds of jobs submitted and processed by XCMSPlus and counting, the lab is continuing to fully integrate it as part of their analysis workflow. The R@CMon team is actively engaging with the lab to support its adoption of XCMSPlus and to plan for future analysis workflow expansions.