
ASPREE information systems for genomic medicine using targeted panel sequencing

Precision medicine and genomics hold great potential for improved detection of cancer, particularly through the targeted DNA sequencing of genes that indicate the risk of developing cancer in the future. In 2015, realising this potential meant tackling the technologically complex task of combining advanced genomics analysis with extensive medical (phenotypic) health data. The research domain wasn’t there yet. It was still exploring, and here’s the part we played with Australia’s largest clinical trial.

Associate Professor Paul Lacaze is the head of the Public Health Genomics Program within the School of Public Health and Preventive Medicine, Monash University. Since 2015 this program has formed an integral part of the ASPREE study1 and the ASPREE Healthy Ageing Biobank2. Their strategy was to partner with genomic sequencing facilities from across the globe, each of which brought distinct expertise to the challenge of sequencing thousands of ASPREE participants for precision medicine applications. Each partnership has since been on a mission to integrate the study’s phenotypic and clinical outcome data with its novel techniques to understand the role of genetics in healthy ageing and disease.

Hence our multiple global research collaborations needed an environment where they could discover how to join sensitive and big data such that insights could emerge. Solving such multi-disciplinary techno-social problems is the bread & butter of the digital cooperatives group within the Monash eResearch Centre (MeRC). We activated a hybrid of HPC-like and cloud resources appropriate for the active merging of sensitive clinical trial data with the targeted DNA sequencing data to solve this problem. The researchers explored a range of computational tools & techniques that were in themselves still an experiment. Learnings from engagements like this inform the processes and procedures we have today. Most pertinently, however, we took those communities and their respective organisations through the journey, and they now enjoy low barriers to generating impacts from these collaborations.

One of these research-led sequencing techniques is “Targeted Sequencing” (the Super Panel). The program collaborated with the Icahn School of Medicine at Mount Sinai to design a panel of around 700 distinct genes covering the following gene groups: Cancer Genes, Cardiovascular Genes, PGX Genes, ACMG 56 Genes, Resilience Genes and Maturity Onset Diabetes of the Young (MODY) Genes.

From a logistical point of view, one of the advantages of targeted sequencing is its smaller storage footprint compared to whole-genome sequencing (which can be ten times bigger in terms of file size). The Mount Sinai group sequenced 13,000 ASPREE samples over several months, generating ~30TB of sequence alignment files (BAMs) and variant call files (VCFs). The first task (back then) was to establish a secure channel for transferring the data to Monash for both storage and downstream analysis. Paul quickly identified that he did not have a tool or service readily available for this task, and that’s when he engaged with MeRC. The Research Cloud at Monash (R@CMon) and digital cooperatives teams within MeRC provided the solution to address the project’s data transfer, storage and computational requirements (see Figure 1 below). We have since co-operated this infrastructure with them.

Figure 1. ASPREE Targeted Panel Ecosystem

In addition to collaborating with the Public Health Genomics Program, and through it with clinical genomics leaders from across the globe, we collaborated with a software vendor. Together with BC Platforms, we designed a custom-built information system that meets the requirements of both clinical and genomics data processing. The digital cooperatives team provided hosting through the Research Cloud and Research Data Storage. We configured the analysis servers deployed on the Research Cloud with bioinformatics tools for processing the genomics data. Additionally, we deployed three core commercial products from BC Platforms: BC Genome, a secure online database (data warehousing system) for storing and dynamically analysing genotype and phenotype data; BC Safebox, a secure remote desktop environment for controlled access and collaborative research management; and BC Predict, a web service for variant interpretation, curation and reporting, designed for clinical and medical researchers working on pathogenicity. Figure 2 below shows an example of the variant curation interface in BC Predict.

Figure 2. Variant Curation in BC Predict

The digital cooperatives team deployed BC Platforms and the surrounding environment in a manner appropriate for sensitive genomics information. In collaboration with Monash University’s central IT (eSolutions), we contracted an external security penetration testing service to assess the deployment for handling sensitive information without losing the inherent scalability and configurability of the Monash Research Cloud. A high-level components diagram of the ASPREE genomics information system is shown in Figure 3.

Figure 3. ASPREE Genomics Information Systems Components

After four years of operations, the genomics system continues to foster close collaborations with national and international research communities, producing high-impact research outcomes along the way3 4 5. The R@CMon team is excited about supporting the ASPREE Genomics team as it scales up its research endeavours.

This article can also be found, published Creative Commons, here 6.

Breast Cancer Knowledge Online

The history

Even disruptive research tools created as recently as 10 years ago, and yet fundamental to improving human interactions with information and computers, are susceptible to the onslaught of cyber security threats that exist today! Sometimes, all the research fraternity needs is access to a small amount of skilled engineering (both crowd-sourced and research software engineers) to make the small changes needed to keep such research infrastructure robustly safe. For community-focused research, the longevity of the solution is very important. Yet research prototypes quite often use open source software which, if not updated, can attract security risks.

The technical team at R@CMon stays vigilant to ensure the research prototypes produced by research projects remain usable and useful to their communities even after the research part of a project is complete. A good example of such an impactful, long-lived research prototype is Breast Cancer Knowledge Online, which has survived many years of use thanks to the hard work of the researchers supported by the R@CMon team.

Professor Frada Burstein, Department of Human Centred Computing, Monash Data Futures Institute, Victorian Heart Institute (VHI)

The Monash Faculty of IT initiative led by Professors Frada Burstein and Sue McKemmish, in collaboration with BreastCare Victoria and the Breast Cancer Action Group, developed a comprehensive online portal of information pertinent to those facing serious health issues related to breast cancer. This work was supported by Australian Research Council and philanthropic funding (Linkage Grant (2001-2003), Discovery (2006-2009), Telematics Trust (2010, 2012), and the Helen Macpherson Smith Trust (2011)), resulting in three consecutive implementations of this unique smart health information portal. The full project team is listed on the portal’s “Who We Are” page.

The research focussed on the role of personalised searching and retrieval of information, where, for example, the needs and preferences of women with breast cancer and their families change over the trajectory of their condition. In contrast, a web search bar 10 years ago was generic, with very little situational awareness of the person searching. The resultant tool, Breast Cancer Knowledge Online (BCKOnline), empowers the individual user to determine the type of information that will best suit her needs at any point in time. The BCKOnline portal uses metadata-based methods to present users with a quality score for data from other public resources, carefully curated by breast cancer survivors and other well-informed domain experts. The portal’s metadata descriptions also describe resources in terms of attributes like Author, Title and Subject. A summary of each information resource and a quality report are also provided. The quality report explains where the information came from and who wrote it, so the woman can decide if she ‘trusts’ the source.

The portal’s underlying technical infrastructure utilises open source solutions and has been released to the public in two distinct versions (see Figures 1a and 1b for the interfaces of the BCKOnline personalised search).

Figure 1a – BCKOnline personalised search (version 2)

The 2009 paper 1 describes the solution as a paradigm shift in the provision of quality health information, specifically for women and their families affected by breast cancer. BCKOnline has served more than 100,000 personalised searches across its more than 1,000 curated quality resources. It has been a valuable resource for teaching information management students the process and value of metadata cataloguing. More about this research can be found in these papers 2 3 4 5 6 7 8.

Figure 1b. BCKOnline’s personalised search based on user profiles (version 3).

The search results page example is shown in Figure 2 below.

A few years later

Nine years on (in 2019), the maintainers of BCKOnline, led by Dr Jue (Grace) Xie, whose PhD was also connected to the portal development, reached out to the Research Cloud at Monash team (R@CMon) seeking assistance to migrate BCKOnline from its legacy infrastructure to a modern cloud environment with contemporary security controls. Through the ARDC Nectar Research Cloud [2], a new hosting server was deployed for the revamped BCKOnline. Our team walked Frada and Grace through the standard operating procedure to migrate the application to its new home on the research cloud, where Frada and her team have full transparency and control over the application’s lifecycle. The revamped BCKOnline includes a host of security best practices for digital research infrastructure, such as a long-term-support operating system and proper SSL termination in the web server.

Figure 2. BCKOnline search results, showing a curated list of resources with additional filtering options.

Another step in security best practices for research applications

Recently, Monash University Cyber Risk & Resilience (the CISO’s office) and our teams embarked on a journey to uplift the security profile of all applications on our Research Cloud infrastructure. It is a strategic step change in the University’s expectations regarding security best practices. In partnership with Bugcrowd, the Research Cloud at Monash participates in a Vulnerability Disclosure Program (VDP), under which all applications are regularly scanned for active threats and vulnerabilities. Bugcrowd are novel in that they vet what is essentially a crowd-sourced team of cyber security engineers. When vulnerabilities are identified, we invoke a standard operating procedure, cognisant of research practice and culture, to address the issues. This procedure includes end-to-end communication and coordination between the security team, the Research Cloud team and the affected service owners (the chief investigators).

In a recent security scan, we discovered that the BCKOnline portal was vulnerable to Cross-Site Scripting (XSS), a method often used by bad actors to conduct attacks such as phishing, temporary defacement, user session hijacking and the possible introduction of worms. Typically these vulnerabilities are quick for a research group to fix (a handful of hours or at most days), and our evidence suggests researchers are motivated to fix them quickly to ensure their systems stay both alive and reputedly safe.

Fixing this vulnerability was complicated by commonplace research realities. The original developers were no longer available (the PhD students had long moved on), and the source code for the affected part of the application was not in a version control system. After some time and a bit of detective work, the R@CMon team managed to recover the original source and upload it into a private GitLab. With that complexity solved, the next step was to apply a fix for the XSS vulnerability. Realising the R@CMon DevOps team had neither the expertise nor the capacity to fix the problem, we attempted to outsource it to professional contractors. However, after two false starts, a new approach was taken: the R@CMon team reached out to another team within the Monash eResearch Centre. The Software Development (SD) team brings an extensive array of software development expertise and best practices, including DevOps and security practices, which have been vital assets for this software engineering activity. We effectively crowd-sourced the remediation work to this team (where individuals pick which cases work for them, and are appropriately rewarded for work they do in their own time).

Simon Yu, a veteran developer within the software development team, pinpointed the actual source of the vulnerability in the code. He then quickly implemented a fix by creating a custom “filter” and “interceptor”. The resultant fix is efficient both in its load on the computing resource and in its ability to protect other parts of the BCKOnline application with little or no effort from the research group. Now any incoming request (e.g. user input, searches) passes through the filter and interceptor first, which validate its payload before it is processed by the BCKOnline search engine. This ensures that only legitimate payloads are processed. We additionally placed the BCKOnline portal URL (https://bckonline.erc.monash.edu/) behind a web application firewall (WAF) managed by the Monash Cyber Risk and Resilience team. This provides an additional layer of security, as all incoming traffic (payloads) is first sanitised by the WAF before being forwarded to the actual server. The original security advisory has since been resolved and the BCKOnline portal is back serving the online community with their personalised health searches.
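The actual BCKOnline fix is Simon’s custom Java filter and interceptor, which we don’t reproduce here. As a rough illustration of the same pattern, the following Python sketch validates and escapes request parameters before they would reach a search engine; the pattern list and function names are hypothetical, not BCKOnline’s code.

```python
import html
import re

# Patterns that commonly indicate an XSS attempt in a query parameter.
# This list is illustrative only; a production filter would be broader.
SUSPICIOUS = re.compile(r"<\s*script|javascript:|on\w+\s*=", re.IGNORECASE)

def validate_payload(params: dict) -> dict:
    """Return a sanitised copy of the request parameters,
    raising ValueError for obviously malicious input."""
    clean = {}
    for key, value in params.items():
        if SUSPICIOUS.search(value):
            raise ValueError(f"rejected suspicious parameter: {key}")
        # Escape anything that could be interpreted as markup when the
        # value is echoed back into the search results page.
        clean[key] = html.escape(value)
    return clean
```

The point of intercepting at one choke point, as the BCKOnline filter does, is that every request path through the application inherits the protection without per-page changes.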

This article can also be found, published created commons here 9.

Revisiting the next generation of StockPrice infrastructure

For many facets of our lives, long-term public good relies on a healthy tension between competition and stability. The age of digital disruption has profoundly changed the nature of competition in financial markets, to the extent that regulation has not always been adequate to ensure stability. Associate Professor Paul Lajbcygier and his colleague Rohan Fletcher from the Monash Business School are custodians of a longitudinal study seeking to understand when stability has been superseded by innovations. To their surprise, the recent Nectar Research Cloud refresh has caused a digital disruption to their own research, reducing analysis time from many months to one week, and in turn changing the focus of research.

How deep does the digital disruption rabbit hole go? We asked Paul and Rohan to tell us about it…

“With the advent of the IT revolution, financial exchanges have changed beyond recognition. With the Centre’s help, we have focused our research on how the digital disruption has affected financial markets, considered welfare implications, and potential regulatory changes, with ramifications for regulators, traders, superannuants and all equity market stakeholders.”

Associate Professor Paul Lajbcygier

Recently, the Monash eResearch Centre has supported Paul’s team by providing the new essential hardware and infrastructure necessary to interrogate the vast data generated by the Australian equity markets.

“Without the Centre’s help, our research would be impossible”.

Associate Professor Paul Lajbcygier

To understand how the refreshed Research Cloud would affect the team, they benchmarked the new hardware against their data. They reran MySQL database code which searches vast amounts of ASX stock data in order to understand the costs of stock trading using new, innovative price impact models. That prior work led to an A* publication in 2020 in the Journal of Economic Dynamics and Control 1.

This analysis interrogates over 1,300 stocks and their related trades and orders from 2007 to 2013, representing three terabytes of data.

“In order to implement this huge processing task, we have automated the breakdown of these MySQL queries by stock, year and month. This generates over 80,000 SQL scripts, which is a total of 3 gigabytes of SQL analysis code alone.”

“With the latest hardware provided by the Monash eResearch Centre, this query took around one week, in contrast to the many months of running time required prior to the provision of the latest hardware.”

Associate Professor Paul Lajbcygier

The architecture of the embarrassingly parallel Stock Price infrastructure hosted on the Nectar Research Cloud consists of a large-memory machine running Ubuntu LTS, and a number of smaller staging and analysis machines, also based on Ubuntu LTS and Windows.

The project uses MySQL and bash scripts to facilitate the templating of SQL jobs. Scripts are generated on the small VM and then moved to the big memory machine for execution. The number of scripts running on the big memory machine is monitored and held at a pre-set maximum; on completion of a script on the MySQL database, the next available script is launched automatically.

Testing of a single script may be performed on the staging database after some automatic modifications have been made to make it compatible for individual execution.

Below is an example chart comparing the execution time for an SQL script when utilising each of the four underlying database storage technologies now available on a big memory machine on the Monash node of the Research Cloud: MySQL’s Memory storage engine, a RAM drive, a flash drive and a mounted volume (a separate Ceph cluster via RoCE). Clearly, the use of the Memory engine (blue line) provides the best performance.

“Repeating our published benchmark, the MySQL memory engine is approximately forty times better than the flash drive and mounted volume. This outstanding Memory engine performance occurs because the memory engine is internal to MySQL, thereby avoiding input/output lags required of the file system.”

Associate Professor Paul Lajbcygier

Stock Price search performance based on storage backends.

The team then sought observables to explain why the memory approach performed so well. Below is an example of system load, recorded using Ganglia, while running the same SQL query. On the left is the Memory engine CPU load, followed by examples of the Flash Drive, Mounted Volume and RAM Drive CPU loads.

“It is possible to see that the Memory engine utilises all 120 CPU processes consistently, in contrast to the right hand graph which shows other memory methods which do not efficiently utilise the new hardware and incur overheads due to the requirement that they must use the file system.”

Associate Professor Paul Lajbcygier

In addition to fine-tuning MySQL for the specialist hardware (what they named the “Big Memory Machine”), the benchmarking necessitated the integration of a bespoke Microsoft Windows ecosystem of tools. They used the open source tool HeidiSQL to both visualise and automate the decomposition of the analysis problem into 120 parallel-executing SQL scripts.

Parallel Stock Price SQL executions.

To summarise, we asked Paul about the overall impact of the revamped Stock Price infrastructure on answering their research questions.

“We’re able to utilise data analysis on a more comprehensive data set including the ASX and the US NASDAQ, perform rapid prototyping with quick feedback; and complete analyses that would be intractable using the previous infrastructure.”

Associate Professor Paul Lajbcygier

This article can also be found, published Creative Commons, here 2.

Triple Mechanism Cognitive Impulsivity Battery

Professor Antonio Verdejo-Garcia and colleagues from the Verdejo-Garcia Laboratory at the Turner Institute for Brain and Mental Health engaged a local video game developer, Torus Games, to write computer games that are used to better understand impulsivity (a common symptom of substance addictions, obesity and eating disorders). We helped migrate this application to the Research Cloud, where barriers to infrastructure scaling, reuse and appropriate data governance have been removed. Back in 2018 the lab asked for advice on publishing apps online. It turned out the applications are a battery of interactive web applications designed to measure the cognitive impulsivity of their users. This project was supported by an ARC Linkage Project (LP150100770), which aimed to study and measure the cognitive skills that can produce (or avoid) impulsive human behaviour.

The project engaged the third-party developer (Torus Games) to develop the games (ahem, cognitive applications) to a pre-production level using Google’s Firebase platform (which makes it easier for developers to build iOS, Android and web apps). Our job was to help the lab migrate the application into the Nectar Research Cloud at Monash (R@CMon), which also meant mapping a pathway away from Firebase. The project and its data custodians now have full governance over the data captured by the applications as part of their data collection activities.

The resulting application suite, the “Cognitive Impulsivity Suite” (CIS), is a series of connected services. The web app itself is a Unity-based WebGL build with a RESTful API on an ASP.NET backend. The app stores user observations (generated “measures”) from the “trials” in a relational database backend. A Windows-based server is required to host the .NET-based application using the Internet Information Services (IIS) web server. R@CMon provided the required cloud resources and web configuration (.NET, IIS) to migrate the CIS application suite from Google Firebase. A high-level pipeline diagram of the CIS deployment is shown below.
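The real CIS backend is ASP.NET hosted on IIS; purely to illustrate the shape of the store step, here is a minimal Python sketch in which a POSTed JSON “measure” is validated and written to a relational backend. The schema and field names are hypothetical, not CIS’s actual data model.

```python
import json
import sqlite3

# Hypothetical schema: one row per observation generated by a trial.
SCHEMA = """CREATE TABLE IF NOT EXISTS measures (
    participant_id TEXT NOT NULL,
    trial TEXT NOT NULL,
    measure REAL NOT NULL
)"""

def store_measure(conn: sqlite3.Connection, payload: str) -> int:
    """Validate one JSON observation, insert it, and return the total
    number of stored measures."""
    record = json.loads(payload)
    for field in ("participant_id", "trial", "measure"):
        if field not in record:
            raise ValueError(f"missing field: {field}")
    conn.execute(
        "INSERT INTO measures VALUES (?, ?, ?)",
        (record["participant_id"], record["trial"], float(record["measure"])),
    )
    return conn.execute("SELECT COUNT(*) FROM measures").fetchone()[0]
```

Keeping validation at the API boundary, as the CIS backend does, means the relational store only ever contains well-formed observations, which simplifies separating participant data per project.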

The Cognitive Impulsivity Suite (CIS) pipeline on the Monash Research Cloud

Three years on, the lab is currently conducting five major projects using the CIS infrastructure on the Research Cloud. The largest of these studies assesses impulsivity in more than 1,000 US and Australian help-seeking and anonymous participants with drug, alcohol and gambling problems. Two articles 1 2 have recently been published from the various impulsivity studies. These research outcomes have utilised the full capabilities of the Cognitive Impulsivity Suite on the Research Cloud.

The research cloud has enabled us to reliably collect and store cognitive impulsivity data for thousands of participants around the world. We have been able to run several instances of the CIS task, related to different projects at one time. This has been fundamental to the success of each research project and the ability to efficiently separate participant data. We are grateful for the ongoing support we have received from the research cloud team. They have quickly responded to our needs and have offered valued solutions and technical support to enable each of our projects to run smoothly.

Alexandra Anderson, Addiction and Impulsivity Research (AIR) Lab, Turner Institute for Brain and Mental Health, Monash University.

This article can also be found, published Creative Commons, here 3.

Secure safehaven for the ASPREE clinical trial – The need

The year 2017 began with the ASPREE Data Management Team seeking advice on an emerging need for a collaborative sensitive-data analysis environment. Back then, HPC and clouds were very much the realm of categorically non-sensitive data, and secure (“red-zoned”) systems were very much the realm of categorically non-collaborative data. This bifurcation was rife in health and the social sciences. This Research Cloud engagement with ASPREE was seminal work in transitioning the Monash research environment towards a continuum (rather than a bifurcation) between collaboration and sensitivity expectations.

The ASPREE team had a single commodity physical PC located at the ASPREE office in the School of Public Health and Preventive Medicine. Despite the ASPREE team streamlining processes to appropriately allow project collaborations (an innovation in its own right), collaborators could only perform analysis by being physically present in the office. The protocol required data custodians to copy ASPREE phenotypic datasets (via USB sticks) onto the PC, while also physically disconnecting the Ethernet cable to ensure no unintended access. Collaborators would fly into Melbourne just to run their analyses. This logistically taxing workflow made collaboration hard and significantly delayed research outcomes. As new project requests emerged from ASPREE sub-studies, it became apparent that the data management, data governance and analysis ecosystem would need to be revamped to support the growing demand. A scalable and secure “safe haven” was required.

The Research Cloud team approached the situation from a pragmatic point of view, first critiquing the scalability of the analysis environment. We discovered the environment would require security hardening to protect against intentional and unintentional data leakage. Furthermore, the interface needed improvement to become intuitive for non-academic, external and international collaborators.

Figure 1. A typical ASPREE Analysis Environment

Fortunately Leostream, a remote desktop (VDI) scheduling platform, was already being used by virtual laboratories on the Monash zone of the Research Cloud. Leostream provides a high-level interface for allocating remote desktops to users, and allows access to these remote desktops through a web-based (HTML5) viewer. To scale out the analysis environment, the team deployed a number of Windows-based instances on the Monash zone of the Research Cloud. These instances are pre-configured with analytical tools chosen by the ASPREE community (e.g. R/RStudio, SAS, SPSS) and connected to the Monash licence servers. A typical analysis environment is shown in Figure 1 above. Access is managed through the Monash Active Directory (domain), and each analysis instance is configured with Group Policy Objects (GPOs). These GPOs enforce a number of rules or security controls inside the instances, e.g. preventing users from changing desktop settings, accessing registry tools and much more. A reserved set of hypervisors hosts these secure instances, which also reside on a segregated private network. Hyper-threading has been turned off on the hypervisors to minimise the risk of Spectre/Meltdown-type vulnerabilities.

Figure 2. ASPREE safehaven architecture

A high-level architecture diagram of the ASPREE safe haven is shown in Figure 2. Monash eResearch Centre’s Research Data Storage (RDS) provides a scalable storage backend for the safe haven. The team augments the storage pool with further controls to appropriately segregate the data. A dedicated user share is created for each approved ASPREE user and automatically mounted into the analysis environment upon user login. ASPREE data custodians (managers) have elevated rights to the safe haven storage: they can review (approve or deny) what data goes in (ingress) and what data goes out (egress). Thus the technology and workflow automate the overall data governance of the ASPREE clinical trial by incorporating it into their own access management system (AMS).

Now operational for more than three years, the Research Cloud at Monash and Helix teams cooperate to provide user support for the ASPREE safe havens. Several other registries and clinical trials have leveraged this ASPREE solution as their own safe haven. To date, more than 100 internal and international collaborators have used the ASPREE safe haven. This work has been foundational to Monash eResearch, the Research Cloud and Helix’s initiatives towards next-generation safe havens (e.g. SeRP, which further automates and audits generalised governance workflows).

“The ASPREE data is an NIH-supported clinical trial, and the NIH rightly demands full accountability for data handling. The team has been understanding, professional, flexible and fast. They gave extra consideration for ASPREE’s urgent need (in 2016-17) to share our large and unique dataset to collaborators, whilst also supporting confidentiality in an active clinical trial. The co-design approach took into consideration our Data Manager’s detailed requirements and produced an excellent environment for effective use and international collaboration centred on ASPREE data. The successfully funded extension study ASPREE-XT depended on getting this right.”

Dr Carlene Britt, ASPREE Senior Research Manager and ASPREE Data Custodian

This article can also be found, published Creative Commons, here 1.