Category Archives: MeRC

Revisiting the next generation of StockPrice infrastructure

For many facets of our lives, long-term public good relies on a healthy tension between competition and stability. The age of digital disruption has profoundly changed the nature of competition in financial markets, to the extent that regulation has not always been adequate to ensure stability. Associate Professor Paul Lajbcygier and his colleague Rohan Fletcher from the Monash Business School are custodians of a longitudinal study seeking to understand when stability has been superseded by innovations. To their surprise, the recent Nectar Research Cloud refresh has caused a digital disruption to their own research, reducing analysis time from many months to one week, and in turn changing the focus of research.

How deep does the digital disruption rabbit hole go? We asked Paul and Rohan to tell us about it…

“With the advent of the IT revolution, financial exchanges have changed beyond recognition. With the Centre’s help, we have focused our research on how the digital disruption has affected financial markets, considered welfare implications, and potential regulatory changes, with ramifications for regulators, traders, superannuants and all equity market stakeholders.”

Associate Professor Paul Lajbcygier

Recently, the Monash eResearch Centre has supported Paul’s team by upgrading the essential hardware and infrastructure needed to interrogate the vast data generated from the Australian equity markets.

“Without the Centre’s help, our research would be impossible.”

Associate Professor Paul Lajbcygier

To understand how the refreshed Research Cloud would affect the team, they benchmarked the new hardware against their data. They reran MySQL database code which searches vast amounts of ASX stock data in order to understand the costs of stock trading using new, innovative price impact models. That prior work led to an A* publication in 2020 in the Journal of Economic Dynamics and Control 1.

This analysis interrogates over 1300 stocks and their related trades and orders from 2007 to 2013, representing three terabytes of data.

“In order to implement this huge processing task, we have automated the breakdown of these MySQL queries by stock, year and month. This generates over 80,000 SQL scripts, which is a total of 3 gigabytes of SQL analysis code alone.

“With the latest hardware provided by Monash eResearch Centre, this query took around one week, in contrast to the many months of required running time prior to the provision of the latest hardware.”

Associate Professor Paul Lajbcygier
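
The breakdown by stock, year and month described in the quote above could be sketched as follows. This is a minimal illustration only, not the team's actual tooling (the source says they use MySQL and bash scripts): the query text, the `trades` table, its column names and the file naming scheme are all assumptions made for the example.

```python
from itertools import product
from pathlib import Path
from string import Template

# Hypothetical query template: the table and column names are illustrative
# assumptions, not the schema used by the Stock Price study.
QUERY_TEMPLATE = Template(
    "SELECT * FROM trades\n"
    "WHERE stock_code = '$stock'\n"
    "  AND trade_date >= '$start'\n"
    "  AND trade_date < '$end';\n"
)


def month_bounds(year, month):
    """Return ISO dates for the first day of this month and of the next."""
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return f"{year:04d}-{month:02d}-01", f"{next_year:04d}-{next_month:02d}-01"


def generate_scripts(stocks, years, out_dir):
    """Write one .sql file per (stock, year, month) and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for stock, year, month in product(stocks, years, range(1, 13)):
        start, end = month_bounds(year, month)
        path = out_dir / f"{stock}_{year:04d}_{month:02d}.sql"
        path.write_text(QUERY_TEMPLATE.substitute(stock=stock, start=start, end=end))
        paths.append(path)
    return paths
```

A full grid of exactly 1300 stocks over the seven years 2007 to 2013 would give 1300 × 7 × 12 = 109,200 combinations. The source says only "over 1300" stocks and "over 80,000" scripts, and does not describe how the grid was built or pruned.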

The architecture of the embarrassingly parallel Stock Price infrastructure hosted on the Nectar Research Cloud consists of a large memory machine running Ubuntu LTS, and a number of smaller staging and analysis machines, also based on Ubuntu LTS and Windows.

The project uses MySQL and bash scripts to facilitate the templating of SQL jobs. Scripts are generated on the small VM and then moved to the big memory machine for execution. The number of scripts on the big memory machine is monitored and is held at a pre-set maximum. On completion of a script on the MySQL database the next available script is launched automatically.
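
The throttling described above, where a pre-set maximum number of scripts is in flight and the next one starts as soon as one completes, resembles a bounded worker pool. The sketch below shows that pattern under stated assumptions: the database name `stockprice` is hypothetical, the `mysql` command-line client is assumed to be on the PATH, and the team's own implementation (bash based, per the source) may monitor and launch scripts quite differently.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_mysql_cli(script_path, database="stockprice"):
    """Feed one script to the mysql client on stdin (database name assumed)."""
    with open(script_path, "rb") as handle:
        subprocess.run(["mysql", database], stdin=handle, check=True)
    return script_path


def run_throttled(script_paths, max_running, run_one=run_with_mysql_cli):
    """Run every script, never more than max_running at once.

    The pool starts the next queued script as soon as a worker frees up.
    Returns (finished, failed), where failed holds (path, exception) pairs.
    """
    finished, failed = [], []
    with ThreadPoolExecutor(max_workers=max_running) as pool:
        futures = {pool.submit(run_one, path): path for path in script_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                finished.append(path)
            except Exception as error:  # record the failure and keep going
                failed.append((path, error))
    return finished, failed
```

Passing `run_one` as a parameter keeps the scheduling logic separate from how a script is executed, so the same function can be exercised against a stand-in before pointing it at a real database.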

Testing of a single script may be performed on the staging database after some automatic modifications have been made to make it compatible for individual execution.

Below is an example chart comparing the execution time for an SQL script on each of the four underlying database storage technologies now available on a big memory machine on the Monash node of the Research Cloud: MySQL’s Memory Storage Engine, a RAM drive, a flash drive and a mounted volume (a separate Ceph cluster via RoCE). Clearly, the memory engine (blue line) provides the best performance.

“Repeating our published benchmark, the MySQL memory engine is approximately forty times better than the flash drive and mounted volume. This outstanding Memory engine performance occurs because the memory engine is internal to MySQL, thereby avoiding input/output lags required of the file system.”

Associate Professor Paul Lajbcygier
Stock Price search performance based on storage backends.

The team then sought observables to explain why the memory approach performed so well. Below is an example of system load as recorded using Ganglia while running the same SQL query. To the left is the CPU load for the Memory Engine, followed by examples of the CPU load for the Flash Drive, Mounted Volume and RAM Drive.

“It is possible to see that the Memory engine utilises all 120 CPU processes consistently, in contrast to the right hand graph which shows other memory methods which do not efficiently utilise the new hardware and incur overheads due to the requirement that they must use the file system.”

Associate Professor Paul Lajbcygier

In addition to fine-tuning MySQL to the specialist hardware (what they named the “Big Memory Machine”), the benchmarking necessitated the integration of a bespoke Microsoft Windows ecosystem of tools. They used the open source tool HeidiSQL to both visualise and automate the decomposition of the analysis problem into 120 SQL scripts executing in parallel.

Parallel Stock Price SQL executions.

To summarise, we asked Paul about the overall impact of the revamped Stock Price infrastructure on answering their research questions.

“We’re able to utilise data analysis on a more comprehensive data set including the ASX and the US NASDAQ, perform rapid prototyping with quick feedback; and complete analyses that would be intractable using the previous infrastructure.”

Associate Professor Paul Lajbcygier

This article can also be found, published under Creative Commons, here 2.

Triple Mechanism Cognitive Impulsivity Battery

Professor Antonio Verdejo-Garcia and colleagues from the Verdejo-Garcia Laboratory at the Turner Institute for Brain and Mental Health engaged a local video game developer, Torus Games, to write computer games that are used to better understand impulsivity (a common symptom of substance addictions, obesity and eating disorders). We helped migrate this application to the Research Cloud, where barriers to infrastructure scaling, reuse and appropriate data governance have been removed. Back in 2018 the lab asked for advice on publishing apps online. It turned out the applications are a battery of interactive web applications designed to measure the cognitive impulsivity of their users. This project was supported by an ARC Linkage Project (LP150100770), which aimed to study and measure the cognitive skills that can produce (or avoid) impulsive human behaviour.

The project engaged a local third-party developer (Torus Games) to develop the games (ahem, cognitive applications) to a pre-production level using Google’s Firebase platform (which makes it easier for developers to build iOS, Android and web apps). Our job was to help the lab migrate the application into the Nectar Research Cloud at Monash (R@CMon), which also meant mapping a pathway away from Firebase. The project and its data custodians have full governance over the data captured by the applications as part of their data collection activities.

The resulting application suite, the “Cognitive Impulsivity Suite (CIS)”, is a series of connected services. The web app itself is a Unity-based WebGL build with a RESTful API using an ASP.NET backend. The app stores the user observations (generated “measures”) from the “trials” in a relational database backend. A Windows-based server is required to host the .NET-based application using the Internet Information Services (IIS) web server. R@CMon provided the required cloud resources and web configuration (.NET, IIS) to migrate the “CIS” application suite from Google Firebase. A high-level pipeline diagram of the CIS deployment is shown below.

The Cognitive Impulsivity Suite (CIS) pipeline on the Monash Research Cloud

Three years on, the lab is currently conducting five major projects using the “CIS” infrastructure on the Research Cloud. The largest of these studies assesses impulsivity in more than 1000 US and Australian help-seeking and anonymous participants with drug, alcohol and gambling problems. Two articles 1 2 have been published recently from various impulsivity studies. These research outcomes have utilised the full capabilities of the “Cognitive Impulsivity Suite (CIS)” on the Research Cloud.

The research cloud has enabled us to reliably collect and store cognitive impulsivity data for thousands of participants around the world. We have been able to run several instances of the CIS task, related to different projects at one time. This has been fundamental to the success of each research project and the ability to efficiently separate participant data. We are grateful for the ongoing support we have received from the research cloud team. They have quickly responded to our needs and have offered valued solutions and technical support to enable each of our projects to run smoothly.

Alexandra Anderson, Addiction and Impulsivity Research (AIR) Lab, Turner Institute for Brain and Mental Health, Monash University.

This article can also be found, published under Creative Commons, here 3.

Monash University, NVIDIA and ARDC partner to explore the offloading of security in collaborative research applications

Collaboration in the research sector (universities) has an impact on infrastructure that is a microcosm for the future Internet. 

Why is this? Researchers are increasingly connected, increasingly participating in grand challenge problems, and increasingly reliant on technology. Problem solving for big global challenges, as distinct from fundamental research, can involve large-scale human-related data, which is sensitive and sometimes commercial-in-confidence. Researchers are rewarded for being first to discovery. One way to accelerate discovery is to be the “first to market” with disruptive technology. That is, develop the foundational research discovery tool (think software or instrument that provides the unique lens to see the solution, a “21st century microscope” so to speak). If we think of research communities as instrument designers and builders, they must then build the scientific applications that span the Internet (across local infrastructure, public cloud and edge devices).

What is an example 21st century microscope for a mission-based problem? One example is proving the effectiveness of an experimental machine-learning-based algorithm running on an NVIDIA Jetson-connected edge device controlling a building’s battery. It’s informed by bleeding-edge economics theory, participates in a microgrid of power generators (e.g. solar), storage and consumers (buildings) at the scale of a small city, and is itself connected to the local power grid. Through the Smart Energy City project within the Net Zero Initiative we are doing just that.

A tension is observed between mission-based endeavours involving researchers from any number of organisations, and the responsibility for data governance, which ultimately resides with each researcher’s organisation. Contemporary best practices in technological and process controls add more work for researchers and technology alike, potentially slowing research down. And yet cyber threats are an exponential reality. They cannot be ignored. How do we make it safe and easy for researchers to explore and develop instruments in this ecosystem? How do we create an environment that scales to any number of research missions?

What is the technological and process approach that enables a globe’s worth of individual research contributions to mission-based problems that will also scale with the evolving cyber landscape?

In February, NVIDIA, Monash University’s eResearch Centre, Monash University’s Cyber Risk & Resilience team and the Australian Research Data Commons (ARDC) commenced a partnership to explore the role DPUs play in this microcosm. Monash now hosts ten NVIDIA BlueField-2 DPUs in its Research Cloud, essentially a private cloud. The Research Cloud itself forms part of the ARDC Nectar Research Cloud, Australia’s federated research cloud, which is funded through the National Collaborative Research Infrastructure Strategy (NCRIS). The partnership will explore the paradigm of off-loading (what is ultimately) micro-segmentation onto DPUs, thus removing the burden of increased security from CPUs, GPUs and top-of-rack / top-of-organisation security appliances. Concurrently, Monash is exploring a range of contemporary appliances, micro-segmentation software and automations of research data governance.

Steve Quenette, Deputy Director of the Monash eResearch Centre and lead of this project states:

“Micro-segmenting per-research application would ultimately enable specific datasets to be controlled tightly (more appropriately firewalled) and actively & deeply monitored, as the data traverses a researcher’s computer, edge devices, safe havens, storage, clouds and HPC. We’re exploring the idea that the boundaries of data governance are micro-segmented, not the organisation or infrastructures. By offloading technology and processes to achieve security, the shadow-cost of security (as felt by the researcher, e.g. application hardening and lost processing time) is minimised, whilst increasing the transparency and controls of each organisation’s SOC. It is a win-win to all parties involved.”

Dan Maslin, Monash University Chief Information Security Officer:

“As we continue to push the boundaries of research technology, it’s important that we explore new and innovative ways that utilise bleeding edge technology to protect both our research data and underpinning infrastructure. This partnership and the exploratory use of DPUs is exciting for both Monash University and the industry more broadly.”

Carmel Walsh, Director eResearch Infrastructure & Service, ARDC:

“To support research at a national and international level requires investment in leading edge technology. The ARDC is excited to partner with the Monash eResearch Centre and NVIDIA to explore how to apply DPUs to research computing and how to scale this technology nationally to provide our Australian researchers with the competitive advantage.”

This is an example of the emerging evolution in security technology towards “security everywhere”, or distributed security. By making the security function orthogonal to the application (including the operating system), the data centre (Monash in this case) can effect its own chosen depth of introspection and enforcement, at the same rate that clouds and applications are growing.

“The transformation of the data center into the new unit of computing demands zero-trust security models that monitor all data center transactions in real time,” said Ami Badani, Vice President of Marketing at NVIDIA. “NVIDIA is collaborating with Monash University on pioneering cybersecurity breakthroughs powered by the NVIDIA Morpheus AI cybersecurity framework, which uses machine learning to anticipate threats with real-time, all-packet inspection.”

We are presently forming the team involving cloud and security office staff, and performing preliminary investigations in our test cloud. We’re expecting to communicate findings incrementally over the year.

Secure safehaven for the ASPREE clinical trial – The need

The year 2017 began with the ASPREE Data Management Team seeking advice on an emerging need for a collaborative sensitive analysis environment. Back then HPC and clouds were very much the realm of categorically non-sensitive data, and secure (“red zoned”) systems were very much the realm of categorically non-collaborative data. This bifurcation was rife in health and social sciences. This Research Cloud engagement with ASPREE was seminal work to transition the Monash research environment towards a continuum (rather than bifurcation) between collaboration and sensitive expectations.

The ASPREE team had a single commodity physical PC located at the ASPREE office in the School of Public Health and Preventive Medicine. Despite the ASPREE team streamlining processes to appropriately allow project collaborations, an innovation in its own right, collaborators could only perform analysis by being physically in the office. The protocol required data custodians to copy ASPREE phenotypic datasets (via USB sticks) onto the PC, whilst also physically disconnecting the ethernet cable to ensure no unintended access. Collaborators would fly into Melbourne just to run their analysis. This logistically-taxing workflow made collaboration hard and significantly delayed research outcomes. As new project requests emerged from ASPREE sub studies, it became apparent that data management, data governance and the analysis ecosystem would need to be revamped to support the growing demand. A scalable and secure “safe haven” was required.

The Research Cloud team approached the situation from a pragmatic point of view. The team first critiqued the scalability of the analysis environment. We discovered the environment would require security-hardening to protect against intentional and unintentional data leakage. Furthermore, the interface needed improvement to become intuitive for non-academic, external and international collaborators.

Figure 1. A typical ASPREE Analysis Environment

Fortunately Leostream, a remote desktop (VDI) scheduling platform, was already being used by virtual laboratories on the Monash zone of the Research Cloud. Leostream provides a high-level interface for allocating remote desktops to users. It also allows access to these remote desktops through a web-based (HTML5) viewer. To scale out the analysis environment, the team deployed a number of Windows-based instances on the Monash zone of the Research Cloud. These instances have been pre-configured with analytical tools chosen by the ASPREE community (e.g. R/RStudio, SAS, SPSS) and connected to the Monash license servers. A typical analysis environment is shown in Figure 1 above. Access is managed through the Monash Active Directory (Domain) and each analysis instance has been configured with Group Policy Objects (GPOs). These GPOs enforce a number of rules or security controls inside the instances, e.g. preventing users from changing desktop settings, blocking access to registry tools and much more. A reserved set of hypervisors hosts these secure instances, which also reside on a segregated private network. Hyper-threading has been turned off on the hypervisors to minimise the risk of Spectre/Meltdown-type vulnerabilities.

Figure 2. ASPREE safehaven architecture

A high-level architecture diagram for the ASPREE safe haven is shown in Figure 2. Monash eResearch Centre’s Research Data Storage (RDS) provides a scalable storage backend to the safe haven. The team augments the storage pool with further controls to appropriately segregate the data. A dedicated user share is created for each approved ASPREE user. This user share is automatically mounted into the analysis environment upon user login. ASPREE data custodians (managers) have elevated rights to the safe haven storage. They can review (approve or deny) what data comes in (ingress) and what data goes out (egress). Thus the technology / workflow automates the overall data governance of the ASPREE clinical trial by incorporating it into their own access management system (AMS).
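
The custodian review of ingress and egress described above can be pictured as a small approval state machine. The following is a hedged sketch only: the class names, fields and rules are hypothetical illustrations of the workflow as described, and do not represent the ASPREE access management system or the safe haven's actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Direction(Enum):
    INGRESS = "ingress"  # data coming into the safe haven
    EGRESS = "egress"    # data leaving the safe haven


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class TransferRequest:
    user: str
    filename: str
    direction: Direction
    status: Status = Status.PENDING
    reviewed_by: Optional[str] = None


@dataclass
class SafeHaven:
    """Hypothetical model: only custodians may approve or deny transfers."""
    custodians: set
    requests: list = field(default_factory=list)

    def submit(self, user, filename, direction):
        request = TransferRequest(user, filename, direction)
        self.requests.append(request)
        return request

    def review(self, request, reviewer, approve):
        if reviewer not in self.custodians:
            raise PermissionError(f"{reviewer} is not a data custodian")
        if request.status is not Status.PENDING:
            raise ValueError("request has already been reviewed")
        request.status = Status.APPROVED if approve else Status.DENIED
        request.reviewed_by = reviewer

    def may_transfer(self, request):
        # Nothing moves in or out until a custodian has approved it.
        return request.status is Status.APPROVED
```

The design point this illustrates is default deny: a request starts as pending, and data moves only after an explicit custodian decision that is recorded against the request.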

Now operational for more than 3 years, the Research Cloud at Monash and Helix teams cooperate to provide user support for ASPREE safe havens. Several other registries and clinical trials have leveraged this ASPREE solution as their own safe haven. To date, more than 100 internal and international collaborators have used the ASPREE safe haven. This work has been foundational to Monash eResearch, the Research Cloud and Helix’s initiatives towards the next-generation safe havens (e.g. SeRP, which further automates and audits generalised governance workflows).

“The ASPREE data is an NIH-supported clinical trial, and the NIH rightly demands full accountability for data handling. The team has been understanding, professional, flexible and fast. They gave extra consideration for ASPREE’s urgent need (in 2016-17) to share our large and unique dataset to collaborators, whilst also supporting confidentiality in an active clinical trial. The co-design approach took into consideration our Data Manager’s detailed requirements and produced an excellent environment for effective use and international collaboration centred on ASPREE data. The successfully funded extension study ASPREE-XT depended on getting this right.”

Dr Carlene Britt, ASPREE Senior Research Manager and ASPREE Data Custodian

This article can also be found, published under Creative Commons, here 1.

Kaptive – How novel searches within bacterial genomic data are presented and hosted on R@CMon

Dr. Kelly Wyres is a research fellow in the Holt Lab. Kelly first approached the Research Cloud at Monash team in 2019. She sought assistance to migrate their bioinformatics web application, Kaptive, to Monash infrastructure. Kaptive is a user-friendly tool for finding known loci within one or more pre-assembled genomes, specifically for the identification of Klebsiella surface polysaccharide loci. It presents these results in a novel and intuitive web interface, helping the user to rapidly gain confidence in locus matches. Kaptive was developed and is currently maintained by Kelly Wyres, Ryan Wick and Kathryn Holt at Monash University. It also uses bacterial reference databases that are carefully curated by Kelly Wyres and Johanna Kenyon from Queensland University of Technology.

Wick RR, Heinz E, Holt KE and Wyres KL 2018. Kaptive Web: user-friendly capsule and lipopolysaccharide serotype prediction for Klebsiella genomes. Journal of Clinical Microbiology: 56(6). e00197-18

The R@CMon team provided its standard LAMP platform to host Kaptive on the Research Cloud. This included helping Kelly transition Kaptive from its original web2py framework (used to quickly create web applications) to a production-grade LAMP stack including a dedicated web server and storage backend. Now transitioned, the team can efficiently and effectively operate Kaptive alongside a critical mass of other domain-specific LAMP-based applications across all disciplines of research. The team also assisted in applying additional security controls (e.g. HTTPS/SSL, reCAPTCHA) on the server to improve its security posture. As a measure of impact, more than 3000 searches (and associated computing jobs) have been submitted to Kaptive to date. As new reference databases are curated and become ready, they will be incorporated into Kaptive and made available to the research community.

This article can also be found, published under Creative Commons, here 1.

Co-designing clouds for the data future of fintech: the next generation of StockPrice infrastructure

We first discussed the emergence of “big data”, and its impact on computing and storage needs, with Associate Professor Paul Lajbcygier and his team in 2014. The initial engagement with the Research Cloud at Monash enabled the “Stock Price Impact Models Study” to get off the ground with immediate high-impact research output. A few months later, in 2015, we showcased their incremental update to the study, “Stock Price Impact Models Study on R@CMon Phase 2 (Update)”, which produced another high-impact publication. Then in 2018, Associate Professor Paul Lajbcygier and Senior Lecturer Huu Nhan Duong held the “Monash workshop on financial markets” at Monash University, attracting highly prominent Australian and international researchers to talk about topics such as “market design and quality”; “high frequency trading”; “volatility and liquidity modelling”; and many more.

Pham, Manh Cuong and Duong, Huu Nhan and Lajbcygier, Paul, A Comparison of the Forecasting Ability of Immediate Price Impact Models (September 18, 2015). Available at SSRN: https://ssrn.com/abstract=2515667 or http://dx.doi.org/10.2139/ssrn.2515667

Fast forward to 2020 and despite the current world and local circumstances, Paul and his team continue to excel in producing more high impact research outcomes. Their recent successes include a “Journal of Economic Dynamics and Control” publication entitled “The effects of trade size and market depth on immediate price impact in a limit order book market” and an Interfaculty Seeding Grant with the Monash Business School and Faculty of Information Technology to study high frequency trading using machine learning methodologies. There are also numerous research outputs to be submitted towards the end of 2020 and many more towards Q1 of 2021. This surge in high impact outputs correlates to a recent optimisation in the way big queries are executed on the memory engine of the underlying R@CMon-hosted database.

The speed up compared to previous data runs is around four times. This means we can now use more of the memory in the big memory machine effectively.

Paul Lajbcygier, Associate Professor, Banking & Finance, Monash Data Futures Institute

The R@CMon team is currently preparing for the next round of cloud resources uplift in 2021, where “persistent memory” (e.g. Intel Optane DC) components are being considered for inclusion in the resource pool (flavours) available to research cloud users. This could provide even more substantial speedups to big queries on stock price big data. Once ready, the R@CMon team will engage Paul’s team again to utilise these resources.

This article can also be found, published under Creative Commons, here 1.

iLearn on R@CMon

An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data: the impact of making machine learning good practice readily available to the community.

Associate Professor Jiangning Song is a long-standing user of the Monash Research Cloud (R@CMon). He is the lead of the Song Lab within the Monash Biomedicine Discovery Institute. Jiangning’s journey began with the deployment of the Protease Specificity Prediction Server – PROSPER app in 2014. Since then the lab has launched more than 30 bioinformatics web services, all of which are made available to research communities worldwide.

Their latest contribution, iLearn, addresses key obstacles to the adoption of machine learning applied to sequencing data. Well-annotated DNA, RNA and protein sequence data is increasingly accessible to all biological researchers. However, at the scale of this data it is challenging, if not impossible, for an individual to investigate manually. Similarly, another obstacle to broad-scale access is that investigation and validation through wet laboratory experiments is time consuming and expensive. Hence, when presented appropriately, machine learning can play an important role in making higher-level biological data accessible to many researchers in the biosciences.

Many of the previous works and tools only focus on a specific step within a data-processing pipeline. The user is then responsible for chaining these tools together, which in most cases is challenging due to incompatibilities between tools and data formats. iLearn has been designed to address these limitations, using common patterns informed by the lab and its collaborators.

An emerging breakdown of the pipeline steps is:

  • Feature extraction
  • Clustering
  • Normalization
  • Selection
  • Dimensionality reduction
  • Predictor extraction
  • Performance evaluation
  • Ensemble training
  • Results visualisation

iLearn packages these steps for use in two ways. Users can use iLearn through an online environment (web server) or as a stand-alone Python toolkit. Whether your interest is in DNA, RNA or protein analysis, iLearn provides a common workflow pattern for all three cases. Users input their sequence data (normally in FASTA format), and then enter various descriptors and parameters for the analysis.
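
To make the first steps of such a workflow concrete (reading FASTA input, extracting features, normalising them), here is a minimal sketch in plain Python. It does not use or reproduce iLearn's own API: the function names are hypothetical, and k-mer frequency is chosen here only as a simple, widely used sequence descriptor for illustration.

```python
from collections import Counter
from itertools import product


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Frequency of every possible k-mer over the alphabet, in a fixed order."""
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows) or 1  # avoid dividing by zero for short sequences
    return [counts["".join(p)] / total for p in product(alphabet, repeat=k)]


def min_max_normalise(rows):
    """Rescale each feature column to [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


fasta_text = ">seq1\nACGT\n>seq2\nAAAA\n"
records = parse_fasta(fasta_text)
features = [kmer_frequencies(sequence) for sequence in records.values()]
normalised = min_max_normalise(features)
```

Each sequence becomes a fixed-length vector (16 values for 2-mers over a four-letter alphabet), which is the form that later steps such as clustering, selection and predictor construction would typically consume.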

The results page shows the various outputs, once again informed by the Lab’s good practices. They can be downloaded from the web server in various formats (e.g. CSV, TSV). High quality diagrams and visualisations are also generated by iLearn within the web server.

Since iLearn’s release, more than 5,000 unique users have used the web server worldwide. The user community and resultant impact continue to grow, with 60 citations since the tool’s seminal publication.

iLearn has been used as an efficient and powerful complementary tool for orchestrating machine-learning-based modelling, which in turn speeds up biomedical discoveries through genomics and data analysis. As new descriptors are developed and optimised, iLearn aims to incorporate these into future releases to further improve its performance, with the R@CMon team providing support to tackle the potential increase in computational and storage complexities.

This article can also be found, published under Creative Commons, here 1.