Category Archives: MeRC

iLearn on R@CMon

An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data : the impact of making machine learning good practice readily available to the community.

Associate Professor Jiangning Song is a long-standing user of the Monash Research Cloud (R@CMon). He is the lead of the Song Lab within the Monash Biomedicine Discovery Institute. Jiangning’s journey began with the deployment of the Protease Specificity Prediction Server – PROSPER app in 2014. Since then the lab has launched more than 30 bioinformatics web services, all of which are made available to research communities worldwide.

Their latest contribution, iLearn, addresses key obstacles to the adoption of machine learning applied to sequencing data. Well-annotated DNA, RNA and protein sequence data is increasingly accessible to all biological researchers. However, at the scale of this data it is challenging if not impossible for an individual to manually investigate. Similarly, another obstacle to broad scale access is that investigation and validation through wet laboratory experiments is time consuming and expensive. Hence when presented appropriately, machine learning can play an import role making higher-level biological data accessible to many researchers in the biosciences.

Many of the previous works and tools only focus on a specific step within a data-processing pipeline. The user is then responsible for chaining these tools together, which in most cases is challenging due to incompatibilities between tools and data formats. iLearn has been designed to address these limitations, using common patterns informed by the lab and its collaborators.

An emerging breakdown of the pipeline steps is:

  • Feature extraction
  • Clustering
  • Normalization
  • Selection
  • Dimensionality reduction
  • Predictor extraction
  • Performance evaluation
  • Ensemble training
  • Results visualisation

iLearn packages these steps for use in two ways. Users can use iLearn through an online environment (web server) or as a stand-alone python toolkit. Whether your interest is in DNA, RNA or protein analysis, iLearn provides a common workflow pattern for all three cases. Users input their sequence data (normally in FASTA format), and then enters various descriptors and parameters for the analysis.

The results page shows the various output, once again informed by the Lab’s good-practices. They can be downloaded from the web server in various formats (e.g CSV, TSV). High quality diagrams and visualisations are also generated by iLearn within the web server:

Since iLearn’s release, more than 5K unique users have used the web server worldwide. The user community and resultant impact continues to grow, with 60 citations since the tool’s seminal publication.

iLearn has been used as an efficient and powerful complementary tool for orchestrating machine-learning-based modelling which in turn improves the speed in biomedical discoveries through genomics and data analysis. As new descriptors get developed and optimised, iLearn aims to incorporate these into future releases to further improve its performance with the R@CMon team providing support to tackle the potential increase in computational and storage complexities.

This article can also be found, published created commons here 1.

Monash Business School Financial Markets Workshop

Last April 30 to May 1, Associate Professor Paul Lajbcygier and Senior Lecturer Huu Nhan Duong from the Monash Business School organised a Financial Markets Workshop at Monash Caulfield Campus, bringing in a number of prominent Australian and international market microstructure researchers as well as high-profile high frequency traders and regulators from the US. The workshop covered several research topics such as “market design and quality”; “high frequency trading”; “volatility and liquidity modelling”; “short selling”; “stock market crashes”; “cryptocurrencies”; and the real effect of financial markets on corporate decisions. The R@CMon team has worked with Paul’s group for several years now, supporting their “big data analysis” workflows on the research cloud. Enabling them to crunch more data, which contributed in several high-impact publications, ARC grant submissions and attainment of a major SEED funding. The international financial workshop event marks the culmination of Paul’s groups accomplishments in high frequency trading research over the years and serves as foundation for future critical mass of research in financial markets. The R@CMon team will continue to support Paul’s group and the Department of Banking and Finance as they work on more high-impact research and in tackling various computational challenges that they may encounter along the journey.

Ceph placement group (PG) scrubbing status

Ceph is our favourite software defined storage system here at R@CMon, underpinning over 2PB of research data as well as the Nectar volume service. This post provides some insight into the one of the many operational aspects of Ceph.

One of the many structures Ceph makes use of to allow intelligent data access as well as reliability and scalability is the Placement Group or PG. What is that exactly? You can find out here, but in a nutshell PGs are used to map pieces of data to physical devices.  One of the functions associated with PGs is ‘scrubbing’ to validate data integrity. Let’s look at how to check the status of PG scrubs.

Let’s find a couple of PGs that map to osd.0 (as their primary):

[admin@mon1 ~]$ ceph pg dump pgs_brief | egrep '\[0,|UP_' | head -5
dumped pgs_brief
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
57.5dcc active+clean [0,614,1407] 0 [0,614,1407] 0 
57.56f2 active+clean [0,983,515] 0 [0,983,515] 0 
57.55d8 active+clean [0,254,134] 0 [0,254,134] 0 
57.4fa9 active+clean [0,177,732] 0 [0,177,732] 0
[admin@mon1 ~]$

For example, the PG 57.5dcc has an ACTING osd set [0, 614, 1407]. We can check when the PG is scheduled for scrubbing on it’s primary, osd.0:

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.5dcc"))'
{
 "pgid": "57.5dcc",
 "sched_time": "2018-04-11 06:17:39.770544",
 "deadline": "2018-04-24 03:45:39.837065",
 "forced": false
}
[root@osd1 admin]#

Under normal circumstances, the sched_time and deadline are determined automatically by OSD configuration and effectively define a window during which the PG will be next scrubbed. These are the relevant OSD configurables:

[root@osd1 admin]# ceph daemon osd.0 config show | grep scrub | grep interval
 "mon_scrub_interval": "86400",
 "osd_deep_scrub_interval": "2419200.000000",
 "osd_scrub_interval_randomize_ratio": "0.500000",
 "osd_scrub_max_interval": "1209600.000000",
 "osd_scrub_min_interval": "86400.000000",
 [root@osd1 admin]#

[root@osd1 admin]# ceph daemon osd.0 config show | grep osd_max_scrub
 "osd_max_scrubs": "1",
 [root@osd1 admin]#

What happens when we tell the PG to scrub manually?

[admin@mon1 ~]$ ceph pg scrub 57.5dcc
 instructing pg 57.5dcc on osd.0 to scrub
[admin@mon1 ~]$
[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.5dcc"))'
 {
 "pgid": "57.5dcc",
 "sched_time": "2018-04-12 17:09:27.481268",
 "deadline": "2018-04-12 17:09:27.481268",
 "forced": true
 }
 [root@osd1 admin]#

The sched_time and deadline have updated to now, and forced has changed to ‘true’. We can also see the state has changed to active+clean+scrubbing:

[admin@mon1 ~]$ ceph pg dump pgs_brief | grep '^57.5dcc'
 dumped pgs_brief
 57.5dcc active+clean+scrubbing [0,614,1407] 0 [0,614,1407] 0
 [admin@mon1 ~]$

Since the osd has osd_max_scrubs configured to 1, what happens if we try to scrub another PG, say 57.56f2:

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.56f2"))'
 {
 "pgid": "57.56f2",
 "sched_time": "2018-04-12 01:45:52.538259",
 "deadline": "2018-04-25 00:57:08.393306",
 "forced": false
 }
 [root@osd1 admin]#

[admin@mon1 ~]$ ceph pg deep-scrub 57.56f2
 instructing pg 57.56f2 on osd.0 to deep-scrub
 [admin@mon1 ~]$

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.56f2"))'
 {
 "pgid": "57.56f2",
 "sched_time": "2018-04-12 17:11:37.908137",
 "deadline": "2018-04-12 17:11:37.908137",
 "forced": true
 }
 [root@osd1 admin]#

[admin@mon1 ~]$ ceph pg dump pgs_brief | grep '^57.56f2'
 dumped pgs_brief
 57.56f2 active+clean [0,983,515] 0 [0,983,515] 0
 [admin@mon1 ~]$

The OSD has updated sched_time, deadline and set ‘forced’ to true as before. But the state is still only active+clean (not scrubbing), because the OSD is configured to process a max of 1 scrub at a time. Soon after the first scrub completes, the second one we initiated begins:

[admin@mon1 ~]$ ceph pg dump pgs_brief | grep '^57.56f2'
 dumped pgs_brief
 57.56f2 active+clean+scrubbing+deep [0,983,515] 0 [0,983,515] 0
 [admin@mon1 ~]$

You will notice after the scrub completes, the sched_time is again updated. The new timestamp is determined by the osd_scrub_min_interval (1 day) and osd_scrub_interval_randomize_ratio (0.5). Effectively, it  randomizes the next scheduled scrub between 1 and 1.5 days since the last scrub:

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.56f2"))'
 {
 "pgid": "57.56f2",
 "sched_time": "2018-04-14 02:37:05.873297",
 "deadline": "2018-04-26 17:36:03.171872",
 "forced": false
 }
 [root@osd1 admin]#

What is not entirely obvious is that a ceph pg repair operation is also a scrub op and lands in the same queue of the primary OSD. In fact, a pg repair is a special kind of deep-scrub that attempts to fix irregularities it finds. For example, lets run a repair on PG 57.5dcc and check the dump_scrubs output:

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.5dcc"))'
{
 "pgid": "57.5dcc",
 "sched_time": "2018-04-14 03:43:29.382655",
 "deadline": "2018-04-26 17:18:37.480484",
 "forced": false
}
[root@osd1 admin]#

[admin@mon1 ~]$ ceph pg dump pgs_brief | grep '^57.5dcc'
dumped pgs_brief
57.5dcc active+clean [0,614,1407] 0 [0,614,1407] 0 
[admin@mon1 ~]$ ceph pg repair 57.5dcc
instructing pg 57.5dcc on osd.0 to repair
[admin@mon1 ~]$ ceph pg dump pgs_brief | grep '^57.5dcc'
dumped pgs_brief
57.5dcc active+clean+scrubbing+deep+repair [0,614,1407] 0 [0,614,1407] 0 
[admin@mon1 ~]$

[root@osd1 admin]# ceph daemon osd.0 dump_scrubs | jq '.[] | select(.pgid |contains ("57.5dcc"))'
{
 "pgid": "57.5dcc",
 "sched_time": "2018-04-13 16:02:58.834489",
 "deadline": "2018-04-13 16:02:58.834489",
 "forced": true
}
[root@osd1 admin]#

This means if you run a pg repair and your PG is not immediately in the repair state, it could be because the OSD is already scrubbing the maximum allowed PGs so it needs to finish those before it can process your PG. A workaround to get the repair processed immediately is to set noscrub and nodeep-scrub, restart the OSD (to stop current scrubs), then run the repair again. This will ensure immediate processing.

In conclusion, the sched_time and deadline from the dump_scrubs output indicate what could be a scrubdeep-scrub, or repair while the forced value indicates if it came from a scrub/repair command.

The only way to tell if next (automatically) scheduled scrub will be a deep-scrub is to get the last deep-scrub timestamp, and work out if osd_deep_scrub_interval will have passed at the time of the next scheduled scrub:

[admin@mon1 ~]$ ceph pg dump | egrep 'PG_STAT|^57.5dcc' | sed -e 's/\([0-9]\{4\}\-[0-9]\{2\}\-[0-9]\{2\}\) /\1@/g' | sed -e 's/ \+/ /g' | cut -d' ' -f1,21
 dumped all
 PG_STAT DEEP_SCRUB_STAMP
 57.5dcc 2018-03-18@03:29:25.128541
 [admin@mon1 ~]$

In this case, the last scrub was almost exactly 4 weeks ago, and the osd_deep_scrub_interval is 2419200 seconds (4 weeks). By the time the next scheduled scrub comes along, the PG will be due for a deep scrub. The dirty sed command above is due to the pg dump output having irregular spacing and spaces in the time stamp 🙂

GlycoMine on R@CMon

Glycosylation is an ubiquitous type of protein post-translational modification (PTM) in eukaryotic cells, which plays vital roles in various biological processes such as cellular communication, ligand recognition and subcellular recognition. It is estimated that greater than 50% of the entire human proteome is glycosylated. However, it is still a significant challenge to identify glycosylation sites, which requires expensive and laborious experimental research. Thus, bioinformatics approaches that can predict the glycan occupancy at specific sequons in protein sequences would be useful for understanding and utilising this very important PTM.

Predicted N-linked glycosylation sites from two case-study proteins using GlycoMine-Struct

Dr. Jiangning Song from the Department of Biochemistry and Molecular Biology at Monash University and his collaborators have designed and developed a bioinformatics tool – GlycoMine-Struct for predicting glycosylation sites. GlycoMine-Struct is a comprehensive tool for the systematic in-silico identification of N-linked and O-linked glycosylation sites in the human proteome. Through R@CMon, a dedicated cloud project with computational and storage resources has been provisioned to develop and host the GlycoMine-Struct tool. The flexible and scalable R@CMon-powered development environment enabled rapid prototyping, testing and re-deployment of the tool.

GlycoMine-Struct Main Page (http://glycomine.erc.monash.edu/Lab/GlycoMine_Struct/index.jsp#Introduction)

GlycoMine-Struct is now a publicly accessible web service, available to the wider research community. Users can now easily submit protein structure input files in PDB (Protein Data Bank) format to perform sites prediction on GlycoMine-Struct. Since it went public, GlycoMine-Struct has been accessed and used by thousands of local and international users, and still growing. A scientific reports paper has been published, highlighting the collaborative work done to develop GlycoMine-Struct, as an essential bioinformatics tool for improving the prediction of human glycosylation sites. The R@CMon team is actively supporting the GlycoMine-Struct project as it continues to serve the research community and develop performance improvements.

XCMSplus Metabolomics Analysis on R@CMon

At the start of 2017, the R@CMon team had its first user consultation with Dr. Sri Ramarathinam, a research fellow from the Immunproteomics Laboratory (Purcell Laboratory) at the School of Biomedical Sciences in Monash University. Sri and his group at the lab studies metabolomics compounds in various samples by conducting a “search” and “identification” process using a pipeline of analysis and visualisation tools. The lab has acquired the license to use the commercial XCMSPlus metabolomics platform from SCIEX on their workflow. XCMSPlus provides a powerful solution for analysis of untargeted metabolomics data in a stand-alone configuration, which will greatly increase the lab’s capacity to analyse more samples, with faster and easeful results generation and interpretation.

XCMSPlus main login Page, entry point of the complete metabolomics platform

During the first engagement meeting with Sri and the lab, it’s been highlighted that a specialised hosting platform (with appropriate storage and computational capacity) would be required for XCMSPlus. XCMSPlus is distributed as stand-alone appliance (personal cloud) from the vendor. As an appliance, XCMSPlus has been optimised and packaged to be deployed on a single, multi-core and high-memory machine. An added minor complication is that this appliance was distributed in VMWare’s appliance format, which need to be translated into an OpenStack-friendly format. The R@CMon team provided the hosting platform required for XCMSPlus through the Monash node of the Nectar Research Cloud.

Analysis results and visualisation in XCMSPlus

A dedicated Nectar project has been provisioned for the lab, which is now being used for hosting XCMSPlus. This project also has enough capacity for future expansion and new analysis platform deployments. The now R@CMon-hosted (and supported) XCMSPlus platform for the Immunproteomics Laboratory is the first custom XCMSPlus deployment in Australia. Due to being the first in Australia, there were some early minor issues encountered during its first test runs. These technical issues were eventually sorted out due to collaborative troubleshooting efforts from the R@CM team, the lab and the vendor. And after several months of usage, hundred of jobs submitted and processed by XCMSPlus, and counting, the lab is continuing to fully integrate it as part of their analysis workflow. The R@CMon team is actively engaging with the lab for supporting its adaption of XCMSPlus and planning for future analysis workflow expansions.

HIVed Database on R@CMon

Measuring the changes in gene expressions levels and determining differential expressed genes during the processes of human immunodeficiency virus (HIV) infection, replication and latency is instrumental in further understanding HIV infections. These measurements or studies are vital in developing strategies for virus eradication from the human body. Dr. Chen Li, a research fellow from the Immunoproteomics Laboratory at Monash University has developed a novel compendium of comprehensive functional genes annotations from genes expressions and proteomics studies. The genes in the compendium have been carefully curated and shown to be differentially expressed during HIV infection, replication and latency.

The HIVed Online Database, Front Page

The R@CMon team assisted with the deployment of the online database – HIVed on the Monash node of the NeCTAR Research Cloud. The system has been running on R@CMon and serving the public community for more than a year. HIVed is considered to be the first fully comprehensive database that combines datasets from a wide range of experimental studies that have been carefully curated using a variety of experimental conditions. The datasets are further enriched by integrating it with other public databases to provide additional annotations for each data points. The HIVed online database has been developed to facilitate the functional annotation and experimental hypothesis HIV related genes with an intuitive web interface which enables dynamic display or presentation of common threads across HIV latency and infection conditions and measurements. The work done for the development of HIVed has been recently published into Scientific Reports and the Immunoproteomics Laboratory has plans to incorporate new experimental studies and external annotations into the HIVed database as they become available.

Melbourne Weather Server on R@CMon

Dr. Simon Clarke is a senior lecturer from the School of Mathematical Sciences at Monash University. He’s been granted permission by the Bureau of Meteorology to repackage its observational and forecasting Melbourne weather data for downstream analysis and visualisations. Weather data from the bureau is downloaded and processed at regular 10 minute intervals. Various metrics and visualisations are then computed using a custom-made MATLAB batch script developed in-house. The resulting output is then fed to a web server for public presentation with integrations to other external sites hosted by the bureau. The original weather server was housed on a “legacy” hosting platform, that had reached its end of life. The Melbourne weather server needed a new home.

Melbourne Weather Server

The R@CMon team engaged with Simon to scope the various weather server’s hosting requirements. Aside from the traditional LAMP-style type of hosting required, the server also needed direct access to MATLAB’s batch mode functionality. A new R@CMon-hosted instance was deployed on the Monash node of the NeCTAR Research Cloud. With it, a standard LAMP stack was also installed and configured. A Monash University-licensed installation of MATLAB has been made available onto the new weather server, allowing the downstream analysis of the raw data from the bureau to be conducted.

Melbourne Weather Server Visits for 2017

The new Melbourne weather server is now publicly accessible and available across world. The regular live feed is serving the Australian and international community providing live Melbourne weather observations and forecasts. With the support of the R@CMon team, it will continue to do so for more years to come.