Tag Archives: Python

iLearn on R@CMon

An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data : the impact of making machine learning good practice readily available to the community.

Associate Professor Jiangning Song is a long-standing user of the Monash Research Cloud (R@CMon). He is the lead of the Song Lab within the Monash Biomedicine Discovery Institute. Jiangning’s journey began with the deployment of the Protease Specificity Prediction Server – PROSPER app in 2014. Since then the lab has launched more than 30 bioinformatics web services, all of which are made available to research communities worldwide.

Their latest contribution, iLearn, addresses key obstacles to the adoption of machine learning applied to sequencing data. Well-annotated DNA, RNA and protein sequence data is increasingly accessible to all biological researchers. However, at the scale of this data it is challenging if not impossible for an individual to manually investigate. Similarly, another obstacle to broad scale access is that investigation and validation through wet laboratory experiments is time consuming and expensive. Hence when presented appropriately, machine learning can play an import role making higher-level biological data accessible to many researchers in the biosciences.

Many of the previous works and tools only focus on a specific step within a data-processing pipeline. The user is then responsible for chaining these tools together, which in most cases is challenging due to incompatibilities between tools and data formats. iLearn has been designed to address these limitations, using common patterns informed by the lab and its collaborators.

An emerging breakdown of the pipeline steps is:

Feature extraction
Clustering
Normalization
Selection
Dimensionality reduction
Predictor extraction
Performance evaluation
Ensemble training
Results visualisation

iLearn packages these steps for use in two ways. Users can use iLearn through an online environment (web server) or as a stand-alone python toolkit. Whether your interest is in DNA, RNA or protein analysis, iLearn provides a common workflow pattern for all three cases. Users input their sequence data (normally in FASTA format), and then enters various descriptors and parameters for the analysis.

The results page shows the various output, once again informed by the Lab’s good-practices. They can be downloaded from the web server in various formats (e.g CSV, TSV). High quality diagrams and visualisations are also generated by iLearn within the web server:

Since iLearn’s release, more than 5K unique users have used the web server worldwide. The user community and resultant impact continues to grow, with 60 citations since the tool’s seminal publication.

iLearn has been used as an efficient and powerful complementary tool for orchestrating machine-learning-based modelling which in turn improves the speed in biomedical discoveries through genomics and data analysis. As new descriptors get developed and optimised, iLearn aims to incorporate these into future releases to further improve its performance with the R@CMon team providing support to tackle the potential increase in computational and storage complexities.

This article can also be found, published created commons here ¹.

Monash Data Science on R@CMon

Back in 2015, the Faculty of Information Technology at Monash University has started exploring various data science platforms that are easily available on the web. Many of its researchers including lecturers have used interactive Python and R notebooks on their own desktops and laptops for small and medium-size kind of problems. These interactive notebooks provides ease-of-use, portability and collaboration tools. Very useful features that the faculty decided to use them for teaching and have the software stack installed on the teaching labs computers. Data science courses can then be done on these labs where students run their analyses on the notebook instances running on each lab machines. For sometime, this setup has served the faculty’s teaching requirements really well, but as the number of students grow and more advanced and complex problems are tackled, it has become apparent, that a more scalable and highly-available data science platform is needed.

Training Dataset Visualisation in JupyterHub

The R@CMon team started a journey with the faculty’s staff to evaluate the already available data science platforms. The team first deployed SageMathCloud (SMC) on the Monash node of the NeCTAR Research Cloud and assessed it for a couple of months. SageMath and its cloud version – SageMathCloud (SMC) are open-source platforms for mathematical and scientific analyses. It provides a similar intuitive and interactive interface for running models and generating visualisations. The most attractive feature of SMC is that it’s been developed as a teaching platform from the outset, so various plugins for teacher-student interactions were already developed and available, for example: notebook sharing and marking. Although SMC is open-source, the R@CMon team encountered various setup and deployment issues. The team was able to deploy a basic setup of SMC eventually with key features. The developers and maintainers of SMC have been consulted for support but didn’t at that time support private deployments. The next available data science platform was then assessed.

Samples Distribution Visualisation in JupyterHub

The team then moved on to evaluate IBM’s Data Science Workbench (DSW) platform. DSW is not open-sourced and cannot be deployed privately on the research cloud, but at that time, DSW had the requisite analytic (e.g. Python, R) and collaboration features. DSW was used by the faculty to deliver a number of teaching courses. However, after several rounds of teaching courses, licensing issues caused teachers and students to be unable login to DSW, as well as running notebooks crashing. These issues led the faculty to resume the search for another data science platform.

Features Correlation Visualisation in JupyterHub

JupyterHub is a multi-user system for serving interactive notebooks. It provides a comprehensive documentation for various type of deployments and scaling options. Since its inception, JupyterHub has become mainstream in various teaching and research communities. For example, there were some early adopters of JupyterHub for education from UC Berkley. JupyterHub has been used also to provide a publicly accessible and re-runnable model in Nature. These early adopters inspired the R@CMon team and faculty staff to replicate their success stories in the then being developed online course of Graduate Diploma for Data Science.

R Classification Visualisation in JupyterHub

The R@CMon team deployed an instance of JupyterHub locally on the Monash node of the NeCTAR Research Cloud. The team then coordinated with the relevant lecturers for the configuration of various Python and R libraries (e.g numpy, scipy, ggplots, matplotlib) that will be used for the units. To support a more dynamic user management of JupyterHub, the R@CMon team has integrated it with the Monash User Directory service. This enabled easier addition and removal of users from the system, plus users can use their own Monash credentials to access JupyterHub and do their analysis. To date, and after ~2 years of usage, the R@CMon-hosted JupyterHub service has gone several rounds of teaching periods and served hundreds of students. The R@CMon team is actively engaging with the faculty for future directions in delivering new content (e.g. PySpark) and preparing for the next and more exciting forms of interactive analyses (e.g JupyterLab).

Upcoming Software Carpentry Bootcamp for Bioinformaticians (Adelaide/Melbourne)