Monash Data Science on R@CMon

Back in 2015, the Faculty of Information Technology at Monash University started exploring various data science platforms that are readily available on the web. Many of its researchers, including lecturers, had been using interactive Python and R notebooks on their own desktops and laptops for small and medium-sized problems. These interactive notebooks provide ease of use, portability and collaboration tools. The features proved so useful that the faculty decided to adopt notebooks for teaching and had the software stack installed on the teaching lab computers. Data science courses could then be run in these labs, with students performing their analyses on notebook instances running on each lab machine. For some time, this setup served the faculty’s teaching requirements well, but as the number of students grew and more advanced and complex problems were tackled, it became apparent that a more scalable and highly available data science platform was needed.

Training Dataset Visualisation in JupyterHub

The R@CMon team started a journey with the faculty’s staff to evaluate the data science platforms already available. The team first deployed SageMathCloud (SMC) on the Monash node of the NeCTAR Research Cloud and assessed it for a couple of months. SageMath and its cloud version, SageMathCloud, are open-source platforms for mathematical and scientific analyses. SMC provides a similarly intuitive and interactive interface for running models and generating visualisations. Its most attractive feature is that it was developed as a teaching platform from the outset, so various plugins for teacher-student interactions, such as notebook sharing and marking, were already available. Although SMC is open-source, the R@CMon team encountered various setup and deployment issues, and was eventually able to deploy only a basic setup of SMC with its key features. The team consulted the developers and maintainers of SMC, but at that time they did not support private deployments. The team therefore moved on to assess the next available data science platform.

Samples Distribution Visualisation in JupyterHub

The team then evaluated IBM’s Data Science Workbench (DSW) platform. DSW is not open-source and cannot be deployed privately on the research cloud, but at that time it had the requisite analytics (e.g. Python, R) and collaboration features. The faculty used DSW to deliver a number of courses. However, after several rounds of teaching, licensing issues left teachers and students unable to log in to DSW, and running notebooks began to crash. These issues led the faculty to resume the search for another data science platform.

Features Correlation Visualisation in JupyterHub

JupyterHub is a multi-user system for serving interactive notebooks. It provides comprehensive documentation for various types of deployments and scaling options. Since its inception, JupyterHub has become mainstream in various teaching and research communities. For example, there were some early adopters of JupyterHub for education at UC Berkeley. JupyterHub has also been used to provide a publicly accessible and re-runnable model in Nature. These early adopters inspired the R@CMon team and faculty staff to replicate their success in the online Graduate Diploma of Data Science course, which was then under development.

R Classification Visualisation in JupyterHub

The R@CMon team deployed an instance of JupyterHub on the Monash node of the NeCTAR Research Cloud. The team then coordinated with the relevant lecturers to configure the various Python and R libraries (e.g. numpy, scipy, ggplot2, matplotlib) to be used in the units. To support more dynamic user management, the R@CMon team integrated JupyterHub with the Monash User Directory service. This made it easier to add and remove users, and lets users access JupyterHub and do their analysis with their own Monash credentials. To date, after about 2 years of usage, the R@CMon-hosted JupyterHub service has gone through several teaching periods and served hundreds of students. The R@CMon team is actively engaging with the faculty on future directions for delivering new content (e.g. PySpark) and preparing for the next and more exciting forms of interactive analysis (e.g. JupyterLab).
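As an illustration of the library configuration step, a hub administrator could check that the Python packages agreed with lecturers are importable in the notebook environment before a teaching period starts. The sketch below is not the R@CMon team's actual tooling: the function name `missing_packages` and the list `REQUIRED_PACKAGES` are hypothetical, and the check covers Python packages only (R libraries would need a separate check from within R).

```python
from importlib.util import find_spec

# Hypothetical list for illustration; the Python package names are taken
# from the examples mentioned in the post.
REQUIRED_PACKAGES = ["numpy", "scipy", "matplotlib"]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    missing = []
    for name in names:
        try:
            spec = find_spec(name)
        except (ModuleNotFoundError, ValueError):
            # Treat any lookup failure as "not available".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_packages(REQUIRED_PACKAGES)
    if absent:
        print("Missing packages:", ", ".join(absent))
    else:
        print("All required packages are available.")
```

Run inside the same environment that the notebook kernels use, the script reports which packages still need to be installed without importing them.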