Tag Archives: Using the Cloud

MyTardis for Genomics

The Monash Bioinformatics Platform has recently partnered with MyTardis, the Characterisation Virtual Laboratory (CVL) and R@CMon to develop an automated, structured and managed data management pipeline for sequencing results from the Monash Health Translational Precinct (MHTP) Medical Genomics Facility. The result is the MyTardis-Seq system, an extension of the well-established MyTardis data management platform for Next-Generation Sequencing (NGS) data.

MyTardis with NGS Extension

Information on how MyTardis-Seq integrates with research data storage can be found here. A detailed architecture and workflow background can be found on the MyTardis page.

Big data mining market segmentation of ANZ Bank EFTPOS data

In Australia, the big four banks receive large amounts of Electronic Funds Transfer at Point of Sale (EFTPOS) transaction data every day, yet this information-rich data is neither stored nor analysed. Because EFTPOS data is both very large and very messy, it is difficult for the banks themselves to gain visibility of the characteristics of the data’s stakeholders.

That changed in 2014, when a researcher in Monash’s Faculty of IT, Dr. Grace Rumantir, approached us for assistance in accessing or building a secure analysis environment for a data mining project on a collection of commercially sensitive EFTPOS data, obtained through an award-winning collaboration with the Australia and New Zealand Banking Group (ANZ). To our knowledge, this is the first time market segmentation analyses have been applied to such a large amount of EFTPOS data anywhere in the world.

As a pilot, ANZ collated 5 months of EFTPOS transaction records, with all customer and retailer identifying data redacted. Before this commercial-in-confidence data could be released for research purposes, ANZ produced a comprehensive list of requirements for the secure storage and processing of the data. Securing the release of this data through the ANZ Information Security protocol was a lengthy and difficult process. It succeeded largely because our team was able to demonstrate, with confidence, that the infrastructure we have in place at Monash meets these requirements.

Our team very quickly built a workhorse yet appropriately secure environment on R@CMon (using specialist nodes because of the memory required to process such a large dataset). The R@CMon environment already uses software-defined virtualisation technology. We sandbox servers, and R@CMon is housed in Monash’s own secure access facility. All ingress/egress access was locked down to allow only a few known clients (Grace and her research students). Remote desktop software and several data-mining tools of interest were configured for use by the researchers. The data (in daily CSV samples) was stored in an encrypted volume file, which was uploaded to a R@CMon volume attached to the analysis server. Individual passwords were used to unlock and mount the encrypted data, with a strict usage protocol to ensure the data remained locked when not in use.
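To make the "only a few known clients" idea concrete, here is a minimal Python sketch of an allowlist check. It is illustrative only: the post does not say which mechanism enforced the lock-down, and the name `is_allowed` and the addresses below (taken from the reserved documentation ranges) are assumptions, not the project's real configuration.

```python
import ipaddress

# Hypothetical allowlist: one single host and one small subnet.
# These are reserved documentation addresses, not real client addresses.
ALLOWED_CLIENTS = [
    ipaddress.ip_network("192.0.2.10/32"),
    ipaddress.ip_network("198.51.100.0/28"),
]


def is_allowed(address: str) -> bool:
    """Return True only if the address falls inside an allowlisted network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ALLOWED_CLIENTS)
```

Anything not explicitly listed is rejected, which matches the default-deny posture described above.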

A paper outlining our experience in acquiring, securely storing and processing the EFTPOS data can be found at:

Ashishkumar Singh, Grace Rumantir, Annie South, and Blair Bethwaite, Clustering Experiments on Big Transaction Data for Market Segmentation. In Proceedings of the 2014 International Conference on Big Data Science and Computing (BigDataScience ’14). ACM, New York, NY, USA, Article 16, DOI=http://dx.doi.org/10.1145/2640087.2644161

The market segmentation experiments on the retailers in the EFTPOS data reduce the transaction data using the RFM (Recency, Frequency, Monetary) model and then apply clustering analysis. The results indicate distinct combinations of RFM values among the retailers in each cluster, which could point the bank towards different marketing strategies for each retailer performance category. This ground-breaking revelation of retailer segments extracted from EFTPOS data won the Best Paper Award, Industry Track, at the Australasian Data Mining and Analytics Conference 2014.
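As a rough illustration of the RFM reduction step, the Python sketch below collapses raw transaction rows into one Recency/Frequency/Monetary record per retailer. The row layout, the function name `rfm_by_retailer` and the sample values are made up for illustration, and the definitions used here (days since last transaction, transaction count, total amount) are a common convention rather than necessarily the ones used in the papers.

```python
from collections import defaultdict
from datetime import date


def rfm_by_retailer(transactions, as_of):
    """Reduce (retailer_id, day, amount) rows to one RFM record per retailer."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for retailer_id, day, amount in transactions:
        # Track the most recent transaction date for each retailer.
        if retailer_id not in last_seen or day > last_seen[retailer_id]:
            last_seen[retailer_id] = day
        frequency[retailer_id] += 1
        monetary[retailer_id] += amount
    return {
        retailer_id: {
            "recency_days": (as_of - last_seen[retailer_id]).days,
            "frequency": frequency[retailer_id],
            "monetary": round(monetary[retailer_id], 2),
        }
        for retailer_id in last_seen
    }


# Made-up sample rows, not real EFTPOS data.
sample = [
    ("retailer_a", date(2014, 1, 3), 20.00),
    ("retailer_a", date(2014, 1, 9), 35.50),
    ("retailer_b", date(2014, 1, 2), 400.00),
]
rfm = rfm_by_retailer(sample, as_of=date(2014, 1, 10))
```

Each retailer's three values can then be treated as a feature vector and passed to a clustering algorithm of choice; that step is not shown here.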

Publication references:

Ashishkumar Singh, Grace Rumantir and Annie South, Market Segmentation of EFTPOS Retailers. In Proceedings of the 12th Australasian Data Mining Conference (AusDM 2014), Brisbane, Australia (http://ausdm14.ausdm.org/program) – Best Paper Award Industry Track

Ashishkumar Singh, Grace Rumantir. Two-tiered Clustering Classification Experiments for Market Segmentation of EFTPOS Retailers. Australasian Journal of Information Systems, [S.l.], v. 19, sep. 2015. ISSN 1449-8618. Available at: <http://journal.acs.org.au/index.php/ajis/article/view/1184>. Date accessed: 18 Oct. 2015. doi:http://dx.doi.org/10.3127/ajis.v19i0.1184.

This exciting result has been cited in financial industry publications as an important example of how academia can help businesses gain insights from their own massive amounts of data to support business decisions.

On the success of this collaborative project, Patrick Maes, ANZ Chief Technology Officer, writes:

“The key here is to find the data scientists who can work with these models, a skill not easy to find nowadays”

(see http://www.itnews.com.au/news/me-bank-hires-data-boss-in-it-exec-restructure-411908 and https://bluenotes.anz.com/posts/2015/03/big-data-from-customer-targeting-to-customer-centric ).

On lessons learnt from this important pilot project, Dr. Grace Rumantir says:

“There is a long standing gap between what research in academia can offer and the needs in the industry. This gap takes the form of mistrust on the part of the people in the industry that academics may not deliver a solution that is relevant to their business on a timely manner. The results of this ground breaking project using EFTPOS data shows that we do understand what business needs and come up with a practical solution that business can directly translate into business strategies which can give them an edge in the competitive business environment.

We are able to do this with our ability to talk in the same wavelength with our industry clients, with our research skills in bleeding edge technology and with the support of the world class research support and infrastructure that Monash has been investing heavily on.”

 

Disruptive change in the clinical treatment of pancreatic cancer

Professor Jenkins’ research focuses on pancreatic cancer, an inflammation-associated cancer and the fourth most common cause of cancer death worldwide, with an extremely low five-year survival rate of 5%. Typically, studies compare gene expression patterns between normal and cancerous pancreas in order to identify unique signatures, which can be indicative of sensitivity or resistance to specific chemotherapeutic treatments.

“Using next generation gene sequencing, involving big instruments, big data and big computing – allows near-term disruptive change in the clinical treatment of pancreatic cancer.” Prof. Jenkins, Monash Health

To date, gene expression studies have largely focused on samples taken by open surgical biopsy, a procedure known to be very invasive and only possible in 20% of pancreatic cancers. Prof Jenkins’ group, in collaboration with Dr Daniel Croagh from the Department of Upper Gastrointestinal and Hepatobiliary Surgery at Monash Medical Centre, recently trialled an alternative, less invasive process available to nearly all pancreatic cancer patients. Known as endoscopic ultrasound-guided fine-needle aspirate (EUS-FNA), it uses a thin, hollow needle to collect samples of cells from which genetic material can be extracted and analysed. The challenge then becomes ensuring that gene sequencing from EUS-FNA samples is comparable to that from open surgical biopsy, so that established analysis and treatment can be used.


Twenty-four EUS-FNA-derived genetic samples from normal and cancerous pancreas were sequenced at the MHTP Medical Genomics Facility, producing a total of 40Gb of raw data. The data were securely transferred onto R@CMon by the Monash Bioinformatics Platform for processing, statistical analysis and computational exploration using state-of-the-art bioinformatics methods.

Results thus far from this study show that data from EUS-FNA-derived samples were of high quality and allowed the identification of gene expression signatures that distinguish normal from cancerous pancreas. Professor Jenkins’ group is now confident that EUS-FNA-derived material has the potential not only to capture nearly all pancreatic cancer patients (compared to ~20% by surgery), but also to improve patient management and treatment in the clinic.

“The current clinical genomics research space requires specialized high performance computational and storage infrastructure to support the processing and long term storage of those so-called “big data”. Thus R@CMon plays a major role in the discovery and development of new therapies and the improvement of Human health care in general.” Roxane Legaie, Senior Bioinformatician, Monash Bioinformatics Platform

 

The Digital Object Identifier (DOI) Minter on R@CMon

The Monash Digital Object Identifier (DOI) Minter was developed by the ANDS-funded Monash University Major Open Data Collections (MODC) Project as an extendible service and deployed on the Monash node (R@CMon) of the NeCTAR Research Cloud to provide persistent and unique identifiers for datasets and research publications. A DOI is permanently assigned to a dataset or publication to provide information about it, including where it, or information about it, can be found on the Internet. The DOI will not change even if information about the dataset changes over time.

Store.Synchrotron’s data publishing form using the Monash DOI minter service.

The Monash DOI Minter gives Monash University the ability to mint DOIs for data collections that are hosted and managed by services on R@CMon. Integrating with and accessing DOIs has never been easier. For instance, the Monash Library can now use this service to mint DOIs for publicly accessible research collections. It is also now being used by the Australian Synchrotron’s Store.Synchrotron service, which manages data produced by the Macromolecular Crystallography (MX) beamline and streamlines DOI minting for datasets through a publication workflow.

A demo published collection on Store.Synchrotron.

An MX beamline user can now collect data on the beamline which is stored, archived and made accessible through the Store.Synchrotron service. When the researcher has publication quality data, a copy of this data is deposited in the Protein Data Bank (PDB), with the appropriate metadata. The new publication workflow allows researchers to publish data hosted by the Store.Synchrotron service, with PDB metadata being automatically attached to the datasets, and a DOI being minted and activated after a researcher-selected embargo period. The DOI reference can then be included in their research papers.
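The embargo step in this workflow can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Store.Synchrotron or DOI Minter implementation: the `Publication` class, the `mint_doi` helper, the metadata key and the DOI prefix are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class Publication:
    """Hypothetical record for one published dataset."""
    dataset_id: str
    submitted_on: date
    embargo_days: int  # researcher-selected embargo period
    pdb_metadata: dict = field(default_factory=dict)
    doi: Optional[str] = None

    @property
    def release_date(self) -> date:
        # The DOI becomes active once the embargo period has elapsed.
        return self.submitted_on + timedelta(days=self.embargo_days)

    def is_active(self, today: date) -> bool:
        return self.doi is not None and today >= self.release_date


def mint_doi(publication: Publication, prefix: str = "10.5072") -> str:
    """Assign a DOI string; the prefix here is a placeholder value only."""
    publication.doi = f"{prefix}/{publication.dataset_id}"
    return publication.doi


pub = Publication("mx-demo-001", date(2015, 11, 1), embargo_days=30)
pub.pdb_metadata["pdb_id"] = "XXXX"  # placeholder, attached automatically in the real workflow
mint_doi(pub)
```

In this model the DOI is minted at submission but only counts as active from `release_date` onwards, which mirrors the "minted and activated after a researcher-selected embargo period" behaviour described above.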

We think it is a brilliant pattern of play for accelerating the adoption of persistent identifiers for research data held at universities. To this end, we have made the DOI Minter available for others to instantiate.

R@CMon announced as a Mellanox “HPC Center of Excellence”

At SuperComputing 2015 in Austin, our network/fabric partner Mellanox announced R@CMon (Monash University) as a “HPC Centre of Excellence”. A core goal of the HPC CoE is to drive the technological innovations required for next-generation (exascale) supercomputing, whilst also ensuring that such an exascale computer is relevant to modern research. R@CMon is a standout pioneer in converging cloud, HPC and data, all of which are key to the “next generation”.

“We see Monash as a leader in Cloud and HPC on the Cloud with Openstack, Ceph and Lustre on our Ethernet CloudX platform.” Sudarshan Ramachandran, Regional Sales Manager, Australia & New Zealand

From a fabric innovation point of view, it has been a very productive and exciting 24 months for R@CMon. By early 2014, the internal Monash University HPC system “MCC” was burst onto the Research Cloud, allowing a researcher’s own merit to be leveraged with institutional investment. It also represents a shift towards soft HPC, where the size of a HPC system changes regularly with time. Earlier this year we announced our early adoption of RoCE (RDMA over Converged Ethernet) using Mellanox technologies. This meant the same fabric used for cloud networking could also be used for HPC and data storage backplanes. In turn, MCC on R@CMon also enabled RDMA communications, that is, real HPC performance on an otherwise orchestrated cloud.

Finally, at the Tokyo OpenStack Summit 2015, Mellanox announced R@CMon as debuting the world’s first 100G end-to-end cloud. This technology eases scaling and heterogeneity of performance aspects. In particular, it sets the basis for processor and storage performance for peak and converged cloud/HPC needs. Watch this space!