Tag Archives: Remote Desktop

Stock Price Impact Models Study on R@CMon Phase 2

Paul Lajbcygier, Associate Professor in the Faculty of Business and Economics, Monash University, is studying one of the important changes affecting the cost of trading in financial markets: the effect of trading on prices, known as “price impact”, brought about by the wide propagation of algorithmic and high-frequency trading and augmented by technological and computational advances. Associate Professor Lajbcygier’s group has recently published new results, supported by R@CMon infrastructure and application migration activities, providing new insights into the trading behaviour of so-called “Flash Boys”.

This study uses datasets licensed from Sirca, covering stocks in the S&P/ASX 200 index from 2000 to 2014. These datasets are pre-processed using Pentaho and then ingested into relational databases for detailed analysis using advanced queries. Two NeCTAR instances on R@CMon were used in the early stages of the study: one as the processing engine, with Pentaho and Microsoft Visual Studio 2012 installed for pre-processing and post-processing tasks, and the second as the database server where the extraction queries are executed. Persistent volume storage holds the reference datasets, pre-processed input files and extracted results. A VicNode merit application for research data storage has been submitted to support computational access to the pre-processed data underpinning the analysis workflow on the NeCTAR Research Cloud.

Ingestion of pre-processed data into the database running on the high-memory instance, for analysis.
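
The post doesn’t detail the schema, but the ingestion step follows a familiar pattern: bulk-load the Pentaho output into a relational table that the extraction queries can then traverse. A minimal sketch using Python’s standard-library sqlite3, with a hypothetical table layout (the study used a dedicated database server, and the real schema is richer):

```python
import csv
import sqlite3

# Sketch of the ingestion step: load pre-processed trade records (CSV out
# of Pentaho) into a relational table for the later extraction queries.
# Table and column names are hypothetical, for illustration only.
conn = sqlite3.connect("asx200.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS trades (
        stock       TEXT,
        stock_group TEXT,     -- liquidity/frequency group within the index
        ts          TEXT,     -- trade timestamp
        price       REAL,
        volume      INTEGER
    )
""")

with open("trades_2000_2014.csv", newline="") as f:
    rows = ((r["stock"], r["stock_group"], r["ts"],
             float(r["price"]), int(r["volume"]))
            for r in csv.DictReader(f))
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?)", rows)

conn.commit()
conn.close()
```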

Initially, econometric analyses were done on just the lowest two groups of stocks in the S&P/ASX 200 index. Some performance hiccups were encountered when processing the higher-frequency groups in the index: some of the extraction queries, which require a significant amount of memory, would not complete when run on the much larger stock groups. The release of R@CMon Phase 2 gave the analysis workflow the capability to attack the higher stock groups using a high-memory instance, instantiated on the new “specialist” kit. Parallel extraction queries are now running on this instance (at close to 100% utilisation) to traverse the remaining stock groups across 2000 to 2014.
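
The fan-out itself is straightforward: one worker per stock group, each running its own read-only extraction query, keeps every core of the high-memory instance busy. A minimal sketch using the standard library, continuing the hypothetical schema above (the real extraction queries are far more involved):

```python
import sqlite3
from concurrent.futures import ProcessPoolExecutor

# Sketch of the parallel extraction: one read-only query per stock group,
# run concurrently. Group labels and the aggregate query are placeholders.
STOCK_GROUPS = [f"group_{i:02d}" for i in range(1, 11)]

def extract(group: str):
    conn = sqlite3.connect("file:asx200.db?mode=ro", uri=True)
    rows = conn.execute(
        "SELECT stock, COUNT(*) AS n_trades, SUM(volume) AS total_volume "
        "FROM trades WHERE stock_group = ? GROUP BY stock", (group,)
    ).fetchall()
    conn.close()
    return group, rows

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for group, rows in pool.map(extract, STOCK_GROUPS):
            print(group, len(rows), "stocks extracted")
```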

A recent paper by Manh Pham, Huu Nhan Duong and Paul Lajbcygier, entitled “A Comparison of the Forecasting Ability of Immediate Price Impact Models”, has been accepted for the 1st Conference on Recent Developments in Financial Econometrics and Applications. This paper highlights the results of the examination of the lowest two groups of the S&P/ASX 200 index, i.e., just the initial results. Future research and publications will include examination of the upper groups of the index, based on the latest reference data as it becomes available, and analysis of other price impact models.

This is an excellent example of novel research empowered by specialist infrastructure, and a clear win for a build-it-yourself cloud (you can’t get a 920GB instance from AWS). The researchers are able to use existing and well-understood computational methods, i.e., relational databases, but at much greater capacity than normally available. This has the effect of speeding up initial exploratory work and discovery. Future work may investigate the use of contemporary data-intensive frameworks such as Hadoop + Hive for even larger analyses.

This article can also be found, published under Creative Commons, here.

Spreadsheet of death

R@CMon, thanks to the Monash eResearch Centre’s long history of establishing “the right hardware for research”, prides itself on its effectiveness at computing, orchestrating and storing for research. In this post we highlight an engagement that didn’t yield an “effectiveness” to our liking, and how that helped shape elements of the imminent R@CMon Phase 2.

In the latter part of 2013 the R@CMon team was approached by a visiting student working at the Water Sensitive Cities CRC. His research project involved parameter estimation for an ill-posed problem in ground-water dynamics. He had set up (perhaps partially inherited) an Excel-spreadsheet-based Monte Carlo engine for this, with a front-end sheet providing input and output to a built-in VBA macro for the grunt work – an erm… interesting approach! This had been working acceptably in the small, as he could get an evaluation done within 24 hours on his desktop machine (quad-core i7). But now he needed to scale up and run 11 different models, probably a few times each to tweak the inputs. Been there yourself? This is a very common pattern!

Nash-Sutcliffe model efficiency (Figure courtesy of Eng. Antonello Mancuso, PhD, University of Calabria, Italy)
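
For readers unfamiliar with the metric in the figure: the Nash-Sutcliffe efficiency scores a simulated series against observations, with 1 a perfect fit and values at or below 0 meaning the model predicts no better than the observed mean. A small sketch of the kind of objective the Monte Carlo engine would have been evaluating, on hypothetical data:

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe model efficiency: 1 is a perfect fit; values <= 0
    mean the model is no better than predicting the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical ground-water levels: observations vs. one model realisation.
obs = [4.1, 4.3, 4.0, 3.8, 3.9]
sim = [4.0, 4.4, 4.1, 3.7, 4.0]
print(nash_sutcliffe(obs, sim))  # ~0.66 for this toy data
```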

MCC (the Monash Campus Cluster), our first destination for ‘compute’, doesn’t have any Windows capability, and even if it did, attempting to run Excel in batch mode would have been something new for us. No problem, we thought: we’ll use the RC, give him a few big Windows instances, and he can spread the calculations across them. Not an elegant or automated solution for sure, but this was a one-off with some tight time constraints, so it was more important to start calculations than to get bogged down building a nicer solution.

It took a few attempts to get Windows working properly. We eventually found the handy Cloudbase Solutions trial image and its guidance documentation. But we also ran into issues activating Windows against the Monash KMS; it turns out we had to explicitly select our local network time source rather than the default time.windows.com. We also found some problems with the CPU topology that Nova was giving our guests: Windows was seeing multiple sockets rather than multiple cores, which meant desktop variants were out, as they would ignore most of the cores.
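
On later OpenStack releases, the socket/core topology presented to a guest can be steered through image metadata; a hedged sketch of that style of fix (hw_cpu_sockets and hw_cpu_cores are the documented Nova image properties, but whether a given cloud honours them depends on its release, and the image name here is hypothetical):

```python
import subprocess

# Sketch: ask Nova to expose one socket with several cores, instead of one
# socket per vCPU, so that desktop Windows variants can use all the cores.
subprocess.run(
    ["openstack", "image", "set",
     "--property", "hw_cpu_sockets=1",
     "--property", "hw_cpu_cores=8",
     "windows-server-2012"],   # hypothetical image name
    check=True,
)
```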

Soon enough we had a Server 2012 instance ready for testing. The user RDP’d in and set the cogs turning. Based on the first few Monte Carlo iterations (out of the million he needed for each scenario) he estimated it would take about two days to complete a scenario – quite a lot slower than his desktop, but still acceptable given the overall scale-out speed-up. However, on the third day, after about 60 hours of compute time, he reported it was only 55% complete. That was an unsustainable pace – he needed results within a fortnight – so he and his supervisor resolved to code a different statistical approach (using PEST) that would be more amenable to cluster computing.

We did some rudimentary performance investigation during the engagement and didn’t find any obvious bottlenecks; the guest and host were always very CPU-busy, so the slowdown seemed largely attributable to the lesser floating-point capabilities of our AMD Bulldozer CPUs. We didn’t investigate deeply in this case, and no doubt there could be other elements at play (maybe Windows is much slower for compute on KVM than Linux), but this is now a pattern we’ve seen with floating-point-heavy workloads across operating systems and on bare metal. Perhaps code optimisations for the shared FPU in the Bulldozer architecture could improve things, but that’s hardly a realistic option for a spreadsheet.
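
That kind of rudimentary check doesn’t need heavy tooling; even a pure-Python floating-point loop, timed on each host, is enough to expose a large per-core FPU gap between machines (a rough sketch, not a rigorous benchmark):

```python
import time

def fp_rate(n: int = 5_000_000) -> float:
    """Time a floating-point-heavy loop; compare the rate across hosts."""
    start = time.perf_counter()
    acc = 0.0
    x = 1.000001
    for _ in range(n):
        acc += x * x - acc * 0.5  # arbitrary floating-point work
    return n / (time.perf_counter() - start)

print(f"{fp_rate():,.0f} FP loop iterations/sec")
```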

The AMDs are great (especially thanks to their price) for general-purpose cloud usage; that’s why the RC makeup is dominated by them and why commercial clouds like Azure use them. But for R@CMon’s Phase 2 we want to cater to performance-sensitive as well as throughput-oriented workloads, which is why we’ve deployed Intel CPUs for this expansion. Monash joins the eRSA and NCI Nodes in offering this high-end capability. More on the composition of R@CMon Phase 2 in the coming weeks!

Deakin Bioinformatics Workshop (February 17-19, 2014)

On February 17-19, 2014, a bioinformatics workshop was held at Deakin University’s Geelong Waterfront Campus. The workshop covered genotyping-by-sequencing (GBS) methodologies using various well-known bioinformatics tools. The two main tools used in the workshop were Trait Analysis by aSSociation, Evolution and Linkage (TASSEL) and Bowtie. TASSEL is used to investigate relationships between phenotypes and genotypes, while Bowtie is a fast tool for aligning short DNA sequence reads to a reference genome, such as the human genome.
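
To give a flavour of the hands-on sessions, a basic Bowtie alignment is a single invocation against a pre-built index; a sketch driving it from Python (the index and file names are hypothetical, and bowtie must be installed and on the PATH):

```python
import subprocess

# Sketch of a single-end Bowtie alignment: map reads against a pre-built
# reference index and write SAM output. File names are hypothetical.
subprocess.run(
    ["bowtie",
     "-S",            # emit SAM output
     "hg19_index",    # pre-built reference index (hypothetical)
     "reads.fastq",   # single-end reads (hypothetical)
     "aligned.sam"],
    check=True,
)
```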

Trainees at the Deakin workshop, using the NeCTAR-provisioned training environment.

The workshop was delivered using the NeCTAR Research Cloud infrastructure and the Bioplatforms Australia Training Platform. The R@CMon team supported the workshop organisers at Deakin University in creating a customised cloud image containing the required tools and datasets, as well as ensuring the allocation of computational and storage resources in the cloud. The CloudBioLinux-based cloud image was instantiated for each trainee, giving each one a dedicated virtual desktop environment for their analyses.
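
The per-trainee provisioning is mechanical enough to script; a hedged sketch of booting one instance of the workshop image per trainee via the OpenStack CLI (the image, flavour and key names are hypothetical, and the post doesn’t describe the actual provisioning tooling used):

```python
import subprocess

# Sketch: boot one virtual desktop per trainee from the workshop image.
# All names below are hypothetical, for illustration only.
for i in range(1, 21):
    subprocess.run(
        ["openstack", "server", "create",
         "--image", "gbs-workshop-cloudbiolinux",
         "--flavor", "m1.medium",
         "--key-name", "workshop-key",
         f"trainee-{i:02d}"],
        check=True,
    )
```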

Workshop trainers demonstrating genotyping-by-sequencing (GBS) methodologies and tools using a custom NeCTAR cloud image.

Feedback collected from participants on the day was overwhelmingly positive. Some user-experience issues were encountered with the remote desktops (NX), attributable to the network between the cloud servers (hosted on the eRSA Node in South Australia) and the participants. Such issues haven’t shown up for BPA workshops utilising the Monash Node up and down the east coast, which demonstrates the importance of being able to reserve local cloud capacity for latency- and jitter-sensitive use-cases like this one. Fortunately those issues were isolated; according to the instructors from Cornell University, the training platform used in the workshop was one of the best they’ve used, and the trainees were keen to attend future GBS-related workshops delivered using the cloud.

The CVL on R@CMon

The Characterisation Virtual Laboratory (CVL) is a powerful platform that integrates Australian imaging facilities with computational and data storage infrastructure, together with sophisticated processing and analysis toolsets. The CVL provides scientists working in various fields with a common analysis and collaboration environment, turning the humble remote desktop into a highly flexible Scientific-Software-as-a-Service delivery platform powered by the NeCTAR Research Cloud.

The CVL Desktop

The current production CVL includes toolsets covering Neuroimaging, Energy Materials and Structural Biology research drivers. The project includes so-called “CVL fabric services”, which provide the necessary infrastructure to modularise popular software toolsets from any number of domains.

The R@CMon team assisted the CVL team in migrating CVL services into R@CMon. The use of persistent storage (Volumes on R@CMon) ensured consistent user home directories and software-stack repositories. The default “CVL Desktop” pool is now serving users with software-rendered CVL environments running on R@CMon. The CVL team is also a beta user of GPU flavours on R@CMon and is currently testing GPU-enabled CVL environments on the “CVL GPU node” pool (via CVL Launcher).

The available pools on the CVL Launcher.

The following video demonstrates a GPU-enabled CVL environment launched on R@CMon. It shows the PyMOL and UCSF Chimera applications from the Structural Biology workbench running and utilising the available GPU. The GPU enables seamless, interactive manipulation of datasets.

The plan is to grow the “CVL GPU node” pool to accommodate more users once GPU capacity on R@CMon has been upgraded with the deployment of R@CMon Phase 2. Watch this space for more CVL on R@CMon news. Other updates about the CVL and its sub-projects are also available on the CVL site.

Bioplatforms Australia – CSIRO Metagenomics Workshop (February 6-7, 10-11 2014)

Bioplatforms Australia and CSIRO conducted an “Introduction to Metagenomics” workshop on February 6-7, 2014 at the University of New South Wales and February 10-11, 2014 at Monash University. The workshop was aimed at bench biologists with little or no experience in bioinformatics, and used publicly available data resources and toolsets.

As with previous Bioplatforms Australia workshops, the metagenomics workshop was delivered using the Monash Node of the NeCTAR Research Cloud infrastructure – R@CMon. Cloud provisioning tools used in previous workshops were reused to provide a seamless virtual desktop training platform.

The R@CMon team worked with Bioplatforms Australia, CSIRO and EMBL-EBI to produce an appropriate cloud image and toolset for the workshop. Among the tools used in the workshop were QIIME, FastQC and InterProScan.
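
As a flavour of the hands-on content, the quality-control step is typically one FastQC run per sample; a sketch driving it from Python (sample file names are hypothetical, and fastqc must be installed and on the PATH):

```python
import subprocess
from pathlib import Path

# Sketch of the QC step: run FastQC over a set of read files, writing its
# HTML/zip reports into qc_reports/. Sample names are hypothetical.
out = Path("qc_reports")
out.mkdir(exist_ok=True)

for sample in ["soil_sample_1.fastq", "soil_sample_2.fastq"]:
    subprocess.run(["fastqc", sample, "-o", str(out)], check=True)
```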

Given the success and popularity of the training, R@CMon has begun working with Bioplatforms Australia and other nodes of the NeCTAR Research Cloud to scale the training environment, supporting trainees as they progress from taking the course, to taking the training environment home with them, to preparing for production genomics facilities.

The trainees had the following to say about the two-day workshop:

“Everyone is very helpful and the content was clear and concise. I thank everyone for getting this program up and running and I would definitely like to return for similar courses.”

“Definitely a fantastic & informative 2-days. I definitely feel that most of what I learned today can be directly applied to the molecular work I am currently engaged in.”

“Good amount of hands on to actually see what happens during the analysis of big data, most of what was taught was very clear and concise. good range of programs were used, only wish there was more time to go through more analyses.”

“I strongly recommend this as a complete course for students or beginners in NGS analyses.  i think the first metagenomic workshop is a winner.”

“Definitely this course has provided me a great basis to look at future data sets.”

“Well ran, good materials, virtual machine made life easy.”

The Visualising Angkor Project on R@CMon

The Visualising Angkor Project (Monash University Faculty of IT, 2013) was one of the main showcases at OzViz 2013, held December 9-10, 2013 at Monash University. Tom Chandler (project leader) and his team from the Faculty of Information Technology used the NeCTAR Research Cloud to generate high-resolution visualisations for the CAVE2™ – the next-generation immersive 2D and 3D virtual reality environment located at New Horizons, Monash University.

The Visualising Angkor Project in the CAVE2 facility.

The Maya/mental ray virtual render farm has been instrumental in producing 27K x 3K panoramic stills and animations for this project – a workflow that had proven very challenging for Tom and his team before they started using the NeCTAR Research Cloud for their rendering jobs.

A panoramic rendering of the fields surrounding Angkor, generated using the NeCTAR Research Cloud.

The resulting high-resolution stills and animations are loaded into the CAVE2™ environment using advanced visualisation software frameworks. This provides a compelling visual and aural environment with a 330° display view – the lens of the 21st century microscope.

Rice fields surrounding Angkor.

In 2014 the R@CMon and CAVE2™ teams will work together to build a CAVE2™ development environment on the NeCTAR Research Cloud. This will give end-users the opportunity to work with and test the tools and middleware available in the CAVE2™ environment on demand, without needing access to the facility itself. This development image will take advantage of R@CMon’s new GPGPU-accelerated VM flavours – more on that soon!

Bioinformatics Training on R@CMon

A multidisciplinary partnership between the Monash eResearch Centre and Bioplatforms Australia has provided a broadly accessible solution for delivering hands-on bioinformatics workshops with seamless access to cloud computing on the new NeCTAR Research Cloud infrastructure.

Running hands-on bioinformatics workshops in Australia has previously been hampered by the lack of specialised bioinformatics training facilities and a paucity of skilled trainers to develop and deliver these courses.

To improve the bioinformatics skills of bench scientists now faced with handling gigabyte-sized datasets generated by next-generation sequencing technologies, Bioplatforms Australia and CSIRO have been collaborating to advance bioinformatics expertise among Australian ‘omics researchers. Through an international partnership with the EMBL European Bioinformatics Institute (EMBL-EBI) in the UK, a cutting-edge three-day Australian hands-on NGS workshop has been created. This course introduces bench scientists to quality control of NGS data, alignment, ChIP-Seq, RNA-Seq and de novo assembly workflows and software.

Professor Paul Bonnington, Director of the Monash eResearch Centre, and the R@CMon team contributed to this training initiative through the development of a cloud-based NGS bioinformatics training platform built on the open-source bioinformatics software package CloudBioLinux. The platform allows sharing of data, tools and applications, and enables trainers anywhere in the world to readily work together to develop and test new workshop material.

The Bioplatforms Australia Next Generation Sequencing workshop platform enables compute-intensive NGS training courses to be easily delivered and accessed widely around Australia, requiring very little local IT expertise or high-end computational hardware. The first hands-on workshops using the NGS workshop platform were held in July 2012 at Monash University in Melbourne and at the University of New South Wales in Sydney. To date, 10 workshops have been delivered to 345 trainees around Australia, in Melbourne, Sydney, Brisbane, Adelaide, Perth and Canberra.

The team of trainers from Bioplatforms Australia and CSIRO, in collaboration with EMBL-EBI and the Monash eResearch Centre, is currently developing a two-day metagenomics workshop using a bespoke metagenomics image built by R@CMon. This course will be run at the University of New South Wales in Sydney on 6-7 February 2014 and at Monash University on 10-11 February 2014.

Monash University hosts one of the nodes of the NeCTAR Research Cloud, a landmark investment that extends the advantages of high-performance computing and high-capacity networks to Australian researchers. This exciting initiative provides on-line access to scalable computational power and data storage, enabling a new realm of data sharing and collaboration.

The Bioplatforms Australia Next Generation Sequencing workshop platform is now freely accessible on the NeCTAR Research Cloud and provides access to hundreds of bioinformatics software packages.

Further information on the Bioplatforms Australia Next Generation Sequencing workshop platform is available from Catherine Shang, Bioplatforms Australia, at cshang@bioplatforms.com. Contact Prof. Paul Bonnington, MeRC Director, for further details and assistance on eResearch solutions.