Author Archives: Blair Bethwaite

Big data mining market segmentation of ANZ Bank EFTPOS data

In Australia, the big four banks receive large volumes of Electronic Funds Transfer at Point of Sale (EFTPOS) transaction data every day, yet this information-rich data is neither stored nor analysed. Because EFTPOS data is both very large and very messy, it is difficult for the banks themselves to gain visibility of the characteristics of the data’s stakeholders.

That changed in 2014, when a researcher in Monash’s Faculty of IT, Dr. Grace Rumantir, approached us for assistance in accessing and building a secure analysis environment for a data mining project on a collection of commercially sensitive EFTPOS data, obtained through an award-winning collaboration with the Australia and New Zealand Banking Group (ANZ). To our knowledge this is the first time market segmentation analyses have been applied to such a large amount of EFTPOS data anywhere in the world.

As a pilot, ANZ collated five months of EFTPOS transaction records, with all customer- and retailer-identifying data redacted. Before this commercial-in-confidence data could be released for research purposes, ANZ produced a comprehensive list of requirements for the secure storage and processing of the data. Securing the release of the data through ANZ’s information security protocols was a lengthy and difficult process, successful in large part because our team could demonstrate, with confidence, how these requirements would be met with the infrastructure we have in place at Monash.

Our team very quickly built a workhorse but appropriately secure environment on R@CMon (on specialist nodes, due to the memory requirements of processing such a large dataset). The R@CMon environment already uses software-defined virtualisation technology, servers are sandboxed, and R@CMon is housed in Monash’s own secure-access facility. All ingress/egress access was locked down to allow only a few known clients (Grace and her research students). Remote desktop software and several data-mining tools of interest were configured for use by the researchers. The data (in daily CSV samples) was stored in an encrypted volume file, which was uploaded to an R@CMon volume attached to the analysis server. Individual passwords were used to unlock and mount the encrypted data, with a strict usage protocol to ensure the data remained locked when not in use.
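
For readers curious about what that unlock/lock protocol might look like in practice, here is a minimal sketch. The post doesn’t name the encryption tooling used, so this assumes a LUKS-formatted volume attached to the analysis server; the device path, mapping name and mount point are all placeholders.

```python
"""Hypothetical helper illustrating the unlock/lock protocol for the
encrypted data volume. The post doesn't name the tooling used, so this
sketch assumes a LUKS-formatted volume attached to the analysis server
at /dev/vdb; the device path, mapping name and mount point are placeholders."""
import subprocess
import sys

DEVICE = "/dev/vdb"          # assumed attachment point of the encrypted volume
MAPPER_NAME = "eftpos_data"  # arbitrary dm-crypt mapping name
MOUNT_POINT = "/mnt/eftpos"  # where the decrypted filesystem appears

def unlock():
    """Prompt for the researcher's passphrase and mount the data read-only."""
    subprocess.run(["cryptsetup", "luksOpen", DEVICE, MAPPER_NAME], check=True)
    subprocess.run(["mount", "-o", "ro", f"/dev/mapper/{MAPPER_NAME}", MOUNT_POINT],
                   check=True)

def lock():
    """Re-lock the data as soon as the analysis session is finished."""
    subprocess.run(["umount", MOUNT_POINT], check=True)
    subprocess.run(["cryptsetup", "luksClose", MAPPER_NAME], check=True)

if __name__ == "__main__":
    lock() if (len(sys.argv) > 1 and sys.argv[1] == "lock") else unlock()
```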

A paper outlining our experience in acquiring, securely storing and processing the EFTPOS data can be found at:

Ashishkumar Singh, Grace Rumantir, Annie South, and Blair Bethwaite, Clustering Experiments on Big Transaction Data for Market Segmentation. In Proceedings of the 2014 International Conference on Big Data Science and Computing (BigDataScience ’14). ACM, New York, NY, USA, Article 16. DOI: http://dx.doi.org/10.1145/2640087.2644161

The market segmentation experiments on the retailers in the EFTPOS data reduce the transaction records to RFM (Recency, Frequency, Monetary) values and then apply clustering analysis. The results indicate distinct combinations of retailer RFM values in the clusters, which could give the bank indications of the different marketing strategies that can be applied to each retailer performance category. This ground-breaking revelation of the existence of retailer segments extracted from EFTPOS data won the Best Paper Award (Industry Track) at the Australasian Data Mining and Analytics Conference 2014.
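
As a rough illustration only (the papers below describe the actual methodology), an RFM reduction followed by clustering can be sketched in a few lines of Python; the column names, CSV filename and the choice of k-means with five clusters are assumptions for the example, not details from the study.

```python
"""Illustrative RFM-plus-clustering sketch (not the pipeline from the
papers cited below). Column names (retailer_id, date, amount), the input
file and the choice of k-means with 5 clusters are assumptions."""
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

transactions = pd.read_csv("eftpos_daily_sample.csv", parse_dates=["date"])
snapshot = transactions["date"].max()

# Reduce raw transactions to one Recency/Frequency/Monetary row per retailer.
rfm = transactions.groupby("retailer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),
    frequency=("date", "count"),
    monetary=("amount", "sum"),
)

# Standardise the three features so no single one dominates the distance metric.
scaled = StandardScaler().fit_transform(rfm)

# Cluster retailers into candidate market segments and summarise each segment.
rfm["segment"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scaled)
print(rfm.groupby("segment")[["recency", "frequency", "monetary"]].mean())
```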

Publication references:

Ashishkumar Singh, Grace Rumantir and Annie South, Market Segmentation of EFTPOS Retailers. In Proceedings of the 12th Australasian Data Mining Conference (AusDM 2014), Brisbane, Australia (http://ausdm14.ausdm.org/program) – Best Paper Award Industry Track

Ashishkumar Singh, Grace Rumantir. Two-tiered Clustering Classification Experiments for Market Segmentation of EFTPOS Retailers. Australasian Journal of Information Systems, [S.l.], v. 19, sep. 2015. ISSN 1449-8618. Available at: <http://journal.acs.org.au/index.php/ajis/article/view/1184>. Date accessed: 18 Oct. 2015. doi:http://dx.doi.org/10.3127/ajis.v19i0.1184.

This exciting result has been cited in financial industry publications as an important example of how academia can help businesses gain insights from their own massive amounts of data to support business decisions.

On the success of this collaborative project, Patrick Maes, ANZ Chief Technology Officer, writes:

“The key here is to find the data scientists who can work with these models, a skill not easy to find nowadays”

(see http://www.itnews.com.au/news/me-bank-hires-data-boss-in-it-exec-restructure-411908 and https://bluenotes.anz.com/posts/2015/03/big-data-from-customer-targeting-to-customer-centric ).

On lessons learnt from this important pilot project, Dr. Grace Rumantir says:

“There is a long standing gap between what research in academia can offer and the needs in the industry. This gap takes the form of mistrust on the part of the people in the industry that academics may not deliver a solution that is relevant to their business on a timely manner. The results of this ground breaking project using EFTPOS data shows that we do understand what business needs and come up with a practical solution that business can directly translate into business strategies which can give them an edge in the competitive business environment.

We are able to do this with our ability to talk in the same wavelength with our industry clients, with our research skills in bleeding edge technology and with the support of the world class research support and infrastructure that Monash has been investing heavily on.”

 

The CVL on R@CMon Phase 2

Monash is home to the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE), a national facility for the imaging and characterisation community. An important and rather novel feature of the MASSIVE compute cluster is the interactive desktop visualisation environment available to assist users in the characterisation process. The MASSIVE desktop environment provided part of the inspiration for the Characterisation Virtual Laboratory (CVL), a NeCTAR VL project combining specialist software visualisation and rendering tools from a variety of disciplines and making them available on and through the NeCTAR research cloud.

The recently released monash-02 zone of the NeCTAR cloud provides enhanced capability to the CVL, bringing a critical mass of GPU-accelerated cloud instances. monash-02 includes ten GPU-capable hypervisors, currently able to provide up to thirty GPU-accelerated instances via direct PCI passthrough. Most of these are NVIDIA GRID K2 GPUs (CUDA compute capability 3.0), though we also have one K1. Special thanks to NVIDIA for providing us with a couple of seed units to get this going and supplement our capacity! After consultation with various users we created the following set of flavors/instance-types for these GPUs:

Flavor name         #vcores   RAM (MB)   /dev/vda (GB)   /dev/vdb (GB)
mon.r2.5.gpu-k2     1         5400       30              N/A
mon.r2.10.gpu-k2    2         10800      30              40
mon.r2.21.gpu-k2    4         21700      30              160
mon.r2.63.gpu-k2    12        65000      30              320
mon.r2.5.gpu-k1     1         5400       30              N/A
mon.r2.10.gpu-k1    2         10800      30              40
mon.r2.21.gpu-k1    4         21700      30              160
mon.r2.63.gpu-k1    12        65000      30              320
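
For those wondering how such flavors are wired up, the sketch below shows how a GPU flavor like mon.r2.10.gpu-k2 could be defined with python-novaclient and tied to a PCI passthrough alias. The alias name “gridK2”, the credentials and the endpoint are placeholders, and the matching PCI whitelist/alias configuration on the hypervisors and scheduler is not shown.

```python
"""Sketch of defining a GPU passthrough flavor like mon.r2.10.gpu-k2 with
python-novaclient. The PCI alias name ("gridK2") and the credentials are
assumptions; the pci_passthrough_whitelist / alias configuration on the
hypervisors and controller is not shown. Legacy-style auth for brevity."""
from novaclient import client

nova = client.Client("2", "admin_user", "admin_password",
                     "admin_project", "https://keystone.example.edu.au:5000/v2.0")

# RAM in MB, root and secondary ephemeral disks in GB, matching the table above.
flavor = nova.flavors.create(name="mon.r2.10.gpu-k2", ram=10800, vcpus=2,
                             disk=30, ephemeral=40, is_public=False)

# Request one GRID K2 device per instance via the (assumed) "gridK2" PCI alias.
flavor.set_keys({"pci_passthrough:alias": "gridK2:1"})
```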

R@CMon has so far dedicated two of these GPU nodes to the CVL, and this is our preferred method for use of this equipment, as the CVL provides a managed environment and queuing system for access (regular plain IaaS usage is available where needed). There were some initial hiccups getting the CVL’s base CentOS 6.6 image working with NVIDIA drivers on these nodes, solved by moving to a newer kernel, and some performance tuning tasks still remain. However, the CVL has now been updated to make use of the new GPU flavors on monash-02, as demonstrated in the following video…

GPU-accelerated Chimera application running on the CVL, showing the structure of human follicle-stimulating hormone (FSH) and its receptor.

If you’re interested in using GPGPUs on the cloud please contact the R@CMon team or Monash eResearch Centre.

Spreadsheet of death

R@CMon, thanks to the Monash eResearch Centre’s long history of establishing “the right hardware for research”, prides itself on providing effective computing, orchestration and storage for research. In this post we highlight an engagement that didn’t deliver the effectiveness we’d like, and how that helped shape elements of the imminent R@CMon phase 2.

In the latter part of 2013 the R@CMon team was approached by a visiting student working at the Water Sensitive Cities CRC. His research project involved parameter estimation for an ill-posed problem in groundwater dynamics. He had set up (and perhaps partially inherited) an Excel-spreadsheet-based Monte Carlo engine for this, with a front-end sheet providing input and output to a built-in VBA macro for the grunt work – an, erm… interesting approach! This had been working acceptably at small scale, as he could get an evaluation done within 24 hours on his desktop machine (quad-core i7). But now he needed to scale up and run 11 different models, probably a few times each to tweak the inputs. Been there yourself? This is a very common pattern!

Nash-Sutcliffe model efficiency (figure courtesy of Eng. Antonello Mancuso, PhD, University of Calabria, Italy)
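
For readers unfamiliar with the metric in the figure, the standard Nash-Sutcliffe efficiency compares the squared error of a simulation against the spread of the observations. The snippet below is the textbook definition, not necessarily the exact objective used in this particular project.

```python
"""Standard Nash-Sutcliffe efficiency: 1 minus the ratio of the simulation's
squared error to the spread of the observations about their mean. NSE = 1 is
a perfect fit; NSE <= 0 means the model is no better than the observed mean.
(Textbook definition only, not necessarily the objective used in this project.)"""
import numpy as np

def nash_sutcliffe(observed, simulated):
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

# Toy usage with made-up groundwater levels.
print(nash_sutcliffe([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```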

MCC (the Monash Campus Cluster), our first destination for ‘compute’, doesn’t have any Windows capability, and even if it did, attempting to run Excel in batch mode would have been something new for us. No problem, we thought: we’d use the RC, give him a few big Windows instances, and he could spread the calculations across them. Not an elegant or automated solution, for sure, but this was a one-off with tight time constraints, so it was more important to start calculations than to get bogged down building something nicer.

It took a few attempts to get Windows working properly. We eventually found the handy Cloudbase Solutions trial image and its guidance documentation. But we also ran into issues activating Windows against the Monash KMS – it turned out we had to explicitly select our local network time source rather than the default time.windows.com. We also found problems with the CPU topology that Nova was giving our guests: Windows was seeing multiple sockets rather than multiple cores, which ruled out desktop variants as they would ignore most of the cores.

Soon enough we had a Server 2012 instance ready for testing. The user RDP’d in and set the cogs turning. Based on the first few Monte Carlo iterations (out of the million he needed for each scenario) he estimated it would take about two days to complete a scenario – quite a lot slower than his desktop, but still acceptable given the overall scale-out speed-up. However, on the third day, after about 60 hours of compute time, he reported it was only 55% complete. That was an unsustainable pace – he needed results within a fortnight – and so he and his supervisor resolved to code up and use a different statistical approach (using PEST) that would be more amenable to cluster computing.

We did some rudimentary performance investigation during the engagement and didn’t find any obvious bottlenecks – the guest and host were always very CPU-busy – so it seemed largely attributable to the lesser floating point capabilities of our AMD Bulldozer CPUs. We didn’t investigate deeply in this case, and no doubt other elements could be at play (maybe Windows is much slower for compute on KVM than Linux), but this is now a pattern we’ve seen with floating-point-heavy workloads across operating systems and on bare metal. Perhaps code optimisations for the shared FPU in the Bulldozer architecture could improve things, but that’s hardly a realistic option for a spreadsheet.

The AMDs are great (especially thanks to their price) for general-purpose cloud usage – that’s why the RC makeup is dominated by them and why commercial clouds like Azure use them. But for R@CMon’s phase 2 we want to cater to performance-sensitive as well as throughput-oriented workloads, which is why we’ve deployed Intel CPUs for this expansion. Monash joins the eRSA and NCI Nodes in offering this high-end capability. More on the composition of R@CMon phase 2 in the coming weeks!

Ceph Enterprise – a disruptive period in the storage marketplace

Recently R@CMon signed an agreement for Inktank Ceph Enterprise (aka ICE). Inktank is an open-source development and professional services company that spun out of DreamHost two years ago. And boy, have they been busy since then – not only building an international open-source community around Ceph, but also solidifying the product through several major releases while exploring enterprise support business models. It’s both the innovation rate and the professional maturity that drew us to Ceph. The next release of ICE (due this month) includes support for erasure coding (think distributed RAID) and cache tiering (think SSD performance at near spinning-disk cost)!

The R@CMon–Ceph journey began in September last year, when we introduced block storage based on Ceph to the NeCTAR Research Cloud. Others have been paying attention too, and several other NeCTAR Nodes (NCI, TPAC and QCIF so far) are now using Ceph to meet their block storage needs. Monash now has just shy of 400TB raw capacity co-located with the monash-01 compute zone/cell and another 500TB coming with the monash-02 deployment. With other developments in the pipeline we expect to have well over 1PB of storage managed by Ceph by the end of the year!

One of the things that came with the release of ICE was a closed-source management tool, Calamari, which provides the sort of graphical status, monitoring and configuration dashboard you might expect from “enterprise” software (how that term makes us cringe!). Calamari is starting to look pretty slick and is certainly one piece of the puzzle in transitioning a storage solution, adopted and built in the eResearch Healthy Hot-House environment, to a robust service delivery practice – getting woken at 2am if a storage node dies is not in my contract!

Calamari OSD Workbench (the Ceph Calamari GUI)

Not long after we’d signed up for ICE, Inktank was acquired by Red Hat for a cool $175 million. The announcement came as a surprise – Red Hat already has an iron in the software-defined storage fire with Red Hat Storage Server (GlusterFS wearing a fedora). But once digested, this looks a very astute move by Red Hat, as Ceph has been the sweetheart of storage for OpenStack deployers, thanks largely to its ability to converge object, block and filesystem storage – something that took the NAS-oriented Gluster a while to catch up on. It seems Red Hat now has a firm grip on software-defined storage!

Red Hat’s acquisition of Inktank is good news for us as it will ultimately mean a supported version of Ceph on an enterprise Linux distribution we have a broad skills base in, with local technical support personnel to back it all up, and in a much shorter timeframe than Inktank could have delivered on its own. And true to their upstream-first philosophy, Red Hat has already open-sourced Calamari.

It’s an exciting and disruptive period in the storage market!

Cloud to your taste?!

Executive Summary


Now, after two years of the NeCTAR Research Cloud (herein “RC”) and 500 allocated projects (excluding trials), we propose an evolution of the RC flavor offerings, to:

  • introduce new RC-wide classes of compute and memory optimised flavors tailored more specifically to particular workloads;
  • have standard/base flavors more aligned with the commercial cloud alternatives;
  • give the RC Nodes greater flexibility in establishing utilisation-improvement strategies.

We believe this approach will provide immediate utilisation gains without memory overcommitment, whilst still providing scope for Nodes to pursue further overcommitment/consolidation as may be appropriate for their workloads and architecture. The remainder of this document details the background and proposes an RC-wide implementation strategy and some example Node deployment scenarios. Skip to Core recommendations below for the proposal.

Overview

As the RC continued to gain popularity and new users over the course of 2013 we got our first taste of what happens once cloud Nodes (capital ‘N’ indicating a zone) get full. The operational Nodes have also had a chance to take stock of the new paradigm and look for opportunities to improve ROI and outcomes for their users. We are now at a juncture where it is appropriate to revisit and refine the basic RC menu – the virtual machine hardware footprints (commonly known as flavors or instance-types) which make up the primary resource consumption patterns of the RC.

Under this capacity pressure, Nodes have discussed, and some have implemented, resource overcommitment (i.e., allocating more virtual machine memory and/or CPU cores than the physical host actually has). This is possible thanks to various kernel and hypervisor technologies, and valuable because instances generally do not use all their assigned memory or CPU resources at any one time. Whilst a notional level of overcommitment is likely to be safe and even seems to be quite common in private cloud deployments, the technique is a blunt tool with many caveats, including: non-standardised adoption in the RC; general unpopularity amongst the end-user community; and, perhaps most significantly, an element of inherent risk, because unmitigated memory pressure can result in OOM-killed instances or crashed systems. A complementary strategy, requiring less overcommitment to achieve the same utilisation, is to fit flavors to workloads. This will (hopefully!) result in greater workload consolidation and empower users to select the best virtual hardware to match their application/s.

The general idea of this proposal is to further develop the RC flavors to provide variety of configuration and QoS. We assume CPU overcommitment for general-purpose/standard flavors; this is standard practice in virtualised environments, and experience to date shows ample headroom for doing so in the RC context. Importantly, we propose to use recent middleware features (cgroups CPU share quotas) to limit CPU contention and guarantee relative CPU QoS for different flavors, whilst also specifying a class of compute flavors for which no overcommitment is recommended.
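
As a concrete (if simplified) illustration of the cgroups CPU share mechanism, the InstanceResourceQuota flavor extra specs can carry the share value. The credentials, endpoint and the specific numbers below are placeholders, and the legacy-style novaclient auth is used only for brevity.

```python
"""Sketch of attaching a cgroups CPU share quota to a flavor via the
InstanceResourceQuota extra specs. Credentials, endpoint and share values
are placeholders; quota:cpu_shares only takes effect under CPU contention."""
from novaclient import client

nova = client.Client("2", "admin_user", "admin_password",
                     "admin_project", "https://keystone.example.edu.au:5000/v2.0")

# A standard flavor gets a fractional share; a compute flavor would get
# 2048 per vCPU so it holds roughly a full core under contention.
m2_small = nova.flavors.find(name="m2.small")
m2_small.set_keys({"quota:cpu_shares": "1024"})
```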

Motivations

  • High demand and slow capacity ramp up.
  • Improve efficiency.
  • Current flavors do not offer any variety of CPU-to-RAM ratio and so don’t cater specifically to the many and varied workloads of the RC; this translates to missed opportunities for consolidation and potential for higher utilisation.
  • Current flavors do not align with commercial offerings – other on-shore IaaS providers (AWS, Microsoft Azure, Rackspace) have already done the work of determining good typical offerings. We should leverage this and provide standard flavors more comparable to the rest of the market whilst also noting that particular research computing requirements can be met with other flavor classes.
  • Operational experience with the existing flavors shows significant opportunity for improvement of hardware utilisation (typically loads of free CPU and plenty of memory even when a host is “full” in terms of vCPUs allocated to it).
  • Provide wider scope for HTC workloads to “use up” idle CPU capacity.
  • The automatically allocated ‘pt-*’ projects have been a very popular and extremely useful tool in on-boarding users; however, as a no-barrier free offering they are very generous and there is no incentive for users to release the resources before the project ends. The ultimate goal of providing a simple method to test-drive the RC can be achieved with a lighter footprint whilst giving the user more flexibility.
  • Since the RC’s inception, new features in successive releases of OpenStack have added capabilities which can help us achieve and manage a diversified offering.

What to do?

Considerations

We propose revamping the RC flavors/instance-types based on:

  • The typical nova-compute hardware footprint (1 “core” / 4 GB RAM) and the range of hardware already in the wild and ordered, from 2P x 12-“core” AMD to 4P x 16-“core” Intel. Related to this, see the box-packing problem in Other considerations below.
  • The standard offerings of commercial providers operating on-shore (competitors / partners / burst-locations).
  • A desire to cater more directly to a variety of use-cases with differing compute/memory requirements, e.g., web servers/services, data services (presentation, transformation, capture), remote desktops (e.g., training or SaaS), data-mining, high-throughput compute, loosely coupled high-performance compute.
  • A desire to improve hardware utilisation, allowing a greater volume and variety of workloads to co-exist on the RC, and the assumption that flavor variety in and of itself will translate to improved utilisation without the temptation for aggressive resource overcommitment.
  • The assumption that reliability and consistency are intricately linked, and both should trump overall hardware utilisation.
  • Given limited resource-quotas users will tend to fit their workloads into the best perceived flavor match, with the incentive that it leaves quota available for them to start new instances.
  • Nodes must be able to flexibly define their own specific private flavors.
  • Any attempt to standardise CPU and/or IO performance for the various flavors across Nodes is considered out-of-scope due to different hardware and storage architectures already deployed.
  • Nodes must have flexibility in how they choose to implement these recommendations based on the specific needs of their local user communities and their operational ability/agility.

Core recommendations

  1. Creation of three new RC-wide flavor classes: standard (“m2”, following on from “m1”), compute (“c2”), memory (“r2”, r for RAM). The compute and memory flavors are referred to as optimised instances. Class is defined purely as the naming prefix from the user’s perspective (whether they differ with respect to overcommit or resource share in the back-end is the discretion of the Node).
    • The compute class is used to calibrate CPU share values (as per https://wiki.openstack.org/wiki/InstanceResourceQuota) in order to maintain an approximate 1-to-1 vCPU to core/thread/module ratio for compute workloads. Flavors with CPU shares less than 2048 represent a fractional share, noting that share values only come into effect with CPU contention, so such flavors can still get all clock cycles when they are available.
    • compute and memory flavors are offset +/-33% from the current 1 core / 4GB RAM.
  2. Adjust the standard root ephemeral (vda) size to be more accommodating of larger snapshots and OSes like Windows: 30 GB vda as standard, with the secondary ephemeral (vdb) adjusted down or removed completely to compensate.
  3. Proposed new flavors:

    Flavors 2.0

    Class      Name        #vCores   RAM size (GB)   primary/vda disk (GB)   secondary/vdb (min) disk (GB)   cpu.shares
    standard   m2.tiny     1         768 MB          5                       N/A                             256
    standard   m2.xsmall   1         2               10                      N/A                             512
    standard   m2.small    1         4               30                      N/A                             1024
    standard   m1.small    1         4               10                      30                              1024
    standard   m2.medium   2         6               30                      N/A                             2048
    standard   m1.medium   2         8               10                      60                              2048
    standard   m2.large    4         12              30                      80                              4096
    standard   m1.large    4         16              10                      120                             4096
    standard   m1.xlarge   8         32              10                      240                             8192
    standard   m2.huge     12        48              30                      360                             12288
    standard   m1.xxlarge  16        64              10                      480                             16384
    optimised  c2.1        1         2.6             30                      N/A                             2048
    optimised  c2.2        2         5.2             30                      N/A                             4096
    optimised  c2.8        8         20.8            30                      160                             16384
    optimised  c2.16       16        41.6            30                      320                             32768
    optimised  r2.5        1         5.3             30                      N/A                             2048
    optimised  r2.10       2         10.6            30                      40                              4096
    optimised  r2.21       4         21.2            30                      160                             8192
    optimised  r2.63       12        63.5            30                      320                             24576
  4. Standard flavors are available to all, whilst compute and memory flavors will be made available to projects that justify a technical requirement for them. Access to these optimised flavor classes would then be granted in the normal project creation/amendment workflow.
  5. Refinements to the allocation review process that involve technical appraisal of a project’s need for optimised flavors. This also provides scope for the target Node (if any) to evaluate the request relative to availability of optimised resources.
  6. pt’s (project trial accounts) have a 4 GB RAM and 4 vCPU quota limit, limiting them to the m2.tiny, m2.xsmall, m2.small, m2.medium and m1.small flavors; they can, however, now run up to four m2.tiny instances.
  7. Nodes may choose to overcommit CPU and/or memory resources underpinning these proposed flavor classes, however we do not recommend overcommitment of optimised flavors until operational experience is gained with the new flavor classes. It should be noted that we expect significant utilisation improvements with the new standard flavors even without overcommitment. Additionally, Nodes will need to carefully consider their public IPv4 address capacity.
  8. Nodes use nova host-aggregates for flavor classes and each individual flavor, allowing class-based and fine-grained per flavor control of instance-types that can be scheduled to any particular nova-compute hypervisor/node.
  9. Nodes overcommitting the standard flavor class use nova per-aggregate core and memory allocation ratio settings (see the relevant blueprint and filter scheduler docs for details). This will allow, e.g., nova-compute nodes in the same Cell to be dedicated to overcommitted standard flavors whilst other nova-compute nodes service optimised flavors. By using host-aggregates in this manner the distribution of compute nodes can be changed dynamically; this is not possible with per-Cell allocation ratios (unless the node is “drained”, removed from production, and then reconfigured). A minimal sketch of this aggregate-and-flavor setup follows this list.
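
As referenced in recommendations 8 and 9, here is a minimal sketch of the aggregate-and-flavor plumbing using python-novaclient. The host names, ratios, credentials and the “flavor_class” metadata key are assumptions, and the AggregateRamFilter, AggregateCoreFilter and AggregateInstanceExtraSpecsFilter scheduler filters must be enabled for this metadata to have any effect.

```python
"""Sketch of the aggregate-based layout from recommendations 8 and 9: an
overcommitted aggregate for standard flavors and a non-overcommitted one
for optimised flavors. Host names, ratios and the "flavor_class" key are
assumptions; the Aggregate* scheduler filters must be enabled."""
from novaclient import client

nova = client.Client("2", "admin_user", "admin_password",
                     "admin_project", "https://keystone.example.edu.au:5000/v2.0")

# Standard flavors: overcommit RAM 1.5x and CPU 4x on these hosts (example ratios).
standard = nova.aggregates.create("standard-flavors", None)
nova.aggregates.set_metadata(standard, {"flavor_class": "standard",
                                        "ram_allocation_ratio": "1.5",
                                        "cpu_allocation_ratio": "4.0"})
nova.aggregates.add_host(standard, "compute-node-01")

# Optimised (compute/memory) flavors: no overcommitment on these hosts.
optimised = nova.aggregates.create("optimised-flavors", None)
nova.aggregates.set_metadata(optimised, {"flavor_class": "optimised",
                                         "ram_allocation_ratio": "1.0",
                                         "cpu_allocation_ratio": "1.0"})
nova.aggregates.add_host(optimised, "compute-node-02")

# Pin each flavor class to its aggregate via scoped extra specs.
nova.flavors.find(name="m2.small").set_keys(
    {"aggregate_instance_extra_specs:flavor_class": "standard"})
nova.flavors.find(name="r2.10").set_keys(
    {"aggregate_instance_extra_specs:flavor_class": "optimised"})
```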

Other considerations/notes

  • The box-packing problem – unbalanced allocation of CPU and memory resources on a hypervisor, i.e., all CPUs allocated but memory available and vice versa (this was one of the primary drivers for a fixed ratio of resources in the “m1” flavors).
    • We assume higher CPU utilisation is the priority in terms of consolidation as memory is cheaper and there is always available HTC workload that can utilise spare CPU.
    • Because we assume CPU overcommitment for the standard flavors and have introduced additional lower memory footprint flavors, this problem is diminished somewhat in the standard flavors.
    • Pathological scheduling with the optimised flavors could result in a hypervisor being entirely filled by “r2” instances, leaving 33% of CPU cores spare; however, the average case (i.e., assuming an even distribution of compute and memory instances) would see no change.
  • Specialised capability flavors (e.g., GPU, very high-memory, high-IOPS) should be standardised where possible. A potential example is GPU flavors: there are only a couple of GPU types likely to be deployed across the RC, and these options could be combined with the memory and compute optimised flavor configurations to produce new ‘mg2’ and ‘cg2’ flavor classes.
  • Public IPv4 address capacity. One of the obvious issues when considering fitting more instances into the existing hardware footprint is whether we have enough IPv4 addresses to match our hardware capacity, and this is one reason resource overcommitment should be very carefully managed. With the existing flavor menu the average instance is 2.5 cores, so that’s 4 public IPv4 addresses needed for every 10 cores; there’s good reason to believe this gap will shrink if these recommendations are implemented. (At Monash we’re making provisions to “borrow” another /21 until we can deploy SDN.)
  • The box-“filling” problem, particularly when no spaces remain for large-memory VMs. A possible solution is to implement a new scheduler filter that hard-limits the number of instances of any particular flavor running on a compute node, so as to guarantee a minimum capacity of certain flavors (a sketch of the filtering logic follows this list).
  • Instigate a process whereby the utilisation of compute and memory instances is monitored on a per-project basis; projects determined not to be making effective use of the optimised capabilities may lose access to those flavor classes.
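
The per-flavor hard-limit idea mentioned above does not exist as a stock scheduler filter. The toy sketch below shows only the decision logic, written as a plain function rather than against nova’s actual filter plugin interface, with placeholder flavor names and caps.

```python
"""Toy illustration of the per-flavor hard-limit idea: decision logic only,
not nova's real scheduler-filter plugin interface. Caps and flavor names
are placeholders (e.g. always keep room for at least one r2.63 per node)."""

# Maximum number of instances of each flavor allowed on one compute node.
PER_FLAVOR_CAPS = {"r2.63": 1, "r2.21": 2}

def host_passes(current_counts: dict, requested_flavor: str) -> bool:
    """Return True if the host may accept one more instance of the flavor."""
    cap = PER_FLAVOR_CAPS.get(requested_flavor)
    if cap is None:
        return True  # uncapped flavors are always allowed
    return current_counts.get(requested_flavor, 0) < cap

# Example: a node already running one r2.63 cannot take another,
# but can still take an m2.small.
node = {"r2.63": 1, "m2.small": 7}
print(host_passes(node, "r2.63"))     # False
print(host_passes(node, "m2.small"))  # True
```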

Node Deployment Examples and Trade Offs

These are illustrative scenarios designed to show the possibilities afforded by adoption of these recommendations, many of these scenarios could be mixed and matched within a Node’s deployment in order to achieve the Node’s goals.

  1. All flavors can run on any compute node. (At a certain aggregate usage level no compute nodes would be able to launch any new larger-memory instances, despite there being plenty of aggregate free memory capacity. This can be somewhat mitigated by use of a fill-first scheduling approach.)
  2. Standard flavors separate from optimised (compute and memory) flavors, whilst the latter two coexist on the same hardware. Standard flavors overcommitted, optimised flavors not overcommitted.
  3. As for #2, standard flavors separate from optimised (compute and memory) flavors. Standard flavors on AMD based compute nodes, compute and memory optimised flavors on Intel based compute nodes.
  4. Separate compute nodes for, e.g., all flavors with >30GB RAM (independent of flavor class), whilst remaining compute nodes can run any flavor. This provides a means to guarantee a certain capacity of high-memory instances.
  5. Mix of #3 and #4.

Discussion

None yet – will update with notable comments.

Best Practices for Overcommit

The following is intended to seed a set of living suggestions/guidelines to help Nodes implement resource overcommitment in the safest possible manner, so as not to compromise the stability of user services (it must be remembered that positive sentiment is hard to win and easy to lose; people are much more likely to share poor experiences widely than they are to rave about what great infrastructure we’re providing them with).

  • If you have local direct-attached storage for your nova-compute ephemeral disks then use the io_ops scheduler filter to ensure that instance spawning does not overwhelm your host’s disks. Also set upper disk IOPS and throughput limits on the various flavor classes so that noisy-neighbour effects are limited (a sketch of flavor-level IO limits follows this list).
  • Make sure you have enough swap space (plus a generous amount extra for migrations) on your nova-compute nodes to hold overflow memory pages of instances if they all suddenly started trying to use their full RAM footprint (we really don’t want to see the OOM Killer).
  • Make sure your swap device/file is in some way fenced from the rest of your disk IO concerns so that hypervisor-level swapping does not contend with instance IO, e.g., separate swap device or file on the root block devices.
  • Use cgroups to reserve memory for the host.
  • Monitor and test relevant new Kernel features such as zswap.
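
To illustrate the flavor-level IO limits mentioned in the first point above, the InstanceResourceQuota extra specs can cap per-instance disk throughput. The flavor names, credentials and numeric limits below are placeholders to be tuned per deployment.

```python
"""Sketch of capping disk IO per flavor so overcommitted neighbours cannot
swamp local ephemeral disks. Flavor names, credentials and limits are
placeholders; legacy-style novaclient auth shown for brevity."""
from novaclient import client

nova = client.Client("2", "admin_user", "admin_password",
                     "admin_project", "https://keystone.example.edu.au:5000/v2.0")

for name in ("m2.small", "m2.medium", "m2.large"):
    flavor = nova.flavors.find(name=name)
    flavor.set_keys({
        "quota:disk_total_iops_sec": "500",                     # cap IOPS per instance
        "quota:disk_total_bytes_sec": str(100 * 1024 * 1024),   # ~100 MB/s per instance
    })
```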

ATAQ (Answers To Anticipated Questions)

Q: Why not mix overcommitted and non-overcommitted instances on the same compute node, e.g., have host-aggregates with different ram_allocation_ratios assigned to a single compute node?

A: Whilst we can assign multiple aggregates, the Aggregate*Filter scheduler filters will pick and use the smallest *_allocation_ratio from all relevant aggregates the node is a member of. There’s a good reason for this – it just doesn’t make sense to do anything else. Because the *_allocation_ratio settings are a soft accounting metric, i.e., they only affect the scheduling decision of whether a node can support a new instance, they have no effect on the absolute or relative share of resources each instance has on any particular node. E.g., if the scheduler applied different ram_allocation_ratios on a single node based on the flavor of instances launched, the effective ram_allocation_ratio would end up as something of a weighted average, and all instances on the node would be equally affected by RAM contention.

Q: What’s with the names / why don’t we copy AWS names?

A: There appears to be no good naming convention to mimic (probably due to English lacking specific and unambiguous words to describe size in various dimensions). AWS have made a hash of their own convention as their flavor offerings have proliferated; it’s now virtually impossible to even guess what kind of instance you’re looking at, e.g., cr1.8xlarge is apparently a memory optimised instance – go figure! Rackspace simply use the memory size as the name, which is not a bad idea but breaks if you want different vCPU counts with the same memory size. Azure uses A1 – A7, which is possibly the single most intuitive aspect of Azure since it was released (clearly marketing wasn’t involved with that ;-). We’ve made suggestions which continue the tradition of somewhat ambiguous descriptors for the standard flavors but give more precise names to the compute and memory flavors, which hopefully speak for themselves. A previous draft used Aussie beer sizes for the standard flavors: pony, small, pot, halfpint, pint, jug, partykeg…

Q: Why the strange/not-round memory sizes for the optimised flavors versus the standard?

A: To account for virtualisation overheads on the host (page tables, block cache, etc.). Take a browse through the AWS and Azure offerings and you’ll see they do this too. The reason for the discrepancy is that we do not expect much, if any, overcommitment with the optimised flavors, whereas we believe conservative overcommitment will be commonplace with the standard flavors (so there’s little reason to adjust for the virt overheads of these).
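
For the curious, here is a back-of-the-envelope check of how the optimised RAM sizes in the table above relate to the ±33% offsets; the exact headroom shaved off each flavor is read off the table rather than derived from a published formula.

```python
"""Back-of-the-envelope check of the optimised flavor RAM sizes: nominal
ratios of 4 GB/core ±33% (≈2.67 and ≈5.33 GB/core), then trimmed slightly
to leave host-side headroom for page tables, block cache, etc. The trim
applied per flavor in the table above varies a little."""
STANDARD_GB_PER_CORE = 4.0
compute_ratio = STANDARD_GB_PER_CORE * (1 - 1/3)   # ~2.67 GB per vCPU
memory_ratio = STANDARD_GB_PER_CORE * (1 + 1/3)    # ~5.33 GB per vCPU

for name, vcpus, ratio in [("c2.1", 1, compute_ratio), ("c2.16", 16, compute_ratio),
                           ("r2.5", 1, memory_ratio), ("r2.63", 12, memory_ratio)]:
    print(f"{name}: nominal {vcpus * ratio:.1f} GB before virtualisation headroom")
```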

Parking Lot

Ideas / pipe-dreams go here (mostly requiring dev and ops work)…

  • Introduce a spot-instance flavor class and modify the filter scheduler to introduce a retry-style filter that iteratively removes running spot-instance metrics (in batches of some configurable spot-instance scheduling time-slice) from the scheduling meter data until running spot-instances are entirely ignored by the scheduler (thus logically preempted). The final scheduling outcome may then include a linked action to terminate some spot-instance/s on the chosen compute node.
  • Extend the existing Nova resource quota feature to set memory-cgroup parameters. Judicious use of the limit_in_bytes and soft_limit_in_bytes controls (particularly the latter) could be used to mix standard and memory optimised instances on the same compute node, e.g., a standard instance launched in a 1.5 ram_allocation_ratio aggregate would have a corresponding memory soft_limit = virtual_ram_size / 1.5, whereas a memory instance in a 1.0 ram_allocation_ratio aggregate would have memory soft_limit = virtual_ram_size.
  • Use containers (e.g., lxc) to fence/partition relevant resources on the compute nodes such that overcommit can be implemented consistently amongst classes of flavor. This might mean splitting a physical compute node into multiple compute containers, each acting as a nova-compute node.
  • Implement a new scheduler filter that will hard-limit the number of instances of any particular flavor running on a compute node – a possible work-around for the box-packing problem.