Tuesday, September 21, 2010

A run on your cloud?

When I use a bank, I'm fully aware that the statement I receive is just a set of digits recording an agreement about how much money I have or owe. In the case of savings, this doesn't mean the bank has my money in a vault somewhere; in all likelihood it's been lent out or used elsewhere. The system works because a certain reserve is kept to cover financial transactions, on the assumption that most of my money will stay put.

Of course, as soon as large numbers of people try to get their money out, it causes a run on the bank and we discover just how small the reserves really are. Fortunately, in the UK we have the FSCS to guarantee a minimum amount that will be returned.

So, what's this got to do with cloud? Well, cloud (as with banking) works on a utility model, though in banking we can both pay for what we consume and be paid for what we provide (i.e. interest), whereas in the cloud world we normally only have the option to consume.

In the case of infrastructure service providers, there are no standard units (i.e. there is no common cloud currency); instead, each provider offers its own range of units. Hence, if I rent a thousand compute resource units, those units are defined by that provider as offering a certain amount of storage and CPU at a given level of quality and a specified rate (often an hourly fee).

As with any utility, there is no guarantee that when I want more, the provider is willing to offer it or has the capacity to do so. This is why claims of infinite availability are no more than an illusion.

However, hidden in the depths of this is a problem with transparency which could cause a run on your cloud in much the same way that Credit Default Swaps hit many financial institutions as debt exceeded our capacity to service it.

When I rent a compute resource unit from a provider, I'm working on the assumption that what I'm getting is that compute resource unit and not some part of it. For example, if I'm renting, on an hourly basis, a 1 GHz core with 100 GB of storage and 2 GB of memory, I'm expecting exactly that.

However, I might not use the whole of this compute resource. This offers the service provider, if so inclined, an opportunity to sell the excess to another user. In this way, a service provider running on a utility basis could be actively selling 200 of its self-defined compute units to customers whilst only having the capacity to provide 100 of those units when fully used. This is quaintly given terms like improving utilisation, overbooking or oversubscription, but fundamentally it's all about maximising service provider margin.
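
To make the arithmetic concrete, here's a toy sketch (in Python, with entirely made-up numbers; no real provider publishes its physical capacity or oversubscription ratio) of how an overbooked provider stays afloat while average utilisation remains low:

```python
# Hypothetical overbooking arithmetic - all figures are illustrative assumptions.
physical_capacity = 100    # compute units the provider can actually deliver
units_sold = 200           # compute units sold to customers
average_utilisation = 0.4  # assumed fraction of each unit customers typically use

expected_demand = units_sold * average_utilisation
print(f"oversubscription ratio: {units_sold / physical_capacity:.1f}x")
print(f"expected demand: {expected_demand:.0f} of {physical_capacity} units")
print(f"headroom: {physical_capacity - expected_demand:.0f} units")
```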

The problem occurs when everyone tries to use their compute resources fully with an overbooked provider, just like everyone trying to get their money out of a bank. The provider is unable to meet its obligations and partially collapses. The likely effect is compute units performing vastly below their specification, or some units which have been sold being thrown off the service to make up for the shortfall (i.e. customers are bumped).
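
Extending the same made-up numbers, the sketch below shows how quickly a shortfall appears once average utilisation climbs towards 100% (again, purely illustrative):

```python
# The cloud equivalent of a bank run, using the same hypothetical numbers.
physical_capacity = 100
units_sold = 200

def shortfall(utilisation: float) -> float:
    """Units of demand the provider cannot serve at a given average utilisation."""
    demand = units_sold * utilisation
    return max(0.0, demand - physical_capacity)

for u in (0.4, 0.5, 0.6, 0.8, 1.0):
    missing = shortfall(u)
    state = "fine" if missing == 0 else "degrade or bump customers"
    print(f"utilisation {u:.0%}: shortfall {missing:.0f} units ({state})")
```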

It's worth remembering that a key part of cloud computing is a componentisation effect which is likely to lead to massively increased use of computing infrastructure in ever more ephemeral forms, and as a result our dependency on this commodity provision will increase. It's also worth remembering that black swan events, like bank runs, do occur.

If one overbooked provider collapses, then this is likely to create increased strain on other providers as users seek alternative sources of compute resource. The unexpected demand from such an event might lead to a temporary condition where some providers are unable to hand out additional capacity (i.e. new compute units) - the banking equivalent of closing the doors, or localised brown-outs in the electricity industry.

However, people being people will tend to maximise the use of what they already have. Hence, if I'm renting 100 units with one provider which is collapsing and 100 units with another which isn't, and many providers are temporarily closing their doors, then I'll tend to double up the workload where possible on my fully working 100 units (i.e. where I believe I have spare capacity).

Unfortunately, I won't be the only one doing this and if that provider has overbooked then it'll collapse to some degree. The net effect is a potential cascade failure.
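
A toy simulation of that cascade (assumed numbers throughout: four providers, each overbooked roughly 2:1, with displaced demand simply redistributed across the survivors) might look like this:

```python
# Toy cascade failure across overbooked providers - all figures are assumptions.
providers = [{"capacity": 100, "demand": 90} for _ in range(4)]

def step(providers):
    """One round: providers whose demand exceeds capacity fail, and their
    demand is redistributed across the surviving providers."""
    failed = [p for p in providers if p["demand"] > p["capacity"]]
    survivors = [p for p in providers if p["demand"] <= p["capacity"]]
    if failed and survivors:
        displaced = sum(p["demand"] for p in failed)
        for p in survivors:
            p["demand"] += displaced / len(survivors)
    return survivors, failed

# Trigger: the first provider's customers suddenly use everything they bought.
providers[0]["demand"] = 180

round_no = 0
while True:
    round_no += 1
    providers, failed = step(providers)
    print(f"round {round_no}: {len(failed)} failed, {len(providers)} still standing")
    if not failed or not providers:
        break
```

In this sketch, one provider's failure takes the rest down within a couple of rounds; the exact numbers don't matter, the point is that the failure mode is a property of overselling rather than of the triggering event.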

Now, this failure would not be the result of poor utility planning but instead of overbooking and hence overselling capacity which does not exist, in much the same way that debt was sold beyond our capacity to service it. The providers have no way of predicting black swan events, nor can they estimate the uncertainty in user consumption (users, however, are better placed to predict their own likely demands).

There are several solutions to this; however, all require clear transparency on the level of overbooking. In the case of Amazon, Werner Vogels has made a clear statement that they don't overbook and sell your unused capacity, i.e. you get exactly what you paid for.

Rackspace also states that it offers guaranteed and reserved levels of CPU, RAM and storage with no oversubscription (i.e. overbooking).

In the case of VMware's vCloud Director, according to James Watters, they provide a mechanism for buying a hard reservation from a provider (i.e. a defined unit), with any overcommitment being done by the user and under their control.

When it comes to choosing an infrastructure cloud provider, I can only recommend that you first ask them what units of compute resource they sell. Then ask whether you actually get that unit, or merely the capacity for one depending upon what others are doing. In short, does a compute unit of a 1 GHz core with 100 GB of storage and 2 GB of memory actually mean that, or could it mean a lot less?

It's worth knowing exactly what you're getting for your buck.