Monday, April 12, 2010

Use cloud and get rid of your sysadmin.

Following on from my Cloud Computing Myths post.

The principal argument behind cloud getting rid of sysadmins is one of "pre-cloud a sysadmin can manage a few hundred machines, in the cloud era with automation a sysadmin can manage tens of thousands of virtual machines". In short, since sysadmins will be able to manage two orders of magnitude more virtual machines, we will need fewer of them.

Let's first be clear about what automation means. At the infrastructure layer of the computing stack there is a range of systems, commonly known as orchestration tools, which allow for basic management of a cloud, automatic deployment of virtual infrastructure, configuration management, self-healing, monitoring, auto-scaling and so forth. These tools take advantage of the fact that in the cloud era, infrastructure is code and is created, modified and destroyed through APIs.
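To make "infrastructure is code" concrete, here is a minimal sketch using the Ruby fog library (one client among many; the provider, credentials and image id below are placeholders rather than anything from a real account) in which a virtual machine is created and destroyed purely through API calls:

require 'fog'

# connect to a compute provider through its API (placeholder credentials)
compute = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => 'YOUR_KEY',
  :aws_secret_access_key => 'YOUR_SECRET'
)

# create a virtual machine from a standard image
server = compute.servers.create(:image_id => 'ami-xxxxx', :flavor_id => 'm1.small')
server.wait_for { ready? }

# ... do some work ...

# then destroy it just as easily when it's no longer needed
server.destroy

Because these calls are just code, they can be version controlled, repeated and composed, which is precisely what the orchestration tools build upon.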

Rather than attempting to create specialised infrastructure, the cloud world takes advantage of a bountiful supply of virtual machines provided as standardised components. Hence scaling is achieved not through the provision of an ever more powerful machine but through the deployment of vastly more standardised virtual machines.

Furthermore, the concept of a machine also changes. We're moving away from the idea of a virtual machine image for this or that, towards one of a basic machine image plus all the run time information you require to configure it. The same base image will become a wiki, a web server or part of an n-tier system.
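As a hedged sketch of that idea (the instance_request helper, the role names and the use of user-data are purely illustrative and not any particular tool's API), the same base image is requested three times and only the run time configuration passed to it differs:

require 'json'

BASE_IMAGE = 'ami-xxxxx'  # one standard base image for everything

# illustrative helper: describe a request to boot the base image, handing the
# node its role as run time information (e.g. user-data consumed by a
# configuration management tool on first boot)
def instance_request(role)
  {
    :image_id  => BASE_IMAGE,
    :user_data => { :run_list => [role] }.to_json
  }
end

# the same base image becomes a wiki, a web server or part of an n-tier system
wiki = instance_request('role[wiki]')
web  = instance_request('role[web_server]')
db   = instance_request('role[database]')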

All of these capabilities allow for more ephemeral infrastructure, rapidly deployed, changed and destroyed according to need. This creates a range of management problems and hence we have the growth of interest in orchestration tools. These tools vary from specifically focused components to more general solutions and include Chef, ControlTier, CohesiveFT, Capistrano, RightScale, Scalr and the list goes on.

A favourite example of mine, simply because it acts as a pointer towards the future, is PoolParty. Using a simple syntax for describing infrastructure deployment, PoolParty synthesises the core concepts of this infrastructure change. For example, deploying a system no longer means a long architectural review and planning process, an RT ticket requesting new servers (with the inevitable wait), then the installation, racking and configuration of those servers, followed by change control meetings.

Deploying a system becomes in principle as simple as :-

Pool "my_application" do
Cloud "my_application_server" do
Using EC2
Instance 1...1
Image_id "xxxxx"
Autoscale
end

Cloud "my_database_server" do
Using EC2
Instances 1...1
Image_id "xxxxx"
end

end

It is these concepts of infrastructure as code and automation through orchestration tools, combined with a future of computing resources provided as larger components (pre-built racks and containers), which have led many to assume that cloud will remove the roles of many sysadmins. This is a weak assumption.

A historical review of computing resource usage shows it is price elastic. In short, as the cost of provisioning a unit of compute resource has fallen, demand has increased, leading to today's proliferation of computing.

Now, depending upon who you talk to, the inefficiency of computer resources in your average data centre runs at 80-90% (i.e. servers sit at 10-20% utilisation). Adoption of private clouds should (ignoring the benefits of using commodity hardware) provide a 5x reduction in price per unit. Based upon historical precedents, you could expect this to be much higher in public cloud and to lead to a 10-15x increase in consumption as we find that the long tail of applications companies desire becomes ever more feasible.

Of course, this ignores transient applications (those with a short lifetime, such as weeks, days or hours), componentisation (e.g. self-service and the use of infrastructure as a base component), co-evolution effects and the larger economies of scale potentially available to public providers.

Given Moore's law, the current level of wastage, a standard VM to physical server conversion rate, greater efficiencies in public provision, the increasing use of commodity hardware and the assumption that expenditure on computing resources will remain flat (any reduction in cost per unit being compensated for by an increase in workload), it is entirely feasible that within 5-7 years these effects could lead to a 100x increase in virtual infrastructure (i.e. the number of virtual servers compared to current physical servers). It's more than possible that in five years' time every large marketing department will have its own 1,000 node Hadoop cluster for data processing of consumer behaviour.
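To make the arithmetic behind that 100x explicit, here is a back-of-the-envelope sketch; every figure in it is an illustrative assumption rather than a measurement:

# illustrative assumptions only
current_utilisation = 0.15  # i.e. 80-90% of today's capacity is wasted
utilisation_gain    = 1 / current_utilisation  # roughly 5-7x from matching supply to demand
vms_per_physical    = 5     # assumed VM to physical server conversion rate
price_performance   = 3     # assumed Moore's law / commodity hardware gain over 5-7 years

# with expenditure held flat, the gains compound into extra virtual servers
multiplier = utilisation_gain * vms_per_physical * price_performance
puts multiplier.round       # => 100, i.e. of the order of 100x

# e.g. a company running 2,000 physical servers today could plausibly find
# itself managing around 200,000 virtual servers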

So, we come back to the original argument, which is "pre-cloud a sysadmin can manage a few hundred machines, in the cloud era with automation a sysadmin can manage tens of thousands of virtual machines". The problem with this argument is that if cloud develops as expected then each company will be managing two orders of magnitude more virtual machines (the same factor by which each sysadmin's capacity has grown), which means there'll be at least as many sysadmins as there are today.

Now whilst the model changes when it comes to platform and software as a service (and there are complications here which I'll leave to another day), the assumption that cloud will lead to fewer system administrators is another one of those cloud myths which hasn't been properly thought through.

P.S. The nature of the role of a sysadmin will change and their skillsets will broaden; however, if you're planning to use cloud to reduce their numbers then you might be in for a nasty shock.

P.P.S. Just to clarify, I've been asked by a company which runs 2,000 physical servers whether this means that in 5-7 years they could be running 200,000 virtual servers (some provided by private clouds and most by public clouds, ideally through an exchange or brokers). This is exactly what I mean. You're going to need orchestration tools just to cope and you'll need sysadmins skilled in these tools and in managing a much more complex environment.

Friday, April 09, 2010

Common Cloud Myths

Over the last three years, I've spent an increasingly disproportionate amount of my time dealing with cloud myths. I thought I'd catalogue my favourites by bashing one every other day.

Cloud is Green

The use of cloud infrastructure certainly allows for more efficient provision of infrastructure through matching supply to demand. In general :-

1. For a traditional scenario where every application has its own physical infrastructure, each application requires a capacity of compute resources, storage and network which must exceed its maximum load and provide suitable spare capacity for anticipated growth. This situation is often complicated by two factors. First, most applications contain multiple components and some of those components highly under-utilise physical resources (load balancers, for example). Second, due to the logistics of provisioning physical equipment, the excess capacity must be sufficiently large. Even at best, the total compute resources required will significantly exceed the sum of the individual peak application loads plus spare capacity.

2. The shared infrastructure scenario covers networks, storage and compute resources (through virtualisation). Resource requirements are balanced across multiple applications with variable loads and the total spare capacity held is significantly reduced. In an optimal case, the total capacity can be reduced to a general spare capacity plus the peak of the sum of the application loads (see the sketch after this list). Virtual Data Centres, provisioning resources according to need, are an example of shared infrastructure.

3. In the case of a private cloud (i.e. a private compute utility), the economics are close to those of the shared scenario. However, there is one important distinction: a compute utility is about commodity infrastructure. For example, virtual data centres provide highly resilient virtual infrastructure which incurs significant costs, whereas a private cloud focuses on the rapid provision of low cost, good enough virtual infrastructure.

At the nodes (the servers providing virtual machines) of a private cloud, redundant power supplies are seen as an unnecessary cost rather than a benefit. This ruthless focus on commodity infrastructure provides a lower price point per virtual machine but necessitates that resilience be created in the management layer and the application (the design for failure concept). The reasoning for this is the same reasoning behind RAID (redundant array of inexpensive disks): by pushing resilience into the management layer and combining more lower-cost, less resilient hardware, you can actually enable higher levels of resilience and performance for a given price point.

However, the downside is that you can't just take what has existed on physical servers and plonk it on a cloud and expect it to work like a highly resilient physical server. You can however do this with a virtual data centre.

This distinction and focus on commodity provision is the difference between a virtual data centre and a private cloud. It's a very subtle but massively important distinction because whilst a virtual data centre has the benefit of reducing educational costs of transition in the short term (being like existing physical environments), it's exactly these characteristics that will make it inefficient compared to private clouds in the longer term.

4. In the case of a public cloud infrastructure (a public compute utility), the concepts are taken further by balancing the variable demands of one company for compute resources against those of another. This is one of many potential economies of scale that can lead to lower unit costs. However, unit cost is only one consideration here; there are transitional and outsourcing risks that need to be factored in, which is why we often use hybrid solutions combining both public and private clouds.
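To illustrate the capacity argument in scenarios 1 and 2 above, here is a toy sketch; the loads are made up and chosen only to show why the sum of the peaks exceeds the peak of the sum:

# hourly loads (arbitrary units) for three applications peaking at different times
app_a = [9, 2, 1]
app_b = [1, 8, 2]
app_c = [2, 1, 10]

# scenario 1: dedicated infrastructure, where each application is sized for its
# own peak, so total capacity is at least the sum of the individual peaks
sum_of_peaks = [app_a.max, app_b.max, app_c.max].inject(:+)  # => 27

# scenario 2: shared infrastructure, where capacity need only cover the peak of
# the combined load plus a general spare capacity
combined    = app_a.zip(app_b, app_c).map { |hour| hour.inject(:+) }
peak_of_sum = combined.max  # => 13

The gap between the two figures is the capacity (and energy) that dedicated infrastructure holds idle.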

The overall effect of moving through these different stages is that the provision of infrastructure becomes more efficient and hence we have the "cloud is green" assumption.

I pointed out, back in 2008 at IT@Cork, that this assumption ignored co-evolution, componentisation and price elasticity effects.

By increasing efficiency and reducing the cost of provisioning infrastructure, a large number of activities which might once not have been economically feasible become economically feasible. Furthermore, the self-service nature of cloud not only increases agility by enabling faster provision of infrastructure but also accelerates user innovation through the provision of standardised components (i.e. the infrastructure equivalent of a brick). This latter effect can encourage the co-evolution of new industries in the same manner that the commoditisation of electronic switching (from the innovation of the Fleming valve to complex products containing thousands of switches) led to digital calculators and computers, which in turn drove further commoditisation of, and demand for, electronic switching.

The effect of these forces is that whilst infrastructure provision may become more efficient, the overall demand for infrastructure will outstrip these gains precisely because infrastructure has become a more efficient and standardised component.

We end up using vastly more of a more efficient resource. Lo and behold, cloud turns out not to be green.

The same effect was noted by William Stanley Jevons in 1865, when he "observed that England's consumption of coal soared after James Watt introduced his coal-fired steam engine, which greatly improved the efficiency of Thomas Newcomen's earlier design".