Saturday, June 07, 2008

Don't Panic ...?

According to Pingdom and Network World, Amazon was down, kaput, not working and generally having an unplanned event.

Have no fear, in the world of "national grids" the failure of one power station isn't going to stop the provision of electricity. So, in the world of "cloud computing", I can just switch over to another provider.

Alas, No.

I exist in a world where portability and interoperability are replaced with an abundance of lock-in. Let's imagine we are consumers of Amazon's EC2 & S3. When they are down, do we sit here twiddling our thumbs whilst wondering :-

  • Has any data gone missing?
  • How am I going to find out?
  • What if they don't come back?
  • When are they coming back?
  • Should I start building my own infrastructure?
  • Should I have really fired my systems team?

In short, do we just sit here thinking panic, panic, panic, panic .... phewww, it's back again.

Now the people at Amazon are smart, so they will be taking every precaution. A long time ago, I spent a short amount of time doing complex risk analysis using a mix of quantitative and qualitative techniques. Amazon is bound to have hundreds of risk analysts doing the equivalent of HAZOP and HAZAN on their systems: making sure that every system has a mass of redundancy, that every redundant system is different from the original and that every conceivable fault tree has been analysed. Systematic failures can be very painful when they hit, and when the fault is in the standard component you've implemented everywhere ... ouch. Despite all the precautions and measures they take, they will one day see their Black Swan.
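The arithmetic behind that "ouch" is worth seeing. A minimal sketch (with purely illustrative numbers, nothing to do with Amazon's actual figures): redundancy drives the probability of independent failures down geometrically, but a systematic, common-mode fault in a shared standard component puts a floor under the whole calculation that no amount of redundancy removes.

```python
# Illustrative sketch: why common-mode (systematic) faults dominate
# once you add redundancy. All probabilities are made-up examples.

def independent_failure(p, n):
    """A system of n independent redundant copies fails only if all n fail."""
    return p ** n

def with_common_mode(p, n, c):
    """Add a common-mode fault with probability c (e.g. the same buggy
    standard component deployed everywhere): the system fails if the
    shared fault fires, or if all n redundant copies fail anyway."""
    return c + (1 - c) * independent_failure(p, n)

p = 0.01   # failure probability of one component
c = 0.001  # probability of a shared, systematic fault

for n in (1, 2, 3):
    print(n, independent_failure(p, n), with_common_mode(p, n, c))
```

With three independent copies the naive failure probability is one in a million, but the common-mode term pins the real figure near one in a thousand: adding more redundant copies of the same component barely moves it.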

When that happens we had better have a simple switch-over to another provider, or we will be learning the unpleasant lesson of second sourcing for ourselves.


Unknown said...

However, in this case it appears to be a DoS attack, aimed at the load balancers rather than the backend infrastructure.

No matter how many suppliers you used on the backend, you could not prevent this.

The attack is obviously calculated to hit the weakest point, where all the traffic passes, despite the huge upgrades Amazon has made in the past couple of years to defend against such threats.

The AWS infrastructure was fine and doesn't have the same single pinch point, although if you don't think carefully about your own load balancing you are of course as vulnerable, probably more so.

They also did manage to stay up, so their precautions did work to some small extent.

This comment is based on unofficial information, some of it speculative.


Anonymous said...


You touch upon a critical issue that hasn't really had a lot of coverage amidst all the hype about 'The Cloud'. To perform, and so make money, most modern enterprises rely on flows of data, and anything that interrupts these flows is potentially disastrous.

In recent times we have seen the London Stock Exchange fail, undersea data cables cut in the Gulf, espionage in Lithuania temporarily shutting down much of the economy, and a previous failure at Amazon, which runs one of the most modern data farms. We don't know the total cost of these failures, but it is obviously a large number in any currency.

Businesses need to take great care before committing mission critical applications to The Cloud. For example, they should take account of the issues and economics associated with network traffic and data movements.

If we take the view that IT exists for one reason - to manage the flow of data between business assets - then we can understand how each flow of data moves across and through the enterprise. This enables us to accurately model the ‘big picture’ of the business and IT relationship, to value each flow of data, and to anticipate vulnerabilities where the continued flow of data may be at risk.

And having this big picture will help us minimise the amount of damage done by the dreaded 'black swans'.

Anonymous said...

Interesting interview with Nassim Nicholas Taleb on Black Swans etc earlier in the week, in case you missed it:

swardley said...

Hi Al,

Thanks for the comment.

According to reports (I was asleep at the time), Amazon's S3 site and EC2 cloud platform were also down. As for the cause, I haven't had a chance to look into it, but it really isn't important for the argument.

Without portability in the SaaS world, there is no second sourcing. A single-source position is strategically weak in terms of protecting the buyer's interests in regard to monopolistic opportunism, security and effective pricing.

Manufacturing learnt these lessons a long time ago. It looks like IT is hell-bent on repeating the same mistakes.

swardley said...

Hi Paul,

That's a fantastic analysis - crisp and to the point. Excellent post.

The services which are suitable for provision by the "cloud" are those which are uniform, ubiquitous and a cost of doing business.

At the application layer you have candidates like email, payroll, HR, CRM and accounting systems; at the platform layer, standard development and deployment environments; at the hardware layer, standard machines.

I italicised the term cloud as it is as ephemeral as its namesake. I prefer to consider it as a marketplace of utility computing providers in which you can switch between providers and competition is based upon service and price.

Such a market can only occur when there is portability between providers. Without such we have all sorts of single sourcing issues and weaknesses. Open source is the critical factor here.

As for the economics: Gray's analysis is obviously focused on P2P infrastructure and mobile apps rather than a world of large computing providers competing in a market, with consumers switching between them. The latter scenario is more in keeping with Gray's recommendation of keeping the computation near the data for most apps.

swardley said...

Hi Tony,

Thanks for that, I'll check it out.