Bits or pieces?: Don't Panic ...?

According to Pingdom and Network World, Amazon was down, kaput, not working and generally having an unplanned event.

Have no fear: in the world of "national grids", the failure of one power station isn't going to stop the provision of electricity. So, in the world of "cloud computing", I can just switch over to another provider.

Alas, no.

I exist in a world where portability and interoperability are replaced with an abundance of lock-in. Let's imagine we are consumers of Amazon's EC2 and S3. When they are down, do we sit here twiddling our thumbs whilst wondering:

Has any data gone missing?
How am I going to find out?
What if they don't come back?
When are they coming back?
Should I start building my own infrastructure?
Should I have really fired my systems team?

In short, do we just sit here thinking panic, panic, panic, panic .... phewww, it's back again?

Now the people of Amazon are smart, so they will be taking every precaution. A long time ago, I spent a short amount of time doing complex risk analysis using a mix of quantitative and qualitative methods. Amazon is bound to have hundreds of risk analysts doing the equivalent of HAZOP and HAZAN on their systems, making sure that every system has a mass of redundancy, that every redundant system is different from the original and that every conceivable fault tree has been analysed. Systematic failures can be very painful when they hit, and when the fault is in the standard component you've implemented everywhere ... ouch. Despite all the precautions and measures they take, they will one day see their Black Swan.

When that happens, we had better have a simple switch over to another provider, or we will be receiving the unpleasant lesson of second sourcing ourselves.
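To make "a simple switch over" a little more concrete, here is a minimal sketch of what it might look like from the consumer's side, assuming providers were interchangeable enough to sit behind one common call. Everything in it is hypothetical: `ProviderDown`, `fetch_with_failover`, `primary` and `secondary` are illustrative names, not any real provider's API, and real lock-in is precisely what stops it being this easy today.

```python
from typing import Callable, Sequence


class ProviderDown(Exception):
    """Hypothetical error raised when a provider is unavailable."""


def fetch_with_failover(key: str, providers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each provider in order and return the first successful result."""
    for fetch in providers:
        try:
            return fetch(key)
        except ProviderDown:
            # This provider is down; fall through to the next one.
            continue
    raise ProviderDown(f"all {len(providers)} providers failed for {key!r}")


# Hypothetical stand-ins for two interchangeable storage providers.
def primary(key: str) -> bytes:
    raise ProviderDown("primary is having an unplanned event")


def secondary(key: str) -> bytes:
    return b"data for " + key.encode()


result = fetch_with_failover("report.csv", [primary, secondary])
```

The sketch only works because both stand-ins accept the same call and return the same kind of data, which is the portability and interoperability the post is asking for.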

Saturday, June 07, 2008
