Something weird and unexpected happened lately: the big Amazon cloud failed. Everyone in every medium is talking about it, and everyone is conveying the same sense of surprise.
Wait a second and let’s ask ourselves: why is it weird and unexpected that AWS failed? The cloud is a human creation, so it failed, as should be expected, and it will fail again. By the way, I am sure that it has failed a number of times in the past, but those failures weren’t big enough to be noticed like this last event.
I have personally had to deal with at least one failure inside AWS: it involved a VM that had worked smoothly for months. In my case the big difference was that AWS alerted me weeks in advance, saying that on a specific date the VM would be destroyed (it had ephemeral storage) because the underlying hardware was failing in some way.
If anyone ever thought that this would never happen, he or she should reflect on how much the hype (and FUD) around cloud computing influenced that judgment, and also on the fact that some competitors may be very interested in every one of its failures. Secondly, it is the application architect’s responsibility to know the infrastructure on which the application will run and to take the appropriate countermeasures against failure (this is exactly why resources such as the AWS Architecture Center exist). If the chosen infrastructure is not adequate, or its internals are not sufficiently documented (as is the case for some inner parts of AWS), it’s up to the architect and his or her company to accept the risks and pay for the consequences of possible failures (in other words, blame also the companies that choose AWS without the right investment in, and knowledge of, its availability). BTW, read this if you’re not yet convinced.
Last but not least, the reaction of Amazon (and of the other companies affected) is an example of how a company can react positively (beyond the apology) even in the face of the worst events: a detailed description of the problem, the solutions, and their timing and consequences, and, more importantly, the decisions and actions taken to avoid another similar event. I’m also sure that Amazon did not do everything perfectly; they make mistakes like anyone else. But I love how they are organized and how clearly they can describe their architecture, what happened and how they acted: roles and checklists are well defined and well understood, so they can decide and act very quickly.
Finally, I would like to stress that I’m writing about this issue not to defend a company like Amazon (I’m sure they don’t need my help), but to draw my colleagues’ attention to the fact that cloud computing is here to stay. It is not perfect, just like any other hosting solution (in some ways, it now seems more real), but it has to be considered seriously as the best answer to at least some questions in IT and application architecture.
Update: a little more than a week after I published this post, InfoQ published an interesting article entitled Cloud Computing Is Here to Stay. Apart from the word-for-word correspondence between the title of the article and the conclusion of my post, I’m adding a reference to it because it provides some numbers on real opinions and cloud usage by companies, data that was missing from my post and that can help readers better understand that “[...] cloud computing is no longer a possible technology of the future but one of today [...]”.