Now, armchair technologists everywhere are weighing in with their opinions, which range from “See, I told you so: the cloud is just hype. You’re a fool to use it for anything mission critical.” to “This isolated incident is not a big deal.” and even “It’s your fault for not knowing better and having a contingency plan.” Many rendering their opinion are exhibiting a bias, and while it may be human nature to color your opinion based on whether you are pro-Microsoft or not, I’m going to try to rise above that. While I am a Windows Azure MVP and a fan of the platform, and certainly wish this hadn’t happened, I’m going to offer my take with a sincere attempt to neither minimize the real problems this caused businesses nor overstate the implications.
Acknowledging the Impact
First, I want to openly recognize the impact of this outage. If you’re running your business on a cloud platform and it goes down for some or all of a business day, this is devastating. The reimbursement a cloud provider will give you for downtime is nothing compared to the business revenue you may have lost and the damage to reputation you may have incurred. Moreover, if your business is down, your customers are also impacted.
One aspect of this outage that may have been especially upsetting to customers is that some services were out on a worldwide basis for a time, including service management and the management portal. That interfered with one of the recovery patterns in the cloud, which is to switch over to an alternative data center. It also made it impossible to make new emergency deployments. This appears to be because the problem was a software bug that affected all data centers, rather than the “equipment failure in a single data center” scenario that often gets a lot of the attention in cloud reliability architecture.
How Reliable Should We Expect Cloud Platforms to Be?
Microsoft certainly isn’t the only cloud provider that has an occasional problem. Last April, Amazon had a significant multi-day outage due to a “remirroring storm”. The gallery of online providers with problems in recent years includes Salesforce.com, Google Gmail, Twitter, PayPal, and Rackspace. Think about your own data center or the company that you use for hosting: they have probably had issues from time to time as well.
Yet, cloud providers make a big deal about their failure-resistant architectures, which feature high redundancy of servers and data, distribution of resources across fault zones, intelligent management, and first-rate data centers and staff. Is it all hype, as some contend? Are cloud data centers no more reliable (or less reliable) than your own data center?
The truth is, cloud platforms are superbly engineered and do have amazing designs to uphold reliability, but they have limits. Microsoft (or Amazon for that matter) can explain to you how their architecture and management safeguard your cloud assets so that even a significant equipment failure in a data center (such as a switch failure) doesn’t take out your application and data. This is not hype; it is true. But what about the statistically unlikely case of multiple simultaneous failures? Or a software bug that affects all data centers? Cloud computing can’t protect against every possibility. This does not mean the cloud should not be used; it does mean its reliability needs to be understood for what it is. Cloud data centers are extremely reliable, but they aren’t infallible. Flying is statistically safer than driving, yet we still have air disasters from time to time.
This recent outage illustrates the human factor in IT: people can and do make mistakes. While much of the magic in cloud data centers has come from automation and taking people out of the loop, people (and software written by people) will always be part of the mix. Since we won’t be minting perfect people anytime soon, the potential for human error remains. Of course, this is true of all data centers.
What Can Customers Do?
Having acknowledged the impact, let’s also point out that cloud providers do not promise 100% availability in the first place: typically, cloud platforms promise three to three and a half 9s (roughly 99.9% to 99.95% availability) for their services. That means you might be without a cloud service for roughly 4 to 9 hours a year even under the best of conditions, so you need to plan for the possibility of being down for a business day without knowing when that might be. While this recent outage was a bit longer than 8 hours for some customers, it was essentially being down for a day. Customers who took the SLA seriously and had made emergency provisions should have been able to switch to a contingency arrangement; those who never made those plans were stuck and felt helpless.
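To put the SLA figures in concrete terms, here is a minimal sketch that converts an availability percentage into the downtime it permits over a year. It assumes that three 9s means 99.9% and three and a half 9s means 99.95%, which is a common reading rather than any provider's stated definition. The function name is purely illustrative. Real SLAs are often measured per month and per service, so treat the yearly numbers as a rough guide only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service can be down while still meeting the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Assumed interpretation: three 9s = 99.9%, three and a half 9s = 99.95%
for sla in (99.9, 99.95):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% availability allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions the result is about 8.76 hours a year at 99.9% and about 4.38 hours at 99.95%. Note that an SLA does not say how that downtime is distributed: it could arrive as one long outage in the middle of a business day.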
What should a contingency plan provide? It depends on how mission critical your cloud applications are and what you’re willing to pay for the insurance of extra availability. You can choose to wait out an outage; guard against single data center outages using alternative data centers; or have an alternative place to run altogether. Let this outage be the wake-up call to have an appropriate contingency plan in place.