Sunday, March 4, 2012

On the Recent Windows Azure Leap Day Outage

On February 29th, Windows Azure suffered a widespread service disruption which, according to a Microsoft statement, appears to have been caused by “a time calculation that was incorrect for the leap year”. Once a fix had been devised and rolled out and the consequences of the original problem dealt with, customers were back up and running as early as 3 am PT (most customers, according to the Microsoft statement) or as late as 5-6 pm (which is what my customers and I experienced). From what I understand, the impact was limited to availability; there was no data loss.

Now, armchair technologists everywhere are weighing in with their opinions, which range from “See, I told you so: the cloud is just hype. You’re a fool to use it for anything mission critical.” to “This isolated incident is not a big deal.” and even “It’s your fault for not knowing better and having a contingency plan.” Many rendering their opinion are exhibiting a bias, and while it may be human nature to color your opinion based on whether you are pro-Microsoft or not, I’m going to try to rise above that. While I am a Windows Azure MVP and a fan of the platform, and certainly wish this hadn’t happened, I’m going to offer my take with a sincere attempt to neither minimize the real problems this caused businesses nor overstate the implications.

Acknowledging the Impact

First, I want to openly recognize the impact of this outage. If you’re running your business on a cloud platform and it goes down for some or all of a business day, the result can be devastating. The reimbursement a cloud provider will give you for downtime is nothing compared to the business revenue you may have lost and the damage to reputation you may have incurred. Moreover, if your business is down, your customers are also impacted.

One aspect of this outage that may have been especially upsetting to customers is that some services were out on a worldwide basis for a time, including service management and the management portal. That interfered with one of the recovery patterns in the cloud, which is to switch over to an alternative data center. It also made it impossible to make new emergency deployments. This appears to be because the problem was a software bug that affected all data centers, rather than the “equipment failure in a single data center” scenario that gets most of the attention in cloud reliability architecture.

How Reliable Should We Expect Cloud Platforms to Be?

Microsoft certainly isn’t the only cloud provider that has an occasional problem. Last April, Amazon had a significant multi-day outage due to a “remirroring storm”. The gallery of online providers with problems in recent years includes Salesforce.com, Google Gmail, Twitter, PayPal, and Rackspace. Think about your own data center or the company that you use for hosting: they have probably had issues from time to time as well.

Yet, cloud providers make a big deal about their failure-resistant architectures, which feature high redundancy of servers and data, distribution of resources across fault zones, intelligent management, and first-rate data centers and staff. Is it all hype, as some contend? Are cloud data centers no more reliable (or less reliable) than your own data center?

The truth is, cloud platforms are superbly engineered and do have amazing designs to uphold reliability, but they have limits. Microsoft (or Amazon for that matter) can explain to you how their architecture and management safeguards your cloud assets so that even a significant equipment failure in a data center (such as a switch failure) doesn’t take out your application and data. This is not hype; it is true. But what about the statistically unlikely case of multiple simultaneous failures? Or a software bug that affects all data centers? Cloud computing can't protect against every possibility. This does not mean the cloud should not be used; it does mean its reliability needs to be understood for what it is. Cloud data centers are extremely reliable, but they aren’t infallible. Flying is statistically safer than driving, yet we still have air disasters from time to time.

This recent outage illustrates the human factor in IT: people can and do make mistakes. While much of the magic in cloud data centers has come from automation and taking people out of the loop, people (and software written by people) will always be part of the mix. Since we won’t be minting perfect people anytime soon, the potential for human error remains. Of course, this is true of all data centers.

What Can Customers Do?

Having acknowledged the impact, let’s also point out that cloud providers do not promise 100% availability in the first place: typically, cloud platforms promise three to three and a half nines (99.9% to 99.95%) for their services. That means you might be without a cloud service for roughly 4 to 9 hours a year even under the best of conditions, and you need to plan for the possibility of being down for a business day, not knowing when that might be. While this recent outage ran longer than that for some customers, it was essentially being down for a day. Customers who took the SLA seriously and had made emergency provisions should have been able to switch to a contingency arrangement; those who never made those plans were stuck and felt helpless.
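
The downtime an availability promise allows is simple arithmetic, and it is worth running the numbers for your own SLA. Here is a minimal sketch, assuming that "three nines" means 99.9% and "three and a half nines" means 99.95%, and that the SLA is measured over a 365-day year (real SLAs are often measured monthly, so treat this as a rough budget only):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap days


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service can be down and still meet the stated availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Assumed readings of "three nines" and "three and a half nines"
for sla in (99.9, 99.95):
    print(f"{sla}% availability -> {downtime_hours_per_year(sla):.2f} hours/year")
```

For 99.9% this comes to 8.76 hours a year, and for 99.95% it comes to 4.38 hours a year.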

What should a contingency plan provide? It depends on how mission critical your cloud applications are and what you’re willing to pay for the insurance of extra availability. You can choose to wait out an outage; guard against single data center outages using alternative data centers; or have an alternative place to run altogether. Let this outage be the wake-up call to have an appropriate contingency plan in place.
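
To make the second and third options concrete, here is a minimal sketch of an ordered failover list: a primary deployment, a second data center with the same provider, and a fallback hosted somewhere else entirely. The host names and the health check are hypothetical placeholders, not part of any Windows Azure API; a real check would probe your own application and be run from outside the platform being checked.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes the health check, or None if all fail."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as an unhealthy endpoint.
            continue
    return None


# Hypothetical deployments, listed in order of preference.
DEPLOYMENTS = [
    "primary.example.com",    # main cloud data center
    "secondary.example.com",  # alternative data center, same provider
    "fallback.example.net",   # alternative place to run altogether
]

# Simulate a provider-wide outage: both of the provider's data centers are down.
down = {"primary.example.com", "secondary.example.com"}
print(pick_endpoint(DEPLOYMENTS, lambda host: host not in down))
```

With both of the provider's data centers marked as down, the sketch selects `fallback.example.net`. An alternative data center alone would not have helped in that situation, which is why the third tier is worth considering for truly mission-critical applications.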

5 comments:

Ryan said...

VERY well written!

Anonymous said...

Okay. I'm a Microsoft guy too. But it seems to me that if we have to pay for redundant backup systems, why not just host it ourselves? The cost, especially for highly available enterprise systems, doesn't seem to be that different.

פיני קרישר said...

But the issue was across all data centers. This is a problem since we have less hardware in our company... so what percentage do we need to put?

CassiData said...

I like your comments and your professional demeanor regarding this. I'll be tweeting your blog to followers. Not having a private cloud and/or own data center backup plan as a failover is simply irresponsible and certainly not the fault of MS. Our company, RyanTech, is so Microsoft centric that while we don't hold them at fault for yesterday, we do find it ironic in some cases. That all said, we had zero production infrastructures fail yesterday and will continue to design for the best in class service that customers deserve, no matter the cloud platform and no matter the personal choice on providers. Nice work David.

einsamsoldat said...

I would like to agree. However, the 29th Feb 2012 outage affected both the North Central US and South Central US Azure Compute regions, not forgetting the North EU region. Yes, we could redeploy our Azure applications to a different geographical location such as SEA, East Asia, or Western EU, sacrificing network latency. Unfortunately, the Azure Platform Management Portal was not reliable on the 29th Feb 2012. I would appreciate knowing what the alternative is, should we face the same challenge.