Now, armchair technologists everywhere are weighing in with their
opinions, which range from “See, I told
you so: the cloud is just hype. You’re a fool to use it for anything mission
critical.” to “This isolated incident
is not a big deal.” and even “It’s
your fault for not knowing better and having a contingency plan.” Many rendering
their opinion are exhibiting a bias, and while it may be human nature to color
your opinion based on whether you are pro-Microsoft or not, I’m going to try to
rise above that. While I am a Windows Azure MVP and a fan of the platform, and
certainly wish this hadn’t happened, I’m going to offer my take with a sincere
attempt to neither minimize the real problems this caused businesses nor
overstate the implications.
Acknowledging the Impact
First, I want to openly recognize the impact of this outage. If
you’re running your business on a cloud platform and it goes down for some or
all of a business day, the effect is devastating. The reimbursement a cloud
provider will give you for downtime is nothing compared to the business revenue
you may have lost and the damage to reputation you may have incurred. Moreover,
if your business is down your customers are also impacted.
One aspect of this outage that may have been
especially upsetting to customers is that some services were out on a worldwide
basis for a time, including service management and the management portal. That
interfered with one of the recovery patterns in the cloud, which is to switch
over to an alternative data center. It also made it impossible to make new emergency
deployments. The cause appears to be that the problem was a
software bug affecting all data centers, rather than the “equipment failure
in a single data center” scenario that often gets a lot of the attention in cloud
reliability architecture.
How Reliable Should We Expect Cloud Platforms to Be?
Microsoft certainly isn’t the only cloud provider that has an
occasional problem. Last April, Amazon had a significant multi-day outage due
to a “remirroring storm”. The gallery of online providers with problems in
recent years includes Salesforce.com, Google Gmail, Twitter, PayPal, and Rackspace. Think about
your own data center or the company that you use for hosting: they have probably
had issues from time to time as well.
Yet, cloud providers make a big deal about their failure-resistant
architectures, which feature high redundancy of servers and data, distribution
of resources across fault zones, intelligent management, and first-rate data
centers and staff. Is it all hype, as
some contend? Are cloud data centers no more reliable (or less reliable) than
your own data center?
The truth is, cloud platforms are superbly engineered and
do have amazing designs to uphold reliability, but they have limits. Microsoft
(or Amazon for that matter) can explain to you how their architecture and
management safeguard your cloud assets so that even a significant equipment
failure in a data center (such as a switch failure) doesn’t take out your application
and data. This is not hype; it is true. But what about the statistical
unlikelihood of multiple simultaneous failures? Or a software bug that affects
all data centers? Cloud computing can't protect against every possibility. This does not mean the cloud should not be
used; it does mean its reliability needs to be understood for what it is. Cloud
data centers are extremely reliable, but they aren’t infallible.
It is statistically safer to fly than to drive, yet we still have air
disasters from time to time.
This recent outage illustrates the human factor in IT: people can
and do make mistakes. While much of the magic in cloud data centers has come
from automation and taking people out of the loop, people (and software written
by people) will always be part of the mix. Since we won’t be minting perfect
people anytime soon, the potential for human error remains. Of course, this is
true of all data centers.
What Can Customers Do?
Having acknowledged the impact, let’s also point out that cloud
providers do not promise 100% availability in the first place: typically, cloud
platforms promise 3 to 3 ½ 9s (99.9% to 99.95%) for their services. That means you might be
without a cloud service for roughly 4 to 9 hours a year even under the best of conditions, and you need
to plan for the possibility of being down for a business day, not knowing when
that might be. While this recent outage was a bit longer than 8 hours for some
customers, it was essentially being down for a day. Customers who took the SLA
seriously and had made emergency provisions should have been able to switch to
a contingency arrangement; those who never made those plans were stuck and felt
helpless.
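To make the SLA arithmetic concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it permits over a year. The function name `allowed_downtime_hours` and the 365-day year are assumptions made for illustration; real SLAs are usually measured per month and per service, so treat the output as a rough planning figure rather than a contractual one.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumes a non-leap year


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime that still satisfy the given availability.

    availability_percent is the promised uptime, e.g. 99.9 for "three 9s".
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (100 - availability_percent) / 100


if __name__ == "__main__":
    # "3 to 3 1/2 9s" read as 99.9% and 99.95% availability.
    for sla in (99.9, 99.95):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% availability allows about {hours:.2f} hours of downtime per year")
```

Three 9s comes out to a little under 9 hours a year and three and a half 9s to a little over 4 hours, which is the order of magnitude a contingency plan needs to absorb.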
What should a contingency plan provide? It depends on how mission
critical your cloud applications are and what you’re willing to pay for the
insurance of extra availability. You can choose to wait out an outage; guard
against single data center outages using alternative data centers; or have an
alternative place to run altogether. Let this outage be the wake-up call to have
an appropriate contingency plan in place.
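One way to picture the second and third options is as an ordered list of places your application can run, with traffic going to the first one that passes a health check. The Python sketch below is only an illustration under that assumption: the function `pick_active_endpoint`, the endpoint names, and the stand-in health check are all hypothetical and not part of any Windows Azure API.

```python
from typing import Callable, Iterable, Optional


def pick_active_endpoint(endpoints: Iterable[str],
                         is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint, in priority order, whose health check passes.

    A health check that raises is treated the same as an unhealthy endpoint.
    Returns None when nothing is healthy, i.e. the "wait out the outage" case.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            continue  # an unreachable endpoint counts as down
    return None


if __name__ == "__main__":
    # Hypothetical deployments, listed in order of preference.
    deployments = [
        "primary-datacenter",      # normal home of the application
        "alternate-datacenter",    # same cloud platform, different data center
        "alternate-hosting",       # a different place to run altogether
    ]
    # Stand-in health check: pretend both cloud data centers are down.
    down = {"primary-datacenter", "alternate-datacenter"}
    print(pick_active_endpoint(deployments, lambda name: name not in down))
```

Since service management and the management portal were themselves unavailable during this outage, a check like this would need to run somewhere that does not depend on the affected platform.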
5 comments:
VERY well written!
Okay. I'm a Microsoft guy too. But it seems to me that if we have to pay for redundant backup systems, why not just host it ourselves? The cost, especially for highly available enterprise systems, doesn't seem to be that different.
But the issue was across all data centers. This is a problem since we have less hardware in our company... so what percentage do we need to put?
I like your comments and your professional demeanor regarding this. I'll be tweeting your blog to followers. Not having a private cloud and/or own data center backup plan as a failover is simply irresponsible and certainly not the fault of MS. Our company, RyanTech, is so Microsoft centric that while we don't hold them at fault for yesterday, we do find it ironic in some cases. That all said, we had zero production infrastructures fail yesterday and will continue to design for the best in class service that customers deserve, no matter the cloud platform and no matter the personal choice on providers. Nice work David.
I would like to agree. However, the 29 Feb 2012 outage affected the Azure Compute regions in both North Central US and South Central US, not forgetting the North EU region. Yes, we could redeploy our Azure applications to a different geographical location such as SEA, East Asia, or Western EU, sacrificing network latency. Unfortunately, the Azure Platform Management Portal was not reliable on 29 Feb 2012. I would appreciate knowing what the alternative is should we face the same challenge again.