Saturday, February 23, 2013

Glass Half Empty: Understanding Cloud Computing SLAs

Let's talk about Cloud SLAs - a subject that is very important but nevertheless frequently misunderstood, over-simplified, or even outright ignored.

As I write this, Windows Azure has had an "up and down" week: on Tuesday 2/19/13, Windows Azure Storage was declared the leader in Cloud Storage by an independent report. A few days later on Friday 2/22/13, Windows Azure Storage suffered a worldwide outage - which shows you what a difference a couple of days can make to cloud vendor reputations. 

Given the events of this week, it seems timely to say some things about Cloud SLAs and correct some faulty reasoning that seems prevalent.

Service Level Agreements and High Availability
A service-level agreement is a contract between IT and a business to provide a certain level of service. In an enterprise an SLA could cover many things, including availability, response time, maintenance windows, and recovery time. In today's cloud computing environments, SLAs are largely about availability (uptime). So, let's talk about high availability.

Non-technical people say they want things up "all the time". Of course they do - but that's not reality. The chart below shows the "9s" categories for availability and what each means in terms of allowed downtime.

Availability %                    Downtime/year    Downtime/month (30 days)
90% ("one 9")                     36.5 days        72 hours
99% ("two 9's")                   3.65 days        7.20 hours
99.9% ("three 9's")               8.76 hours       43.2 minutes
99.95% ("three and a half 9's")   4.38 hours       21.6 minutes
99.99% ("four 9's")               52.56 minutes    4.32 minutes
99.999% ("five 9's")              5.26 minutes     25.9 seconds
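
The figures in a table like this come from simple arithmetic. Here is a minimal Python sketch, assuming a 365-day year and a 30-day month; the function name downtime_budget is my own illustration, not part of any library. Small differences from published tables usually come down to the month length assumed.

```python
def downtime_budget(availability_pct, period_hours):
    """Return the allowed downtime, in minutes, for the given period."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_hours * 60

HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month

for pct in (90, 99, 99.9, 99.95, 99.99, 99.999):
    per_year = downtime_budget(pct, HOURS_PER_YEAR)
    per_month = downtime_budget(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```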

The key thing to understand is that as you increase the number of 9's, the expense of meeting the SLA increases dramatically. Except for very simple systems or individual components, 4 9's is extremely expensive and not available today in the public cloud. 5 9's or better is prohibitively expensive and quite rare.

Typical Availability Provided in the Cloud is 99.9% to 99.95%
Cloud platforms have individual SLAs for each service. The majority of cloud services have an SLA of 3 9's (99.9%), and some computing services offer 3.5 9's (99.95%). That means a cloud service with a 99.9% SLA could be down about 43 minutes a month and still be within its SLA. Is that an acceptable level of service? It depends. It could be a step up for you or a step down for you depending on what SLA you're used to - what does your local enterprise deliver? Interestingly, when I ask this question of clients the answer I receive most often is, "We don't know what our current SLA is"!

Availability Interruptions Don't Play Nice
The table above might lead you to believe that outage time is somehow evenly distributed, but there's no guarantee of when an outage will occur or that outages will politely schedule themselves in small pockets of time over the course of the year. A service with a 99.9% SLA could be down 10 minutes a week, but it could also be down 43 minutes in one stretch each month, or even 8 hours straight once a year. You should assume the worst: an outage will occur during business hours on the day you can least afford it. Most cloud SLAs are measured monthly and provide some recompense if the target is not met.
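
To see how the measurement window interacts with outage patterns, here is a rough Python sketch. It assumes a 30-day month and a 99.9% monthly SLA; the function meets_monthly_sla and the outage lists are hypothetical, made up purely for illustration.

```python
def meets_monthly_sla(outage_minutes, sla_pct=99.9, days_in_month=30):
    """Return True if the month's total downtime stays within the SLA budget."""
    budget = (1 - sla_pct / 100) * days_in_month * 24 * 60
    return sum(outage_minutes) <= budget

# Three hypothetical months, each with a different outage pattern.
print(meets_monthly_sla([10, 10, 10, 10]))  # four 10-minute outages
print(meets_monthly_sla([43]))              # one 43-minute outage
print(meets_monthly_sla([480]))             # one 8-hour outage
```

The first two months stay inside the monthly budget. The single 8-hour outage does not, even though 8 hours fits within a full year's allowance at 99.9%. Either way, your users experience the outage no matter how the accounting comes out.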

SLAs are not guarantees. Every cloud provider has missed meeting its SLA at times.

The Subtlety of SLAs by Service: Your Cloud Service SLA is not your Solution's SLA
Nearly all solutions in the cloud use multiple cloud services (for example, a web site might use Compute, Storage, and Database services). Note that the cloud providers do not provide an SLA for your solution, only for the individual services you consume.

An easy trap to fall into is assuming the minimum SLA for the cloud services you are using translates into an overall SLA for your solution. Let's demonstrate the fallacy of that thinking. Let's say your solution makes use of three cloud services, each with a monthly SLA of 99.9%. On the first day of the month, the first cloud service has a 40-minute outage but is otherwise up the remainder of the month - it's still within its SLA. Now let's say the exact same thing happens on the second day of the month to the second cloud service. And then the same thing happens on the third day with your third cloud service. None of those services has violated its SLA, but your solution has been down 120 minutes that month.

The more services you use, the less uptime you can count on. Unless you're feeling lucky.
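
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 30-day month, a solution that is down whenever any one of its services is down, and (for the second function) failures that are independent of each other. The function names are my own.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month

def worst_case_downtime(sla_pcts):
    """Sum each service's monthly downtime budget, as if no outages overlap."""
    return sum((1 - pct / 100) * MINUTES_PER_MONTH for pct in sla_pcts)

def combined_availability(sla_pcts):
    """Multiply the availabilities, assuming failures are independent."""
    return prod(pct / 100 for pct in sla_pcts) * 100

services = [99.9, 99.9, 99.9]  # e.g. Compute, Storage, Database
print(f"{worst_case_downtime(services):.1f} minutes of downtime per month")
print(f"{combined_availability(services):.2f}% combined availability")
```

Three services at 99.9% each can add up to roughly 130 minutes of downtime in a month with every service still inside its own SLA, and the combined availability figure falls to about 99.7%.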

Using Multiple Data Centers Does Not Magically Increase 9's
I'm often asked if a 4 9's arrangement can be had by leveraging multiple data centers for failover. The answer is NO: while using multiple data centers and having good failover mechanisms in place does increase your likelihood of uptime, it is not a guarantee. Let's consider the best case and worst case scenarios where you are using a primary cloud data center and have a failover mechanism in place for a secondary data center that is ready to take over at a moment's notice. In the best case scenario, you experience no downtime at all in the primary data center, never need that second data center, and life is good. It could go that way. In the worst case scenario, you experience downtime in your primary data center and fail over to the second data center - but then that second data center fails too, soon thereafter. If you don't think that could happen, consider that some of the cloud outages have been worldwide ones where all data centers are unavailable simultaneously.

Leveraging multiple data centers is a good idea - but don't represent that as adding more 9's to your availability. It's not a supportable claim.
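
One way to see why the extra 9's are not supportable is a toy model in Python. Everything here is an assumption for illustration: failover is instant and perfect, each data center fails independently except during a shared (for example, worldwide) outage that takes both down, and the 0.05% shared-outage figure is invented, not measured.

```python
def two_site_availability(site_pct, shared_outage_pct=0.0):
    """Availability (%) of a primary/secondary pair with instant failover.

    site_pct: availability of each data center on its own.
    shared_outage_pct: share of time (%) a common failure downs both sites.
    """
    u = 1 - site_pct / 100        # chance a single site is down on its own
    c = shared_outage_pct / 100   # chance of a shared outage hitting both
    unavailable = c + (1 - c) * u * u
    return (1 - unavailable) * 100

# Independent failures only: the optimistic paper calculation.
print(f"{two_site_availability(99.9):.4f}%")
# Same sites, plus a hypothetical 0.05% shared outage.
print(f"{two_site_availability(99.9, 0.05):.4f}%")
```

On paper, two independent 99.9% sites look like six 9's. Add even a small shared failure mode and the result drops back to just under 99.95%, because the shared outage dominates. Since nobody can promise that data center failures are independent, the paper calculation is not something to quote as an SLA.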

How to Approach Cloud SLAs: Pessimistically
The only sane way to approach cloud computing SLAs (or any SLA for that matter) is to be extremely pessimistic and assume the worst possible case. If you can design mitigations and contingency plans for the worst case, you are well-prepared for any eventuality. If on the other hand you are "hoping for the best", your plans are extremely shaky. Murphy's Law should dominate your thinking when it comes to designing for failure.

Do not forget that Cloud SLAs are not a promise; they are a target. There may be some consequences such as refunds if your cloud vendor fails to meet its SLA, but that is usually little consolation for the costs of your business being down.

Cloud SLA Planning Best Practices
1. Become fully familiar with the SLA details of each cloud service you consume.
2. Keep in mind many cloud services are in preview/beta and may not be backed by any SLA.
3. Don't think of an SLA as a guarantee; your cloud provider will not always meet their SLA.
4. Do not confuse the availability SLA for individual cloud services with the overall uptime for your solution. There is no SLA for your solution, and the more services you use, the less uptime you are likely to have.
5. Do not plan on being lucky.
6. Build in contingency planning for a data center failure that allows you to fail over to another data center. Yes, this will be extra work and increase your costs. Yes, it is worth it.
7. Remember that worldwide outages can occur, and you need a contingency plan for that scenario as well.

To Cloud, or Not to Cloud?
Does all of this mean the cloud is a bad place to run your applications? Not at all - it could well be an improvement over what your local experience is. And, there are significant benefits to running in the cloud that may more than offset the inconvenience of occasional downtime. The important thing is that you approach the cloud eyes-open and with realistic expectations.


Sivaramakrishnan Vaidyanathan said...

Superb blog post, extremely informative and interesting. Well written! Thank you.

Amit said...

Great analysis, covers all aspects of the current situation.

Unknown said...

Good analysis about Cloud SLA's.

David V. Corbin said...

Excellent post. Of course you can help by picking alternate datacenters from DIFFERENT providers. This may dramatically increase the cost (due to differences in the environments), but the odds of Amazon, Microsoft, (plus a third and fourth provider) all having problems at the same time are extremely low.