Saturday, February 23, 2013

Glass Half Empty: Understanding Cloud Computing SLAs

Let's talk about Cloud SLAs - a subject that is very important, but is nevertheless frequently misunderstood, over-simplified, or even outright ignored.

As I write this, Windows Azure has had an "up and down" week: on Tuesday 2/19/13, Windows Azure Storage was declared the leader in Cloud Storage by an independent report. A few days later on Friday 2/22/13, Windows Azure Storage suffered a worldwide outage - which shows you what a difference a couple of days can make to cloud vendor reputations. 

Given the events of this week, it seems timely to say some things about Cloud SLAs and correct some faulty reasoning that seems prevalent.

Service Level Agreements and High Availability
A service-level agreement is a contract between IT and a business to provide a certain level of service. In an enterprise an SLA could cover many things, including availability, response time, maintenance windows, and recovery time. In today's cloud computing environments, SLAs are largely about availability (uptime). So, let's talk about high availability.

Non-technical people say they want things up "all the time". Of course they do - but that's not reality. The chart below shows the "9s" categories for availability and what that means in terms of uptime.

Availability %                      Downtime/year    Downtime/month
90% ("one 9")                       36.5 days        72 hours
99% ("two 9's")                     3.65 days        7.20 hours
99.9% ("three 9's")                 8.76 hours       43.8 minutes
99.95% ("three and a half 9's")     4.38 hours       21.56 minutes
99.99% ("four 9's")                 52.56 minutes    4.32 minutes
99.999% ("five 9's")                5.26 minutes     25.9 seconds
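
The figures in the chart come from simple arithmetic: the allowed downtime is the length of the period multiplied by (1 - availability). Here is a minimal Python sketch, assuming a 365-day year and a 30-day month. The function name and the month length are my own choices for illustration; a different month convention shifts the monthly figures slightly.

```python
MINUTES_PER_DAY = 24 * 60

def allowed_downtime_minutes(availability_pct, period_days):
    """Maximum downtime, in minutes, that still meets the availability target."""
    return period_days * MINUTES_PER_DAY * (1 - availability_pct / 100)

# Print the yearly and monthly downtime budget for each "9s" category.
for pct in (90, 99, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime_minutes(pct, 365)   # assumption: 365-day year
    per_month = allowed_downtime_minutes(pct, 30)   # assumption: 30-day month
    print(f"{pct}%: {per_year:.2f} min/year, {per_month:.2f} min/month")
```

Running the same function against your own availability target is a quick way to turn a percentage into a number the business can react to.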

The key thing to understand is that as you increase the number of 9's the expense of meeting the SLA increases dramatically. Except for very simple systems or individual components, 4 9's is extremely expensive, and not available today in the public cloud. 5 9's or better is prohibitively expensive and quite rare.

Typical Availability Provided in the Cloud is 99.9% to 99.95%
Cloud platforms have individual SLAs for each service. The majority of cloud services have an SLA of 3 9's (99.9%), and some computing services offer 3.5 9's (99.95%). That means a cloud service could be down 43 minutes a month and still be within its SLA. Is that an acceptable level of service? It depends. It could be a step up or a step down for you, depending on what SLA you're used to - what does your local enterprise deliver? Interestingly, when I ask this question of clients the answer I receive most often is, "We don't know what our current SLA is!"

Availability Interruptions Don't Play Nice
The table above might lead you to believe that outage time is somehow evenly distributed, but there's no guarantee of when an outage will occur or that outages will politely schedule themselves in small pockets of time over the course of the year. A service with a 99.9% SLA could be down 10 minutes a week, but it could also be down 43 minutes a month or even 8 hours straight once a year. You should assume the worst: an outage will occur during business hours on the day you can least afford it. Most cloud SLAs are measured monthly and provide some recompense if the target is not met.
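
To see why the distribution of outages matters, here is a small sketch that checks one month's outages against a monthly availability target. It assumes a 30-day measurement month, and the function names are hypothetical; real providers define their measurement windows and exclusions in their own SLA terms.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day measurement month

def monthly_availability(outage_minutes):
    """Availability (as a percentage) for one month, given its outage durations."""
    downtime = sum(outage_minutes)
    return 100 * (MINUTES_PER_MONTH - downtime) / MINUTES_PER_MONTH

def meets_sla(outage_minutes, target_pct=99.9):
    """True if the month's availability is at or above the target."""
    return monthly_availability(outage_minutes) >= target_pct

# Four 10-minute outages spread over the month.
print(meets_sla([10, 10, 10, 10]))
# One 8-hour outage: inside a yearly budget of 8.76 hours,
# but measured against a single month it is a very different story.
print(meets_sla([8 * 60]))
```

The same total downtime can look acceptable or alarming depending on the window it is measured over, which is one more reason to read the measurement terms of each SLA closely.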

SLAs are not guarantees. Every cloud provider has missed meeting its SLA at times.

The Subtlety of SLAs by Service: Your Cloud Service SLA is not your Solution's SLA
Nearly all solutions in the cloud use multiple cloud services (for example, a web site might use Compute, Storage, and Database services). Note that the cloud providers do not provide an SLA for your solution, only for the individual services you consume.

An easy trap to fall into is assuming the minimum SLA for the cloud services you are using translates into an overall SLA for your solution. Let's demonstrate the fallacy of that thinking. Let's say your solution makes use of three cloud services, each with a monthly SLA of 99.9%. On the first day of the month, the first cloud service has a 40-minute outage but is otherwise up the remainder of the month - it's still within its SLA. Now let's say the exact same thing happens on the second day of the month to the second cloud service. And then, the same thing happens on the third day with your third cloud service. None of those services has violated its SLA, but your solution has been down 120 minutes that month.

The more services you use, the less uptime you can count on. Unless you're feeling lucky.
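
The three-service example can be put in numbers. The sketch below computes the worst case described above (outages that never overlap, so the downtime simply adds up) and, separately, a common back-of-the-envelope estimate that multiplies the availabilities together. That second estimate assumes the services fail independently and that your solution needs all of them; both the independence assumption and the 30-day month are mine, not something the providers state.

```python
import math

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month

def worst_case_downtime(outages_per_service):
    """Solution downtime if no two service outages overlap: the outages add up."""
    return sum(outages_per_service)

def composite_availability(service_availabilities):
    """Estimated availability of a solution that needs every service up,
    assuming independent failures (an assumption made for illustration)."""
    return math.prod(service_availabilities)

# Three services, each down 40 minutes on a different day of the month.
downtime = worst_case_downtime([40, 40, 40])
print(downtime, "minutes down")
print(f"{100 * (1 - downtime / MINUTES_PER_MONTH):.3f}% solution uptime that month")

# Three services, each running at exactly 99.9%.
print(f"{100 * composite_availability([0.999, 0.999, 0.999]):.3f}% estimated")
```

In a 30-day month, 120 minutes of downtime works out to roughly 99.72% uptime for the solution, even though each individual service stayed above 99.9%.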

Using Multiple Data Centers Does Not Magically Increase 9's
I'm often asked if a 4 9's arrangement can be had by leveraging multiple data centers for failover. The answer is NO: while using multiple data centers and having good failover mechanisms in place does increase your likelihood of uptime, it is not a guarantee. Let's consider the best case and worst case scenarios, where you are using a primary cloud data center and have a failover mechanism in place for a secondary data center that is ready to take over at a moment's notice. In the best case scenario, you experience no downtime at all in the primary data center, never need that second data center, and life is good. It could go that way. In the worst case scenario, you experience downtime in your primary data center and fail over to the second data center -- but then that second data center fails too, soon thereafter. If you don't think that could happen, consider that some of the cloud outages have been worldwide ones where all data centers are unavailable simultaneously.

Leveraging multiple data centers is a good idea - but don't represent that as adding more 9's to your availability. It's not a supportable claim.
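
One way to see why the extra 9's are not a supportable claim: the optimistic arithmetic people usually have in mind assumes the two data centers fail independently. The sketch below contrasts that naive figure with one that sets aside a slice of downtime for outages that hit every data center at once. The model and all of its numbers (including the shared_outage_fraction parameter) are hypothetical, chosen only to illustrate the reasoning; they are not measurements from any provider.

```python
def naive_failover_availability(single_dc):
    """Availability of two data centers if their failures were fully independent."""
    return 1 - (1 - single_dc) ** 2

def failover_availability(single_dc, shared_outage_fraction):
    """Same estimate, but a fraction of each data center's downtime is assumed
    to come from outages that take down both data centers together."""
    unavailability = 1 - single_dc
    shared = unavailability * shared_outage_fraction   # both down at once
    independent = unavailability - shared              # failures local to one data center
    return 1 - (shared + independent ** 2)

# The naive, fully independent estimate for two 99.9% data centers.
print(f"{100 * naive_failover_availability(0.999):.4f}%")
# Hypothetical: a quarter of downtime comes from worldwide outages.
print(f"{100 * failover_availability(0.999, 0.25):.4f}%")
```

Under these made-up numbers the naive figure looks like far more than 4 9's, while the version with shared outages does not even reach 4 9's. Neither figure accounts for the time the failover itself takes.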

How to Approach Cloud SLAs: Pessimistically
The only sane way to approach cloud computing SLAs (or any SLA for that matter) is to be extremely pessimistic and assume the worst possible case. If you can design mitigations and contingency plans for the worst case, you are well-prepared for any eventuality. If on the other hand you are "hoping for the best", your plans are extremely shaky. Murphy's Law should dominate your thinking when it comes to designing for failure.

Do not forget that Cloud SLAs are not a promise; they are a target. There may be some consequences such as refunds if your cloud vendor fails to meet their SLA, but that is usually little consolation for the costs of your business being down.

Cloud SLA Planning Best Practices
1. Become fully familiar with the SLA details of each cloud service you consume.
2. Keep in mind many cloud services are in preview/beta and may not be backed by any SLA.
3. Don't think of an SLA as a guarantee; your cloud provider will not always meet their SLA.
4. Do not confuse the availability SLA for individual cloud services with the overall uptime for your solution. There is no SLA for your solution, and the more services you use, the less uptime you are likely to have.
5. Do not plan on being lucky.
6. Build in contingency planning for a data center failure that allows you to fail-over to another data center. Yes, this will be extra work and increase your costs. Yes, it is worth it.
7. Remember that worldwide outages can occur, and you need a contingency plan for that scenario as well.

To Cloud, or Not to Cloud?
Does all of this mean the cloud is a bad place to run your applications? Not at all - it could well be an improvement over what your local experience is. And, there are significant benefits to running in the cloud that may more than offset the inconvenience of occasional downtime. The important thing is that you approach the cloud eyes-open and with realistic expectations.

Monday, February 11, 2013

Microsoft Certification Exam 70-481: Essentials of Developing Windows Store Apps using HTML5 and JavaScript

I am working toward the MCSD Windows Store Apps (Windows 8) certification, which is available in two flavors (JavaScript and XAML). Today I took and passed the second of three exams on the JavaScript track, 70-481: Essentials of Developing Windows Store Apps using HTML5 and JavaScript.

I'm not allowed to share details about the exam, of course, but I can share how I prepared for it. The first thing I'll say is that it is not an easy exam. I took it and narrowly missed passing a week ago, and on my re-take today I passed. So the first bit of advice is, be sure to get a Second Shot voucher from Prometric when you're signing up to take the exam - you might need it.

I chose the WinJS exam track because I've been deep on HTML5, CSS, and JavaScript for the last two years - my JavaScript skills are fresh and my XAML is a bit rusty. The first exam in the MCSD Windows Store Apps WinJS track is the same first exam of the MCSD Web Applications track, 70-480, which I'd already taken and blogged about. That's a convenient arrangement if you want to go for a "double major": a dual MCSD in Web Apps and Windows Store Apps. Here are the three exams in the MCSD Windows Store Apps WinJS track:

70-480: Programming in HTML5 with JavaScript and CSS
70-481: Essentials of Developing Windows Store Apps using HTML5 and JavaScript
70-482: Advanced Windows Store App Development using HTML5 and JavaScript

I think the main reason I found the exam difficult is that it was so comprehensive: there's an awful lot going on in Windows 8 and the exam covered a lot of territory. You can see the list of what's covered here in the Skills Measured section. Secondly, I think I faltered a bit the first time around on namespaces - I memorized them more solidly for the second time and I think that helped me better discern right answers from wrong ones.

Here's what I did to prepare for the exam:

Develop Windows 8 Apps
I've authored about 5 "real" Windows 8 apps, several of which have gone through the Windows Store process successfully. I think you need this hands-on experience in order to be prepared for the exam.

WinJS Samples
The WinJS samples are really valuable. There are many of them, most of them short and to the point, and they are a tremendous learning aid.

Read the documentation. Sure, some of the doc pages are dry or just list class details without much of an explanation - but there are also great conceptual pages.

I also found a lot of good information in blogs when I searched on various topics.

Walk the Skills Measured List
I went through the Skills Measured checklist and ensured I knew the basics about each skill listed.

Microsoft Virtual Academy
There are 8 hours or so of videos on Microsoft Virtual Academy to prep for this exam by Jeremy Foster and Michael Palermo - watch them. It's called the Developing Windows 8 Apps with HTML5 Jump Start (note: there's also a jump start course available for the first exam, 70-480).

CodeShow
Jeremy Foster's CodeShow solution contains lots of great samples, all in one Visual Studio solution.

Well, there you have it. Next month I hope to summon the nerve to go for the Advanced exam and obtain my MCSD for Windows Store Apps.