Wednesday, January 23, 2019

10 SaaS Essentials: #2 Highly Available

In this series I am discussing 10 essential characteristics of good Software-as-a-Service applications. In this second post I cover Essential #2: Highly Available, along with some thoughts on how to achieve High Availability (HA) on Microsoft Azure or Amazon Web Services. I'll be discussing specific features and cloud services of both AWS and Azure in this series, as I've implemented SaaS solutions on both platforms.

#2 Highly Available

Essential #2 is being Highly Available. If you have paying customers for your SaaS, odds are you're providing a vital service for them. Whether you're providing collaboration services or sales team services or HR services or legal services or developer services, your customers will want your service to be available any time they need it. That might be business hours for some, or 24 x 7 for others.

That being said, not all SaaS applications will have the same availability requirements. You should consider what needs to happen when a required cloud service is temporarily unavailable. Your options range from your app being temporarily unavailable (which might be acceptable for some applications) to some kind of alternative processing all the way to re-routing users to another deployment of your application in another region. Think through what your availability needs are and whether you're willing to invest the time, effort, and additional costs associated with some of the remedies we'll be discussing.

Before we get to the technical considerations of high availability, we need to discuss what availability means in terms of SLAs.

Setting Expectations with Availability Targets

What availability can you expect to provide, and what expectations should you set with prospects and clients? That's both an important question and a scary one: since your SaaS is highly dependent on a cloud platform that you don't control, what assurances can you provide that won't backfire on you? To answer this question, begin with the assurances your cloud platform gives. Uptime commitments for cloud providers are commonly expressed as a Service Level Agreement (SLA), in which you'll find availability targets. Understand that availability targets are merely objectives. They are not guarantees.

Understanding the Nines

Availability is usually quoted as a percentage of uptime. For example, 99.99% uptime/month ("4 nines") means you can expect 4.38 minutes of downtime a month, which works out to 52.6 minutes of downtime per year. Compare that to the similar-sounding 99.95%, which is 21.92 minutes of downtime a month, or 4.38 hours a year. Here's a comparison of availability targets to put things in perspective.

Availability %   Also Known As   Downtime per Month   Downtime per Year
99.999%          5 nines         26.3 seconds         5.26 minutes
99.99%           4 nines         4.38 minutes         52.6 minutes
99.95%           3.5 nines       21.92 minutes        4.38 hours
99.9%            3 nines         43.83 minutes        8.77 hours
99%              2 nines         7.31 hours           3.65 days
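The downtime figures above follow directly from the availability percentage. As a quick sketch of the arithmetic (assuming an average month of 365.25/12 days; the function name is illustrative):

```python
# Convert an availability percentage into expected downtime.
MINUTES_PER_YEAR = 365.25 * 24 * 60        # about 525,960 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # about 43,830 minutes

def downtime_minutes(availability_pct, period_minutes):
    """Expected downtime in minutes for a given availability percentage."""
    return (1 - availability_pct / 100) * period_minutes

# Four nines: roughly 4.38 minutes/month, 52.6 minutes/year
print(round(downtime_minutes(99.99, MINUTES_PER_MONTH), 2))  # 4.38
print(round(downtime_minutes(99.99, MINUTES_PER_YEAR), 1))   # 52.6
```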

Cloud Service SLAs

You can view the AWS Service Level Agreements and Azure Service Level Agreements online. Let's take a look at one. At the time of this writing, the AWS Compute Service Level Agreement (covering EC2 and several other services) commits to a monthly uptime target of 4 nines, 99.99%. Note, however, the fine print: there's a specific formula for how availability is measured, and AWS will take "commercially reasonable efforts" to meet their availability target, not heroic measures. Other services you use will have their own separate SLAs, which might promise lower availability. For example, AWS S3 is 99.9% (3 nines) as of this writing. The cloud providers' remedy for not meeting their target is merely a partial refund of your monthly bill, but the damages your clients might incur if your service is unavailable could be far higher. In formulating your own SLA, it's a good idea to clearly state the obligations and remedies assumed by your cloud platform, and then the obligations and remedies your company assumes.

Although it's helpful to know the expected availability of cloud services, that in itself tells you little about the uptime you can expect from your solution in the cloud. There are two reasons for that. The first is that you almost certainly consume multiple cloud services together, so you must calculate a composite SLA. The second is that your application's architecture and implementation could compromise that SLA.

Composite SLAs

Let's use the term cloud workload to mean something you have running in the cloud that relies on a collection of services. That is, if even one of those cloud services isn't available then your workload can't run. In that case, your availability is going to be the arithmetic product of the SLAs of the individual services.

For example, let's say you have a web site hosted in EC2 that relies on an RDS Oracle database and the S3 storage service. You review the latest SLAs, and find that EC2 has expected availability of 99.99%, while RDS Oracle is 99.95%, and S3 is 99.9%. The composite SLA would be

99.99% x 99.95% x 99.9% = 99.84%

Using an availability calculator, that's downtime of 1.17 hours a month or 14.03 hours a year.

On the other hand, if your application is written in such a way that it can get by without some cloud services when they aren't available, you can remove those services from the calculation. In the prior example, if the app is able to run just fine when S3 is down, the calculation becomes 99.99% x 99.95% = 99.94%. That's downtime of 26.30 minutes a month or 5.26 hours per year—much improved.
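The composite arithmetic is easy to automate. A minimal sketch (the function name is illustrative):

```python
from functools import reduce

def composite_sla(*availabilities_pct):
    """Composite availability of serially dependent services:
    the product of the individual SLA percentages."""
    fraction = reduce(lambda acc, a: acc * (a / 100), availabilities_pct, 1.0)
    return fraction * 100

# EC2 (99.99%) x RDS Oracle (99.95%) x S3 (99.9%) from the example above
print(round(composite_sla(99.99, 99.95, 99.9), 2))  # 99.84
# Without the S3 dependency:
print(round(composite_sla(99.99, 99.95), 2))         # 99.94
```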

The composite SLA can be sobering, but it's important to have a realistic view of availability. Don't make the mistake of touting one cloud service's availability as your SaaS's overall availability: it won't be.

Not Compromising the SLA: Resiliency

Once you have an informed view of what availability to expect from the arrangement of cloud services you're using, you must now ensure your application doesn't compromise the SLA. This can be hard to do, but it's vital. One of the most important things you can do is make your application resilient to the transient failures that come with the cloud.

If your code simply fails when a needed resource isn't available, it isn't going to fare well in the cloud. This is especially true of legacy code, often written for an enterprise environment where there is a reasonable expectation that everything is available during business hours. If such code doesn't handle failures well, moving it to the cloud can seem okay at first—but the first time a needed resource isn't available, what will be the effect on your application? It could result in data loss or perhaps even an improperly executed business transaction. A better coding strategy in cloud-hosted applications is to anticipate and proactively handle temporary unavailability.

There's a valuable article on resiliency in the Microsoft Azure documentation. Let's consider some of the recommended patterns for cloud application resiliency.

Retry

Your code usually has no problem connecting to your database, but this time it fails: the database service has become unavailable. Applying the retry pattern, your code attempts a limited number of retries before giving up on accessing the database. You would do this in a loop, with a delay between attempts. Don't overdo it, and under no circumstances should this be an infinite loop. On a successful connection you exit the loop and get back to work. If you complete the loop without a successful connection, it's time to either fail gracefully or consider one of the other patterns, such as a fallback path.

This same approach can be applied to any cloud service you use, not just databases. Before implementing it, check whether your client API already performs retries; if it does, there's no reason to add redundant retry code. Many cloud APIs not only perform retries but also implement exponential backoff algorithms.
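If your client library doesn't already retry for you, a bounded retry loop with exponential backoff might look like this sketch (names and defaults are illustrative, not from any particular SDK):

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run `operation`, retrying on failure with exponential backoff
    plus a little jitter. This is a bounded loop: never retry forever.
    In real code, catch only the transient errors your client raises,
    not every Exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: fail gracefully or take a fallback path
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage: result = with_retries(lambda: db_client.connect())
```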

Circuit Breaker

The retry pattern is a good technique when trying to connect to a cloud service, but what should we do if we zoom out to the level where a complete business process is invoked? The circuit breaker pattern, like a real-world house circuit breaker, trips when a certain threshold has been reached. When a business process is seen to be failing, the circuit breaker code around it can switch to an alternative processing path or make the process unavailable—but it should also periodically re-try the process code to see whether things are working again, in which case the process can be switched back on.
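A minimal sketch of the idea (class name and thresholds are illustrative; hardened implementations exist in resilience libraries such as Polly for .NET or resilience4j for Java):

```python
import time

class CircuitBreaker:
    """Trips open after `threshold` consecutive failures, then routes
    callers to a fallback until `reset_after` seconds pass, at which
    point one trial call is allowed through to test recovery."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, operation, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit is open: skip the failing process entirely.
                return fallback() if fallback else None
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
            self.failures = 0  # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback() if fallback else None
```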

Bulkhead

Temporary cloud service unavailability is sometimes worsened by code that uses the same client connection to access multiple resources of the same type. For example, you use the same web client to talk to multiple web services, or the same database client to talk to multiple partitions. When that's the case, it's possible an error such as an exhausted connection pool will compromise your client connection as it tries to deal with an availability problem, at the same time denying you access to those other resources, which may in fact be ready and available.

On a ship, bulkheads are the partitioned compartments that prevent a hull breach from flooding the entire ship. The bulkhead pattern says you should not only logically group your cloud resources into functional areas (like related web services or database partitions or storage accounts) but also make separate client connections to them. That way, a failure connecting to your shipping services won't also derail a connection to your warehouse services.
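One way to sketch the bulkhead idea in code is to hand out a separate client (and therefore a separate connection pool) per functional area instead of sharing one; the class and factory names here are illustrative:

```python
class ServiceClients:
    """Bulkhead sketch: one isolated client per functional area, so
    exhausting the shipping pool can't take down warehouse calls.
    `make_client` stands in for your real client factory, e.g. a
    function returning a fresh HTTP session."""

    def __init__(self, make_client):
        self._make_client = make_client
        self._clients = {}

    def for_area(self, area):
        # Lazily create an isolated client per area instead of sharing one.
        if area not in self._clients:
            self._clients[area] = self._make_client(area)
        return self._clients[area]

# Usage (hypothetical):
# clients = ServiceClients(lambda area: requests.Session())
# clients.for_area("shipping").get(...)   # isolated from...
# clients.for_area("warehouse").get(...)  # ...this pool
```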

Fallback Paths

One choice you have when a cloud service isn't available is to use a different cloud service and/or to defer the processing for another time. For example, you can handle the temporary unavailability of a database by writing transactions to a storage queue, where they can be processed later when the database is available.
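As an illustrative sketch of that fallback path, here an in-memory deque stands in for a real storage queue such as SQS or Azure Queue Storage (all names are made up for the example):

```python
from collections import deque

class DeferredWriter:
    """Fallback-path sketch: try the database first; on failure, park
    the transaction in a queue to be replayed once the database is back,
    rather than losing the data."""

    def __init__(self, write_to_db):
        self.write_to_db = write_to_db
        self.deferred = deque()

    def save(self, txn):
        try:
            self.write_to_db(txn)
            return "written"
        except ConnectionError:
            self.deferred.append(txn)  # defer instead of failing outright
            return "deferred"

    def replay(self):
        """Drain the queue when the database is available again."""
        while self.deferred:
            self.write_to_db(self.deferred[0])
            self.deferred.popleft()  # remove only after a successful write
```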

Of course, not all activities require this level of resiliency. For some activities a try again later message to the user may suffice—for your SaaS, you'll have to make the call. The important thing is proactively planning how you will handle unavailability of the cloud services you make use of.

Graceful Degradation

It may be that your app simply doesn't have a good path forward because one or more cloud resources aren't available at the moment. Even in this case, there's a big difference in user experience between apps that simply give up vs. apps that strive for graceful degradation.

Graceful degradation means providing whatever limited functionality you can under failing circumstances: it's the difference between a bumpy landing and a crash landing to your users. If possible, preserve what information you can—perhaps the user has filled out a very large form—somewhere for later processing. It might mean displaying alternative guidance, such as a manual procedure the user can follow for their task. It might be simply putting up an error message, but in a friendly, reassuring way.

High Availability through Redundancy

Let's now talk about the arrangement of your software in the cloud, and what is needed to maintain high availability.

Avoid Single Points of Failure

While cloud platforms contain amazingly smart infrastructure, the hardware utilized is chosen to keep costs low. That means, at any given time, the server that hosts your web server VM could fail; or the database server hosting your database VM could fail; and so on. If you only have one of these components, you're down. But if there are multiple instances, with an ability to recover what was lost, your users don't even need to know that anything has happened.

High Availability for Web Farms

To make your web tier highly available, you should run multiple instances in a web farm fronted by a load balancer. The minimum number of instances you want is 2: that way, a failure of one server instance won't kill off your application because the other instance will still be alive. If you use a PaaS service (which I highly recommend) like Azure Cloud Services or AWS Elastic Beanstalk, a lost instance will be automatically re-created: the farm will maintain the minimum instance count you configure.

Of course, a failure might be larger than just a single server. Your cloud region (data center) is further divided into smaller logical data centers, called availability zones. Services like Azure Cloud Services and AWS Beanstalk can be configured to deploy your instances across multiple availability zones. This allows your SaaS to remain available even if an entire availability zone fails.

[Diagram: Web Server HA in a Region]

Going further, it's also possible for an entire region to become unavailable. If you need to withstand even this condition, you'll need to deploy your application to multiple regions around the world. You can then use services like Azure Traffic Manager or AWS Route 53 to distribute web traffic to the most appropriate region, based on availability and other criteria such as user location. You will of course pay significantly more for this level of redundancy. There are different ways to utilize multiple regions: you could operate a primary region with a secondary used only for failover, or you could have two or more regions in regular operation because your clientele are located in different geographies.

[Diagram: Site HA Across Regions]

It's important for the web software to cooperate with the cloud HA model in terms of state. The ideal fit for the cloud's model for high availability is stateless web server software, meaning any kind of session state or global state information is stored in an external resource such as a distributed memory cache or database. As the load balancer distributes traffic across multiple instances, user session context is maintained—even if the cloud needs to reallocate a failed instance or increase/decrease the number of instances in order to auto-scale.

Legacy software that originally targeted a single server and keeps state in memory won't work in this model. One possible option in this case is to use a load balancer configured for server affinity ("sticky sessions"), which will continually route each client to the same instance. This remedy is not ideal, because cloud instances can always come and go. A better strategy is to update the software to store state in a distributed memory cache (such as Redis on AWS ElastiCache / Azure Cache for Redis) or a cloud database (such as AWS DynamoDB, Azure Cosmos DB, or your relational cloud database).
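The stateless idea can be sketched as follows. Purely for illustration, a plain dict stands in for the distributed cache; the point is that any web instance sharing the external store sees the same session, so the load balancer can route a user anywhere:

```python
class SessionStore:
    """Stateless-web-tier sketch: session state lives outside the web
    server, in a shared backend (in real life, Redis or a cloud database;
    here, a dict used as a stand-in)."""

    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def put(self, session_id, state):
        self.backend[session_id] = state

    def get(self, session_id):
        return self.backend.get(session_id)

# Two "web server instances" sharing one store: B can read what A wrote,
# so losing instance A doesn't lose the user's session.
shared = {}
instance_a = SessionStore(shared)
instance_b = SessionStore(shared)
instance_a.put("sess-42", {"cart": ["sku-1"]})
print(instance_b.get("sess-42"))  # {'cart': ['sku-1']}
```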

High Availability for Databases

If you're using a PaaS database service (highly recommended) for a relational database such as Azure SQL Database or Amazon RDS, those services will allocate multiple servers with automatic failover. The standby instances will be kept up-to-date as the primary is updated. In the event of a failure of the primary, a standby instance takes over. However, to span multiple availability zones, you'll need to configure your cloud service for multi-zone availability which will likely cost more. Your cloud service may also offer features for geo-replication which you should investigate.

What about globally available database solutions? There are various solutions and patterns for this on Azure and AWS for different databases. In general, adapting a traditional relational database to be global is going to require some work involving replication and synchronization, and if you're going in this direction you'll need to follow guidance from your cloud provider specific to that database and carefully consider the related costs and performance characteristics. You're strongly advised to follow the prescription for a proven global database pattern rather than inventing your own solution.

If you're really intent on global presence for your database, you're advised to consider using one of the newer cloud databases specifically designed for global scale. These include NoSQL databases like Azure's Cosmos DB and AWS's DynamoDB as well as relational databases like Amazon Aurora Global Database. There's a great deal of ongoing innovation by cloud providers around reinventing databases for high availability and scale. With these databases, there's significantly less work for you to do in configuring and operating them and there are fewer moving parts exposed to you. Many of them offer excellent performance and 4 9's availability (or better). Whether one of these databases is a good fit for you and the kinds of queries you need to perform is of course an individual consideration you'll have to research.

High Availability for your Services Tier

For traditional HTTP/HTTPS/TCP web services, you can follow the earlier web server guidance for high availability: configure at least 2 instances in a load-balanced farm, across multiple availability zones. If your services take their tasks from a queue or database, you can likewise configure a worker farm of multiple instances across multiple availability zones.

There is another kind of service, however, that takes advantage of the serverless computing capabilities cloud platforms now offer. Function-as-a-Service offerings like AWS Lambda and Azure Functions are frequently used for microservices and provide built-in high availability. There are no pre-allocated servers to fail: the cloud platform allocates servers as needed in response to demand. Availability comes out of the box with serverless computing.
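For illustration, here's a minimal Python handler in the AWS Lambda style; the `handler(event, context)` signature matches the Lambda Python runtime, but the event shape below is assumed for the example (in practice it's defined by your trigger, such as API Gateway):

```python
import json

def handler(event, context=None):
    """Echo a greeting from an API-Gateway-style event body.
    The platform, not you, provisions the servers that run this."""
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# The cloud platform invokes handler(event, context) on demand;
# there are no instances of yours to keep alive between requests.
```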

High Availability for Single-Instance Software

We've spent much of this post presenting redundancy as a necessary ingredient of high availability, but what if a component of your solution isn't capable of running as multiple instances? That might be something like a background service or a web service. The answer lies in whether the component is required to always be available.

If a single-instance component does not need to be continually available, it can remain single instance. For example, a background service that only runs once overnight does not need to always be available. In that case, if the instance were lost due to hardware failure, the cloud infrastructure would replace it before long, the component would resume, and all would be well. The same is true for any kind of processing worker that takes work from a non-interactive source such as a queue or database.

However, if a single-instance component must be available in order for the rest of the solution to function that is another story. For example, the temporary loss of a web service that processes payments could cause significant disruption. In this case, the component is a single point of failure and compromises high availability. In order to realize high availability, the service must be modified so that it can run multi-instance.

Test Availability—Or Your Customers Will!

High availability needs to be addressed for every tier of your SaaS application. When you think you have an HA architecture in place, prove it out with some "sabotage": test every recovery mechanism you believe is in place. For example, deallocate web server instances while the solution is being used: no one should be disrupted. If you've made your application resilient against temporary unavailability of a cloud service, re-configure that cloud service so that it is no longer accessible: did your planned graceful handling work as intended? If you've deployed to multiple regions with traffic management and failover, shut down your site in a region and observe what happens to users.

Monitor Availability

Cloud platforms offer some superb tools and services for achieving high availability. Nevertheless, it's easy to overlook a software component or fail to configure things correctly. Be sure to monitor operations and pay attention to your application's actual availability.

In Conclusion

Availability: it's being there for your clientele. Eliminate single points of failure and think horizontal in your application architecture: redundancy is your greatest tool for high availability in the cloud. Make your instances stateless, and code for resiliency. Above all, take advantage of the newer cloud services where high availability features are built in and automatic. Operations monitoring is essential.
