#2 Highly Available
Essential #2 is being Highly Available. If you have paying customers for your SaaS, odds are you're providing a vital service for them. Whether you're providing collaboration services, sales team services, HR services, legal services, or developer services, your customers will want your service to be available any time they need it. That might be business hours for some, or 24 x 7 for others.
That being said, not all SaaS applications will have the
same availability requirements. You should consider what needs to happen when a
required cloud service is temporarily unavailable. Your options range from your
app being temporarily unavailable (which might be acceptable for some
applications) to some kind of alternative processing all the way to re-routing
users to another deployment of your application in another region. Think
through what your availability needs are and whether you're willing to invest
the time, effort, and additional costs associated with some of the remedies
we'll be discussing.
Before we get to the technical considerations of high
availability, we need to discuss what availability means in terms of SLAs.
Setting Expectations with Availability Targets
What availability can you expect to provide, and what
expectations should you set with prospects and clients? That's both an
important question and a scary one: since your SaaS is highly dependent on a
cloud platform that you don't control, what assurances can you provide that
won't backfire on you? To answer this question, begin with the assurances your
cloud platform gives. Uptime commitments for cloud providers are commonly
expressed as a Service Level Agreement (SLA), in which you'll find availability
targets. Understand that availability targets are merely objectives. They are
not guarantees.
Understanding the Nines
Availability is usually quoted as a percentage of uptime. For
example, 99.99% uptime/month ("4 nines") allows for up to 4.38
minutes of downtime a month, which works out to 52.6 minutes of downtime per
year. Compare that to the similar-sounding 99.95%, which is 21.92 minutes of
downtime a month, or 4.38 hours a year. Here's a comparison of availability
targets to put things in perspective.
Availability % | Also Known As | Downtime per Month | Downtime per Year
99.999% | 5 nines | 26.3 seconds | 5.26 minutes
99.99% | 4 nines | 4.38 minutes | 52.6 minutes
99.95% | | 21.92 minutes | 4.38 hours
99.9% | 3 nines | 43.83 minutes | 8.77 hours
99% | 2 nines | 7.31 hours | 3.65 days
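If you'd like to check these figures or try a target that isn't in the table, the arithmetic is simple. Here's a minimal Python sketch; the function names are my own, and it assumes an average month of 730.5 hours (365.25 days divided by 12), which is what the table's numbers appear to be based on.

```python
# Assumption: an average month is 365.25 days / 12 = 730.5 hours.
HOURS_PER_YEAR = 365.25 * 24
HOURS_PER_MONTH = HOURS_PER_YEAR / 12


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Return the downtime (in hours) that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * period_hours


def describe(availability_percent: float) -> str:
    """Format the monthly and yearly downtime budget for one target."""
    per_month_min = allowed_downtime_hours(availability_percent, HOURS_PER_MONTH) * 60
    per_year_hours = allowed_downtime_hours(availability_percent, HOURS_PER_YEAR)
    return (f"{availability_percent}%: {per_month_min:.2f} min/month, "
            f"{per_year_hours:.2f} h/year")


if __name__ == "__main__":
    for target in (99.999, 99.99, 99.95, 99.9, 99.0):
        print(describe(target))
```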
Cloud Service SLAs
You can view the AWS Service Level
Agreements and Azure Service Level
Agreements online. Let's take a look at one. At the time of this writing,
the AWS Compute Service Level
Agreement (covering EC2 and several other services) commits to a monthly
uptime target of 4 nines, 99.99%. Note, however, the fine print: there's a
specific formula for how availability is measured. AWS will use "commercially
reasonable efforts" to meet its availability target, not heroic
measures. Other services you use will have their own separate SLAs, which
might have lower availability. For example, AWS S3 is 99.9% (3 nines) as of
this writing. The cloud providers' remedy for not meeting their target is merely
a partial refund of your monthly bill, but the damages your clients might incur
if your service is unavailable could be far higher. In formulating your own
SLA, it's a good idea to clearly state the obligations and remedies assumed by your cloud
platform, and then the obligations and remedies your company assumes.
Although it's helpful to know the expected availability of
cloud services, that in itself tells you little about the uptime you can expect
from your solution in the cloud. There are two reasons for that. The first is
that you almost certainly consume multiple cloud services together, so you must
calculate a composite SLA. The second is that your application's architecture
and implementation could compromise that SLA.
Composite SLAs
Let's use the term cloud workload to mean something you have
running in the cloud that relies on a collection of services. That is, if even
one of those cloud services isn't available then your workload can't run. In
that case, your availability is going to be the arithmetic product of the SLAs
of the individual services.
For example, let's say you have a web site hosted in EC2
that relies on an RDS Oracle database and also the S3 Storage Service. You
review the latest SLAs, and find that EC2 has expected availability of 99.99%, while RDS Oracle is 99.95%, and S3 is 99.9%.
The composite SLA would be
99.99% x 99.95% x 99.9% = 99.84%
On the other hand, if your application is written in such a
way that it can get by without some cloud services when they aren't available,
you can remove those services from the calculation. In the prior example, if
the app is able to run just fine when S3 is down, the calculation becomes
99.99% x 99.95% = 99.94%. That's downtime of 26.30 minutes a month or 5.26
hours per year—much improved.
The composite SLA can be sobering, but it's important to
have a realistic view of availability. Don't make the mistake of touting one
cloud service's availability as your SaaS's overall availability: it won't be.
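The composite calculation is easy to script so you can re-run it whenever your list of services changes. This is a minimal Python sketch with a function name of my own choosing. It uses the percentages quoted in the example above (check the current SLAs before relying on them), and like the formula itself it assumes the services fail independently of one another.

```python
from math import prod


def composite_availability(*percents: float) -> float:
    """Multiply the availabilities of services that must all be up at once."""
    return prod(p / 100 for p in percents) * 100


# Figures from the example above: EC2, RDS Oracle, and S3.
all_required = composite_availability(99.99, 99.95, 99.9)
# The same workload if the app can keep running while S3 is down.
without_s3 = composite_availability(99.99, 99.95)

print(f"{all_required:.2f}%")  # 99.84%
print(f"{without_s3:.2f}%")    # 99.94%
```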
Not Compromising the SLA: Resiliency
Once you have an informed view of what availability to
expect from the arrangement of cloud services you're using, you must now ensure your application doesn't compromise
the SLA. This can be hard to do, but it's vital. One of the most important
things you can do is make your application resilient to the transient failures
that come with the cloud.
If your code simply fails when a needed resource isn't
available, it isn't going to fare well in the cloud. This is especially true of
legacy code, often written for an enterprise environment where there is a
reasonable expectation that everything is available during business hours. If
such code doesn't handle failures well, moving it to the cloud can seem okay at
first—but the first time a needed resource isn't available what will be the
effect on your application? It could result in data loss or perhaps even an
improperly executed business transaction. A better coding strategy in
cloud-hosted applications is to anticipate and proactively handle temporary
unavailability.
There's a valuable article
on resiliency in the Microsoft Azure documentation. Let's consider some of
the recommended patterns for cloud application resiliency.
Retry
Your code usually has no problem connecting to your
database, but this time it fails: the database service has become unavailable. Applying
the retry
pattern, your code could attempt retries before giving up on the idea of
accessing the database. You would do this in a loop, with a delay between
attempts. You don't want to overdo it, and under no circumstances should this
be an infinite loop. On a successful connection you exit the loop and get back
to work. If you complete the loop without a successful connection, it's time to
either fail gracefully or consider one of the other patterns such as a fallback
path.
This same approach can be applied to any cloud service you
use, not just databases. Before implementing, check whether your API already
performs retries; if it does, there's no reason to add redundant retry code.
Many of the Cloud APIs not only perform retries but also implement exponential
backoff algorithms.
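If your API doesn't retry for you, a bounded retry loop with exponential backoff might look something like this Python sketch. Everything named here (TransientError, call_with_retry, and the default attempt count and delays) is hypothetical and for illustration only; in real code you'd catch whatever exceptions your client library raises for timeouts and throttling.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call operation(); on TransientError, retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()  # success: leave the loop and get back to work
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: the caller fails gracefully or falls back
            # The delay doubles on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The loop is never infinite: after max_attempts the exception is re-raised, which is the point at which you either fail gracefully or switch to one of the other patterns.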
Circuit Breaker
The retry pattern is a good technique when trying to connect to a cloud service, but what should we do if we zoom out to the level where a complete business process is invoked? The circuit breaker pattern, like a real-world house circuit breaker, trips when a certain threshold has been reached. When a business process is seen to be failing, the circuit breaker code around it can switch to an alternative processing path or make the process unavailable, but it should also periodically retry the process code to see whether things are working again, in which case the process can be switched back on.
Bulkhead
On a ship, bulkheads are the partitioned compartments that
prevent a hull breach from flooding the entire ship. The bulkhead
pattern says you should not only logically group your cloud resources into
functional areas (like related web services or database partitions or storage
accounts) but also make separate client connections to them. That way, a
failure connecting to your shipping services won't also derail a connection to
your warehouse services.
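One way to picture the bulkhead idea in code is to give each group of services its own small, bounded pool of workers. The Python sketch below is only an illustration under that assumption: the Bulkheads class and the "shipping" and "warehouse" names are hypothetical, and a real application would more likely apply the same separation to connection pools or HTTP clients.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """Give each downstream dependency its own small, bounded worker pool."""

    def __init__(self, limits):
        # limits maps a dependency name to its maximum number of concurrent calls.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=n, thread_name_prefix=name)
            for name, n in limits.items()
        }

    def submit(self, dependency, fn, *args, **kwargs):
        # Work for one dependency only ever occupies that dependency's pool.
        return self._pools[dependency].submit(fn, *args, **kwargs)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


if __name__ == "__main__":
    stuck = threading.Event()
    bulkheads = Bulkheads({"shipping": 2, "warehouse": 2})
    # Simulate shipping calls that hang and use up every shipping worker.
    for _ in range(2):
        bulkheads.submit("shipping", stuck.wait)
    # Warehouse calls are unaffected because they run in their own pool.
    print(bulkheads.submit("warehouse", lambda: "stock ok").result(timeout=2))
    stuck.set()  # release the simulated hang so the program can exit
    bulkheads.shutdown()
```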
Fallback Paths
One choice you have when a cloud service isn't available is to use a different cloud service and/or to defer the processing for another
time. For example, you can handle the temporary unavailability of a database by
writing transactions to a storage queue, where they can be processed later when
the database is available.
Of course, not all activities require this level of
resiliency. For some activities a "try again later" message to the user may
suffice; for your SaaS, you'll have to
make the call. The important thing is proactively planning how you will handle
unavailability of the cloud services you make use of.
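To make this concrete, here's a rough Python sketch that combines the circuit breaker described earlier with a fallback path that defers work to a queue. All of the names are hypothetical, the in-memory deque merely stands in for a durable storage queue, and the threshold and cool-down values are arbitrary. It also catches every exception for brevity, where real code would catch only the failure types that indicate the service is unavailable.

```python
import time
from collections import deque


class CircuitBreaker:
    """Trip after repeated failures, use a fallback, retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means closed (normal operation)

    def call(self, primary, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: skip the failing process entirely
            # Cool-down elapsed: fall through and give the primary one trial call.
        try:
            result = primary()
        except Exception:  # simplification: real code should be more selective
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success: switch the process back on
        return result


pending = deque()  # stand-in for a durable storage queue


def save_order(order, write_to_database, breaker):
    """Write to the database, or park the order for later processing."""
    def defer():
        pending.append(order)
        return "queued"
    return breaker.call(lambda: write_to_database(order), defer)
```

A separate worker would drain the queue once the database is reachable again.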
Graceful Degradation
High Availability through Redundancy
Let's now talk about the arrangement of your software in the
cloud, and what is needed to maintain high availability.
Avoid Single Points of Failure
While cloud platforms contain amazingly smart infrastructure, the hardware utilized is
chosen to keep costs low. That means, at any given time, the server that hosts
your web server VM could fail; or the database server hosting your database VM
could fail; and so on. If you only have one of these components, you're down.
But if there are multiple instances, with an ability to recover what was lost,
your users don't even need to know that anything has happened.
High Availability for Web Farms
To make your web tier highly available, you should run
multiple instances in a web farm fronted by a load balancer. The minimum number
of instances you want is 2: that way, a failure of one server instance won't
kill off your application because the other instance will still be alive. If
you use a PaaS service (which I highly recommend) like Azure Cloud Services or
AWS Elastic Beanstalk, your lost instance will be automatically re-created: the
farm will maintain the minimum instance count you configure.
Of course, a failure might be larger than just a single
server. Your cloud region (data center) is further divided into smaller logical
data centers, called availability zones. Services like Azure Cloud Services and
AWS Elastic Beanstalk can be configured to deploy your instances across multiple
availability zones. This allows your SaaS to remain available even if an entire
availability zone fails.
Web Server HA in a Region
Thinking more extremely, it's also possible for an entire
data center to become unavailable. If you need to manage even this
condition, you'll need to deploy your application to multiple regions around
the world. You can then use services like Azure Traffic Manager or AWS Route 53
to distribute web traffic to the most appropriate data center, based on
availability and other criteria such as user location. You are of course paying
significantly more for this level of redundancy. There are different ways to
utilize multiple data centers: you could choose to operate a primary region
with a secondary only used for failover; or you could have two or more
regions in regular operation because your clientele are located in different geographies.
Site HA Across Regions
It's important for the web software to cooperate with the
cloud HA model in terms of state. The ideal fit for the cloud's model for high
availability is stateless web server software, meaning any kind of session
state or global state information is stored in an external resource such as a
distributed memory cache or database. As the load balancer distributes traffic
across multiple instances, user session context is maintained—even if the cloud
needs to reallocate a failed instance or increase/decrease the number of
instances in order to auto-scale.
Legacy software that originally targeted a single server and
keeps state in memory won't work in this model. One possible option in this
case is to use a load balancer configured for server affinity ("sticky
sessions"), which will continually route each client to the same instance.
This remedy is not ideal, because cloud instances can always come and go. A
better strategy is to update the software to store state in a distributed
memory cache (such as Redis on AWS ElastiCache / Azure Cache) or cloud database
(such as AWS DynamoDB, Azure Cosmos DB, or your relational cloud database).
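In code, "stateless" mostly comes down to reading and writing session data through an external store on every request rather than holding it in the web server's memory. Here's a minimal Python sketch of that idea. The SessionStore interface, the InMemoryStore stand-in, and the add_to_cart handler are all hypothetical names for illustration; in production the store would be backed by your distributed cache or database, and you'd also need to think about expiry and concurrent updates.

```python
import json
from typing import List, Optional, Protocol


class SessionStore(Protocol):
    """Anything that can get and set string values by key."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for local testing only; not shared between real servers."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def add_to_cart(store: SessionStore, session_id: str, item: str) -> List[str]:
    """Handle one request without keeping any state on the web server itself."""
    key = f"session:{session_id}"
    raw = store.get(key)
    cart = json.loads(raw) if raw else []
    cart.append(item)
    store.set(key, json.dumps(cart))
    return cart
```

Because the handler keeps nothing between calls, it doesn't matter which instance the load balancer picks for the next request, as long as every instance talks to the same store.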
High Availability for Databases
If you're using a PaaS database service (highly recommended)
for a relational database such as Azure SQL Database or Amazon RDS, those
services will allocate multiple servers with automatic failover. The standby
instances will be kept up-to-date as the primary is updated. In the event of a
failure of the primary, a standby instance takes over. However, to span
multiple availability zones, you'll need to configure your cloud service for
multi-zone availability which will likely cost more. Your cloud service may
also offer features for geo-replication which you should investigate.
What about globally available database solutions? There are
various solutions and patterns for this on Azure
and AWS for
different databases. In general, adapting a traditional relational database to
be global is going to require some work involving replication and
synchronization, and if you're going in this direction you'll need to follow
guidance from your cloud provider specific to that database and carefully
consider the related costs and performance characteristics. You're strongly
advised to follow the prescription for a proven global database pattern rather
than inventing your own solution.
If you're really intent on global presence for your
database, you're advised to consider using one of the newer cloud databases
specifically designed for global scale. These include NoSQL databases like Azure's
Cosmos DB and AWS's DynamoDB, as well as relational databases like Amazon Aurora Global
Database. There's a great deal of ongoing innovation by cloud providers
around reinventing databases for high availability and scale. With these
databases, there's significantly less work for you to do in configuring and
operating them and there are fewer moving parts exposed to you. Many of them
offer excellent performance and 4 nines availability (or better). Whether one of
these databases is a good fit for you and the kinds of queries you need to
perform is of course an individual consideration you'll have to research.
High Availability for your Services Tier
For traditional HTTP/HTTPS/TCP web services, you can follow
the earlier web server guidance for high availability: configure at least 2
instances in a load-balanced farm, across multiple availability zones. If your
services take their tasks from a queue or database, you can likewise configure
a worker farm of multiple instances across multiple availability zones.
There is another kind of service, however, that takes
advantage of the new serverless computing capabilities cloud platforms are now offering.
Function-as-a-Service offerings like AWS Lambda and Azure Functions are
frequently used for microservices and provide built-in high
availability. There are no pre-allocated servers to fail with your serverless
function: the cloud platform will allocate servers as needed in response to
demand. Availability comes out-of-the-box with serverless computing.
High Availability for Single-Instance Software
We've spent much of this post presenting redundancy as a
necessary ingredient in providing high availability, but what if there's a
component of your solution that isn't capable of running on multiple instances?
That might be something like a background service or a web service. The answer
to this question lies in whether the component is required to always be
available.
If a single-instance component does not need to be
continually available, it can remain single instance. For example, a background
service that only runs once overnight does not need to always be available. In
that case, an instance lost to hardware failure would be replaced by
the cloud infrastructure before long, the component would resume, and all would
be well. The same would be true for any kind of processing worker that takes
work from a non-interactive source such as a queue or database.
However, if a single-instance component must be available in
order for the rest of the solution to function, that is another story. For
example, the temporary loss of a web service that processes payments could
cause significant disruption. In this case, the component is a single point of
failure and compromises high availability. In order to realize high
availability, the service must be modified so that it can run multi-instance.
Test Availability—Or Your Customers Will!
High availability needs to be addressed for every tier of
your SaaS application. When you think you have an HA architecture in place, you
should prove it out with some "sabotage": every recovery mechanism
you think is in place needs to be exercised. For example, deallocate web
server instances while the solution is being used: no one should be disrupted.
If you've made your application resilient against temporary unavailability of a
cloud service, re-configure that cloud service so that it is no longer
accessible: did your planned graceful handling work as intended? If you've
deployed to multiple regions with traffic management and failover, shut down
your site in a region and observe what happens to users.
Monitor Availability
Cloud platforms offer some superb tools and services for achieving high availability. Nevertheless, it's easy to overlook a software component or fail to configure things correctly. Be sure to monitor operations and pay attention to your application's actual availability.
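One simple way to track actual availability is to record the result of a periodic health check and compare the success rate against your target. The Python sketch below assumes you already collect such results as a list of booleans; the function names are my own, and a real monitoring service will give you a much richer picture than this.

```python
from typing import Sequence


def measured_availability(results: Sequence[bool]) -> float:
    """Return the percentage of health checks that succeeded."""
    if not results:
        raise ValueError("no health check results recorded")
    return 100.0 * sum(results) / len(results)


def meets_target(results: Sequence[bool], target_percent: float) -> bool:
    """Compare measured availability against an availability target."""
    return measured_availability(results) >= target_percent


if __name__ == "__main__":
    # Hypothetical data: 10,000 checks with a single failure.
    checks = [True] * 9999 + [False]
    print(f"{measured_availability(checks):.2f}%")
    print(meets_target(checks, 99.9))
```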
In Conclusion
Availability: it's being there for your clientele. Eliminate single points of failure and think
horizontal in your application architecture: redundancy is your greatest tool
for high availability in the cloud. Make your instances stateless, and code for
resiliency. Above all, take advantage of the newer cloud services where high
availability features are built in and automatic. Operations monitoring is essential.