Not everyone is clear on the distinctions between grid computing and cloud computing, so let's begin with a brief explanation of each. While grid computing and cloud computing are not the same thing, there are many synergies between them and using them together makes a lot of sense.
Grid computing is about tackling a computing problem with an army of computers working in parallel rather than a single computer. This approach has many benefits:
- Time savings: a month of processing work for a single computer could be achieved in a single day if you had 30 computers dedicated to the problem. The largest grid computing project in history, SETI@home (the Search for Extraterrestrial Intelligence project), has logged 2 million years of aggregate computer processing time in only 10 years of chronological time by leveraging hundreds of thousands of volunteer computers.
- Less expensive resources: you can get work done with less expensive machines instead of buying large servers with top-grade processors and memory. Granted, you have to buy more of them, but the smaller, cheaper machines are more easily repurposed for other uses.
- Reliability: a grid computing system has to anticipate the failure or changing availability of individual computers and not let that prevent successful completion of the work.
Not all types of work lend themselves to grid computing. In a grid computing system, the work to be done is divided into smaller tasks, and a loosely coupled network of computers works on the tasks in parallel. Smart infrastructure is needed to distribute the tasks, gather the results, and manage the system.
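To make the divide, distribute, and gather idea concrete, here is a minimal single-machine sketch in Python. It is only an illustration: the function names are hypothetical, a thread pool stands in for the network of computers, and summing squares stands in for real work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(items, task_size):
    """Divide the full workload into smaller, independent tasks."""
    return [items[i:i + task_size] for i in range(0, len(items), task_size)]


def run_task(task):
    """Stand-in for real work: sum the squares of the numbers in one task."""
    return sum(n * n for n in task)


def run_grid(items, task_size, worker_count):
    """Distribute the tasks to workers, then gather and combine the results."""
    tasks = split_into_tasks(items, task_size)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partial_results = list(pool.map(run_task, tasks))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_grid(list(range(1000)), task_size=100, worker_count=4))
```

The key property is that each task can be computed without knowing anything about the others, which is what makes the work a candidate for a grid.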
Not surprisingly, the early adopters of grid computing have been those who needed to solve mammoth computing problems. Thus you see grid computing applied to such things as genetics, actuarial calculations, astronomical analysis, and film animation rendering. But that's changing: grid computing is getting more and more scrutiny for general business problems, and the onset of cloud computing is going to accelerate that. Computing tasks do not have to be gargantuan to benefit from a grid computing approach, nor are compute-intensive tasks the only kind of work eligible for grid computing. Any work that has a repetitive nature to it is a good candidate for grid computing. Whether you're a Fortune 500 corporation that needs to process 4 million invoices a month or a medium-sized business with 1,000 credit applications to approve, grid computing may well make sense for you.
Grid computing is a decade older than cloud computing, so much of today's grid computing naturally doesn't use a cloud approach. The most common approaches are:
- Dedicated machines: purchase a large number of computers and dedicate them to grid work.
- Network cycle stealing: repurpose other machines in your organization for grid work when they are idle, such as overnight. A business workstation by day can be a grid worker at night.
- Global cycle stealing: apply the cycle stealing concept at worldwide scale over the Internet. This is how the SETI@home project works, with over 300,000 active computers.
Cloud computing allows for an alternative approach to grid computing that has many attractive characteristics: it offers a flexible business model in which you scale up or down as you wish, and it already provides much of the supporting infrastructure that traditionally has had to be custom-developed.
Cloud Computing and Microsoft's Azure Platform
Cloud computing is about leveraging massive data centers with smart infrastructure for your computing needs. Cloud computing spans application hosting and storage, as well as services for communication, workflow, security, and synchronization. Benefits of cloud computing include the following:
- On-demand scale: you can have as much capacity as you need, virtually without limit.
- Usage-based billing: a pay-as-you-go business model where you only pay for what you use. There is no long-term commitment and you are not penalized if your level of usage changes.
- No up-front costs: no need to buy hardware or keep it maintained or patch operating systems. Capital expenditures are converted into operating expenditures.
- No capacity planning needed: you don't need to predict your capacity, as you have the ability to adjust how much resource you are using at will.
- Smaller IT footprint and less IT headache: capabilities such as high availability, scalability, storage redundancy, and failover are built into the platform.
Microsoft's cloud computing platform is called Azure, and currently it consists of 4 primary service areas:
- Windows Azure provides application hosting and storage services. Application hosting means running software such as web applications, web services, or background worker processes "in the cloud"; that is, in a cloud computing data center. Applications are load-balanced, and run as many instances as you wish with the ability to change the number of instances at a moment's notice. Cloud storage can provide file system-like Blob storage, queues, and data tables.
- SQL Data Services provides a full relational database in the cloud, with largely the same capabilities as the SQL Server enterprise product.
- .NET Services provides enterprise readiness for the cloud. Service Bus interconnects multiple locations or organizations with publish-subscribe messaging over the Internet. Access Control Service provides enterprise and federated security for applications. Workflow Service can execute workflows in the cloud.
- Live Services provides a virtual desktop, data and web application synchronization across computers and devices, and a variety of communication and collaboration facilities whose common theme is social networking.
Azure is new; at the time of this writing, it is in a preview period with a commercial release expected by end of year 2009.
Putting Grid Computing and Azure Cloud Computing Together
Azure is designed to support many different kinds of applications and has no specific features for grid computing. However, Azure does provide much of the functionality needed in a grid computing system. Making Azure a great grid computing platform requires only the right design pattern and a framework that provides grid-specific functionality. We'll look at the design pattern now and in Part 2 we will explore a framework that supports this pattern.
The first thing you'll notice about this pattern is that there is some software/data in the Azure cloud and some on-premise in the enterprise. What goes where, and why?
- The cloud is used to perform the grid computing work itself. The use of cloud resources is geared to be temporary and minimize cost. When you're not running a grid computing solution, you shouldn't be accruing charges.
- The enterprise is the permanent location for data. It is the source of the input data needed for the grid to do its work and the destination for the results of that work.
The software actors in this pattern are:
- Grid Worker: The grid worker is cloud-side software that can perform the task(s) needed for the grid application. This software will be run in the cloud as a Worker Role in multiple instances. The framework uses a switch statement arrangement so that any grid worker can perform any task requested of it. Grid workers run in a loop, reading the next task to perform from a task queue, executing the task, and writing results to a results queue. When a grid worker has no more queue tasks to run, it requests to be shut down.
- Grid Manager: the grid manager is enterprise-side software that manages the job runs of grid computing work. There are 3 components to the grid manager:
o Loader: The loader's job is to kick off a grid application job run by generating the tasks for the grid workers to perform. The loader runs in the enterprise in order to access on-premise resources such as databases for the input data that needs to be provided for each task. When the loader runs, the tasks it generates are written to a Task Queue in the cloud.
o Aggregator: the aggregator reads results from the results queue and stores them in a permanent location on-premise. The Aggregator also realizes when a grid application's execution is complete.
o Console: the console is a management facility for configuring projects, starting job runs, and viewing the status of the grid as it executes. It can provide a view similar to a flight status display in an airport, showing tasks pending and tasks completed.
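The roles described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the framework's code: in-memory queue.Queue objects stand in for the cloud queues, a dictionary of handlers plays the part of the switch statement, and the task types, handlers, and field names are all hypothetical.

```python
import queue


# Hypothetical task handlers; a real project would supply its own.
def approve_credit(payload):
    return {"applicant": payload["applicant"], "approved": payload["score"] >= 650}


def total_invoice(payload):
    return {"invoice": payload["invoice"], "total": sum(payload["lines"])}


# Dispatch table playing the role of the switch statement:
# any worker can perform any task type requested of it.
TASK_HANDLERS = {"credit": approve_credit, "invoice": total_invoice}


def loader(records, task_queue):
    """Generate one task per input record and write it to the task queue."""
    count = 0
    for count, (task_type, payload) in enumerate(records, start=1):
        task_queue.put({"id": count, "type": task_type, "payload": payload})
    return count


def grid_worker(task_queue, results_queue):
    """Read, execute, and report tasks until the task queue is empty."""
    processed = 0
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return processed  # nothing left to do: the worker can shut down
        handler = TASK_HANDLERS[task["type"]]
        results_queue.put({"task_id": task["id"], "result": handler(task["payload"])})
        processed += 1


def aggregator(results_queue, store):
    """Drain the results queue into a permanent store (here, a plain dict)."""
    while not results_queue.empty():
        item = results_queue.get_nowait()
        store[item["task_id"]] = item["result"]
```

In the real pattern the loader and aggregator run on-premise while many workers run in the cloud at once; here everything runs in one process purely to show how the pieces hand work to each other.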
The data actors in this pattern are:
- Task Queue: this is a queue in cloud storage that holds tasks. The Loader in the enterprise writes its generated tasks to this queue. Grid workers in the cloud read tasks from this queue and execute them.
- Results Queue: this is a queue in cloud storage that holds results. Grid workers output the results of each task to this queue. The Aggregator running in the enterprise reads results from this queue and stores them durably in the enterprise.
- Tracking Table: this is an enterprise-side database table that tracks tasks and their status. Records are written to the tracking table by the Loader and updated by the Aggregator as results are received. The tracking table enables the console to show grid status and allows the system to realize when a grid application has completed.
- Enterprise Data: the enterprise furnishes data stores or services that supply input data for tasks or receive the results of tasks. This is organization and project-specific; the code written in the Loader and the Aggregator integrates with these data stores.
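As an illustration of how a tracking table might support both the console view and completion detection, here is a small Python sketch. It assumes an in-memory SQLite database as a stand-in for the enterprise database; the table layout, column names, and function names are hypothetical.

```python
import sqlite3


def create_tracking_table(conn):
    """Create a minimal tracking table: one row per task."""
    conn.execute(
        "CREATE TABLE tracking (task_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )


def record_task(conn, task_id):
    """Called by the Loader when it queues a task."""
    conn.execute(
        "INSERT INTO tracking (task_id, status) VALUES (?, 'pending')", (task_id,)
    )


def record_result(conn, task_id):
    """Called by the Aggregator when the result for a task arrives."""
    conn.execute(
        "UPDATE tracking SET status = 'completed' WHERE task_id = ?", (task_id,)
    )


def grid_status(conn):
    """Return the pending and completed counts the Console could display."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM tracking GROUP BY status"
    ).fetchall()
    counts = {"pending": 0, "completed": 0}
    counts.update(dict(rows))
    return counts


def job_complete(conn):
    """The job is done when at least one task exists and none are pending."""
    status = grid_status(conn)
    return status["pending"] == 0 and status["completed"] > 0
```

Because the table lives on-premise, the console can poll it for status without touching cloud storage.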
Walk-through: Creating and Executing a Grid Computing Application on Azure
Let's put all of this together and walk through how you would develop and run a grid computing application from start to finish using this pattern and a suitable framework:
1. A need for a grid computing application is established. The tasks that will be needed, input data, and results destinations are identified.
2. Using a framework, developers add the custom pieces unique to their project:
- A Grid Worker (Azure Worker Role) is created from a template and code is added to implement each of the tasks.
- A Loader is created from a template and code is added to implement reading input data from local resources, generating tasks, and queuing them to the Task Queue.
- An Aggregator is created from a template and code is added to implement receiving results from the Results Queue and storing them on-premise.
3. Azure projects for application hosting and storage are configured using the Azure portal. The Grid Worker package is deployed to cloud hosting, tested, and promoted to Production.
4. Using the Grid Console, the grid job run is defined and started. This starts the Loader running.
5. The Loader reads local enterprise data and generates tasks, writing each to the Task Queue.
6. The Grid Worker project in the Azure portal is started, which spawns multiple instances of Grid Workers.
7. Each Grid Worker continually receives a new task from the Task Queue, determines the task type, executes the appropriate code, and sends the task results to the Results Queue. The way Azure queues work is very useful here: if a worker has a failure and crashes in the middle of performing a task, the task will reappear in the queue after a timeout period and will get picked up by another Grid Worker.
8. The Aggregator reads results from the Results Queue and writes them to local enterprise storage.
9. While the grid is executing, administrators can use the Console to watch status in near real-time as Grid Workers execute tasks.
10. When the Aggregator realizes all scheduled tasks have been completed, it provides notification of this condition via the Console. At this point, the grid has completed its work and its results are safely stored in the enterprise.
11. The Grid Workers are suspended via the Azure portal to avoid incurring any additional compute-time charges. Cloud storage is already empty, as all queues have been fully read, so no additional storage charges will accrue.
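The queue behavior mentioned in step 7, where a task reappears if its worker crashes, can be imitated in a few lines of Python. This is an in-memory sketch of the idea only, not the Azure queue API: the class name, method names, and the explicit `now` parameter (used in place of a real clock) are assumptions made for illustration.

```python
import itertools


class VisibilityQueue:
    """In-memory imitation of a queue with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible again]
        self._ids = itertools.count(1)

    def put(self, body):
        """Add a message that is visible immediately."""
        self._messages[next(self._ids)] = [body, 0.0]

    def get(self, now):
        """Return (message_id, body) for the first visible message, or None."""
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                # Hide the message from other workers until the timeout expires.
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        """A worker deletes a message only after it has finished the task."""
        self._messages.pop(message_id, None)


q = VisibilityQueue(visibility_timeout=30)
q.put("task-1")
first = q.get(now=0)    # worker A takes the task, then crashes before deleting it
hidden = q.get(now=10)  # still within the timeout: no task is visible
retry = q.get(now=31)   # the timeout has passed: the task reappears for worker B
q.delete(retry[0])      # worker B finishes the task and deletes the message
```

The design choice to note is that reading a message does not remove it. Removal is a separate step taken only after the work succeeds, so a crashed worker costs some time but never loses a task.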
Value-Add of Azure for Grid Computing
The Azure platform does good things for grid computing, both technically and financially:
- Cost Conscious: the use of cloud-hosted applications avoids the need to purchase computers for grid computing. Instead, you pay monthly for the Grid Worker compute time and queue storage you use. The design eliminates ongoing costs for compute time or storage time once a grid application has completed processing.
- Scalability and Flexibility: You can have as much capacity as you want or as little as you want. Your grid computing application can run on as small a footprint as a single Grid Worker instance.
- Reliability: The reliability mechanism built into Azure Queues ensures all tasks get performed even if a Grid Worker crashes. If a Grid Worker does crash, the Azure Fabric will start up a replacement instance.
- Coordination: The Worker Role-queue mechanism is simple, load balanced, and works well. Using it avoids the need to write complex coordination software.
- Simplicity: this pattern for grid computing on Azure has simplicity at its core. Roles are well-defined, no element of the software is overly complex, and the number of moving parts is kept to a minimum.
In Part 2, we'll see how this pattern is implemented in code using a grid computing framework developed for Azure.