01 Oct 2002
By Nicholas Cravotta, Technical Editor
Fault tolerance isn't about keeping equipment from failing. There's no way you can do that. It's about making the impact of failure as close to zero as possible. Some common synonyms for fault tolerance are "high availability" and "five nines". Various redundant hardware schemes, such as 1-to-1 or 1-to-N redundancy, help improve reliability, but software is also a significant source of single points of failure.
As the team captain of the software side of the design partition, the operating system is in a position to provide a stable foundation and framework for creating reliable systems. In general, an operating system provides the infrastructure for running other applications but doesn't do a lot of processing itself. Its role is to provide windows and access into the internal workings of a system and all of its resources. This role also goes beyond monitoring, to include error recovery and code updates. In reality, the operating system controls only a tiny piece of the overall system in which it resides. For their part, applications control even less. For example, an application or operating system running on a line card has little control over the external resources it uses, such as switches, backplanes, and hard drives.
For this reason, when it comes to fault tolerance, most of the brains reside in middleware: typically, software that runs outside the system components it monitors. The operating system supplies access to system statistics, and fault-tolerant middleware aggregates all the monitoring data of a system in a single place where it can recognize patterns, track trends, resolve problems, and, most important, avoid failures. For the middleware to effectively do its job, however, the operating system and the applications it manages need to open themselves to a little cooperation.
MANAGEMENT FROM THE MIDDLE
Middleware can be passive, in which the applications and resources send system-status data and alarms on a regular basis or when necessary; active, in which the middleware requests status data; or a mix of both. An operating system or application on a line card sees the system only from the perspective of that line card. Middleware can make decisions and trade-offs based on a complete system perspective. For example, a line card might be nearing its "warm" threshold for usage and try to perform load balancing by transferring one or two applications to another resource. However, the middleware knows that every resource is running at this level and so ignores the event.
Using a framework, a developer describes the possible events in a system and the policies for how to react to each event. On a simple level, when an event occurs, middleware implements the corresponding policies. Middleware adds real value in its ability to track and predict failures before they occur. For example, 10 applications using the same hard drive may not individually see enough errors to suggest that the hard drive is becoming unreliable. Middleware, aggregating these errors, can see such a trend and enact a prevention policy, such as switching to a redundant drive.
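The hard-drive example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's API: the class name `ErrorAggregator`, the threshold of 20 errors, and the resource name `disk0` are all hypothetical.

```python
from collections import defaultdict


class ErrorAggregator:
    """Hypothetical middleware-side tally of errors per shared resource."""

    def __init__(self, threshold):
        self.threshold = threshold          # assumed policy value
        self.counts = defaultdict(int)      # resource -> total errors seen
        self.reporters = defaultdict(set)   # resource -> apps that reported

    def report(self, app, resource, errors=1):
        """Record errors one application saw; return True if the policy fires."""
        self.counts[resource] += errors
        self.reporters[resource].add(app)
        return self.counts[resource] >= self.threshold


agg = ErrorAggregator(threshold=20)
fired = False
# Ten applications each see only two errors: too few to alarm individually.
for i in range(10):
    fired = agg.report(f"app{i}", "disk0", errors=2)
print(fired)  # True: the aggregate count reaches the threshold
```

No single application ever crosses the threshold on its own; only the aggregated view does, which is the point at which a prevention policy such as switching to a redundant drive could be triggered.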
Effective middleware lets you dynamically monitor each resource based on its current health, adjusting the kind of information it collects and how often it collects it based on the current perceived risk of the component's failing. A working resource might send only an occasional "heartbeat," which lets management middleware know it's alive and well. When an application fails, it no longer responds to the heartbeat, and the middleware initiates a failover using the last good checkpoint. A heartbeat can be a shell around an application, so that the application is unaware of the heartbeat, or it can become more complex, including a self-audited perception of health. When a resource exhibits atypical behavior, the middleware steps up monitoring so it can evaluate the status with more detail. For example, instead of having only the limited resolution of the resource's being active or failing, the resource can say, "I'm overloaded, and something is wrong." Or it can say, "I'm overloaded, but it's a peak time, and I'm still OK." The more resolution necessary to determine the true state of the resource, the more data the middleware collects.
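A heartbeat monitor with health-dependent polling might look like the following sketch. The class `HeartbeatMonitor`, the 30-second timeout, the status strings, and the polling intervals are assumptions chosen for illustration; time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that tracks the last heartbeat per resource."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before failover
        self.last_seen = {}         # resource -> time of last heartbeat
        self.status = {}            # resource -> self-reported health

    def beat(self, resource, now, status="ok"):
        """Record a heartbeat, optionally with a self-audited status."""
        self.last_seen[resource] = now
        self.status[resource] = status

    def poll_interval(self, resource, normal=10.0, elevated=1.0):
        """Monitor more often when a resource reports atypical behavior."""
        return normal if self.status.get(resource) == "ok" else elevated

    def failed(self, now):
        """Return resources whose heartbeat is overdue."""
        return [r for r, t in self.last_seen.items() if now - t > self.timeout]


mon = HeartbeatMonitor(timeout=30.0)
mon.beat("card1", now=0.0)
mon.beat("card2", now=0.0, status="overloaded")
mon.beat("card2", now=25.0, status="overloaded")
overdue = mon.failed(now=40.0)   # card1 has been silent for 40 seconds
```

Here `card2` is still alive but reports itself as overloaded, so it would be polled more often; `card1` has gone silent and is a candidate for failover from its last good checkpoint.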
Effective monitoring is a question of balance. You can overmonitor a system in much the same way a manager can overmanage. If your manager requires you to report on every action you take on a project, you'll spend more time writing reports than working. The same is true with system monitoring. The more status information you demand and the greater frequency with which you demand it, the more overhead you add to the system.
You can gather system statistics in several ways. The most common method is to create an independent task that runs with the applications you are monitoring. Using a frequency determined by the overall health of the subsystem, the task collects various statistics, may analyze some of them, and sends them to the middleware for processing. Some operating systems support the use of dynamic breakpoints to control monitoring. An application or the operating system itself could insert a breakpoint in its code to call a custom data-collection function. If you judiciously use this method, it can collect valuable data from fault-tolerant-unaware applications with tolerable levels of interference.
Note that you need not send every problem to the middleware. Latency occurs with escalating alarms up to the middleware and awaiting a response. The operating system or the application can itself try to resolve many problems, doing anything from trying an operation again to initiating its own controlled failover to a redundant resource. Of course, the operating system has to have enough information and understanding about the failure to be able to try to resolve it. In many cases, however, it still makes sense for middleware to manage the problem or at least be informed of the problem and its successful correction, given the other features middleware can offer, such as event logging and pattern tracking.
The operating system's role is effectively opening itself to the middleware. If the operating system is a black box, there is little resolution of the operating system's health or the health of the applications it manages. Middleware can better manage an operating system that provides access to the state of its heaps, sockets, processes, messaging infrastructure, and other critical resources. For example, if memory is running low or CPU usage on a line card is too high, the middleware can transfer applications to another line card instead of failing the entire line card and all of its applications. Thus, the operating system becomes just another resource for the middleware to manage.
HARDENED OPERATING SYSTEM
Given the power operating systems have over application software, they play an important role in monitoring, managing, and preventing application failures. A common problem in multiapplication systems is that an error in one application can cause errors in other applications or destroy the integrity of their data. For example, an invalid pointer can result in memory writes to the wrong parts of memory, causing other applications running on the system to fail for no apparent reason and propagating bugs throughout the system. The most common method of protecting applications from overwriting each other or the kernel is a hardware MMU (memory-management unit). Every access to memory goes through the MMU, which determines whether an application has access rights to that memory. In this way, an MMU prevents an application from overwriting the memory of another application. When a segmentation violation occurs, the MMU can notify the operating system so that it can take action, usually to kill the thread and have the parent process that started it restart it.
A surprising number of operating systems do not support an MMU even when it is available. Certainly, there are advantages to a shared-memory architecture: less overhead in communication between applications because they can write directly into each other's data space, less operating-system overhead to manage applications, and less overall complexity. If an application goes awry, however, it can quickly corrupt the entire system, including the operating system, before transparent recovery can occur.
An MMU alone, however, is insufficient for protecting applications from each other or from the kernel. The kernel runs in supervisor mode, which gives it access to all application spaces; if the kernel can somehow become corrupt, the applications and their data are still vulnerable. For this reason, it makes sense to have your best developers working on code that resides in the kernel/supervisor space and your less reliable developers in "userland," where the MMU can protect against most of their errors.
Figure 1 Consider two tasks that share the CPU 50/50 (a). If one of the tasks spawns another task, sharing shifts to 33.3/33.3/33.3 (b). An errant application or application that wants more than its "fair share" of CPU time by running under multiple threads could keep spawning tasks that eventually starve the rest of the system. Splitting the CPU time of the parent alone (50/25/25) can prevent such starvation (c) (courtesy Green Hills Software).
Most operating systems assume that developers of applications understand that other applications will be running in the system and have to cooperate with others in resource allocation. However, an application that abuses its priority causes lower priority applications to suffer. Even applications with the same priority can affect each other. Consider two tasks that share the CPU 50/50 (Figure 1a). If one of the tasks spawns another task, sharing shifts to 33.3/33.3/33.3 (Figure 1b). An errant application, an application that wants more than its "fair share" of CPU time by running under multiple threads, or a virus could spawn tasks that eventually starve the rest of the system. One way in which the operating system can prevent this type of problem is by having spawned tasks split CPU time with their parents rather than with all tasks (Figure 1c). Another way is by triggering an alarm when the number of tasks currently spawned reaches a specified threshold.
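The arithmetic behind Figure 1 can be worked through with exact fractions. The function names are hypothetical, and the rule that a child takes exactly half of its parent's share is an assumption consistent with the 50/25/25 split in the figure, not a description of any particular scheduler.

```python
from fractions import Fraction


def share_equally(tasks):
    """Every task, including newly spawned ones, gets an equal share."""
    return {t: Fraction(1, len(tasks)) for t in tasks}


def spawn_split_parent(shares, parent, child):
    """The child takes half of its parent's share; other tasks are unaffected."""
    new = dict(shares)
    new[parent] = shares[parent] / 2
    new[child] = shares[parent] / 2
    return new


before = share_equally(["A", "B"])               # Figure 1a: 1/2, 1/2
equal = share_equally(["A", "B", "B1"])          # Figure 1b: 1/3 each
split = spawn_split_parent(before, "B", "B1")    # Figure 1c: 1/2, 1/4, 1/4

# Task B misbehaves and spawns eight children.
runaway = before
for i in range(1, 9):
    runaway = spawn_split_parent(runaway, "B", f"B{i}")
a_when_split = runaway["A"]                      # still 1/2
a_when_equal = share_equally(list(runaway))["A"] # 1/10 with ten tasks
```

Under equal sharing, task A's share falls from 1/2 to 1/10 after eight spawns; when children split only their parent's time, A keeps its 1/2 no matter how many tasks B creates.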
Applications can also deprive each other of memory. Although MMUs prevent memory writes between applications, an errant application (or one with a memory leak) could allocate memory until little is left for other applications. An application can protect itself by allocating more memory than it conceivably needs, but this behavior causes the problem it is intended to avoid. For fault-tolerant systems that stay on for years at a time, even slow memory leaks can become major problems. The operating system can detect memory scarcity by triggering an alarm whenever available memory drops below a certain threshold and then taking action. One way to recover memory lost to memory leaks is to occasionally shut down applications, causing a controlled failover to a redundant application. Even though the application loses track of the memory, the operating system should keep track of which blocks of memory are allocated to which applications to correctly program the MMU. When you shut down the application, the system recovers the lost memory, and other applications become free to use it.
Another potential source of failure comes from hackers. If a system is connected to an external network, hackers can attack the system. You may not consider this situation a threat to your system, but, given the expanding scope of networks, it may be worth considering how to protect against attacks. For example, operating systems that put the interface stack within the kernel and, hence, the supervisor space stand the risk of having the kernel corrupted and, thus, having errant code running with full privileges. Moving the stack to its own address space protects the operating system from a corrupted stack. Segregating the stack, on the other hand, may add some overhead and latency to stack operations.
Another "gotcha" involves bounds errors. To avoid the overhead of checking and to speed overall system performance, an operating system may not perform a "bounds check" (that is, a check that a value is within a valid range) on the parameters of incoming system calls from applications. The operating system assumes that the programmer is extremely careful and knows what he or she is doing, submitting only valid parameters to operating-system functions. Although you might be willing to make this assumption about your own code, it's difficult to have such confidence in code that others write. The more code in a system and the more independent development teams a project involves, the greater the chance of a bug's appearing in fielded code. Thus, if someone submits a bad pointer or an invalid value, the operating system isn't protected from the application. One common method of deterring bugs is to enable bounds checking during development and then to shut it off in the field to increase performance. Prudence would suggest enabling bounds checking for all code running in supervisor mode; not doing so fails to fully protect the operating system from applications and vice versa. Errant supervisor code can cause damage; it can corrupt an application or a fault-tolerance mechanism before the application or mechanism has a chance to react, undermining the availability of the system.
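As a rough illustration of what a bounds check on a system call buys you, consider the sketch below. The function `sys_read` and its parameters are hypothetical stand-ins for a system-call entry point; Python is used only for brevity, and its slicing happens to show how an unchecked range can silently return the wrong amount of data instead of failing loudly.

```python
def sys_read(buffer, offset, length):
    """Hypothetical system call that validates its parameters before use."""
    if not isinstance(offset, int) or not isinstance(length, int):
        raise TypeError("offset and length must be integers")
    if offset < 0 or length < 0 or offset + length > len(buffer):
        raise ValueError("requested range lies outside the buffer")
    return bytes(buffer[offset:offset + length])


buf = bytearray(b"0123456789")
ok = sys_read(buf, 2, 4)          # valid request: bytes 2 through 5

try:
    sys_read(buf, 8, 4)           # asks for 4 bytes when only 2 remain
    rejected = False
except ValueError:
    rejected = True

# Without the check, the same request quietly returns a short result.
unchecked = bytes(buf[8:12])
```

The checked call rejects the bad request at the boundary, where the error is easy to attribute; the unchecked path hands back two bytes instead of four and leaves the caller to fail somewhere else later.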
Drivers that run in supervisor mode are likely to cause the most system errors because they have the most system access. Strangely enough, although your application is expected to understand the fault-tolerant framework of the operating system, the drivers that the operating-system vendors supply often do not. To reduce faults caused by drivers, many operating-system vendors are "hardening" their drivers. In some cases, "hardening" means building resource monitoring and the ability to generate events into the drivers. Some drivers go beyond statistics gathering and can check and evaluate their own behavior. The drivers support various levels of failure and may even be able to predict failure. These drivers can reduce failover time by communicating directly with redundant resources, triggering a failover, and then telling the event manager. Thus, they avoid the added latency of telling the event manager about the failure, and the event manager then triggers the failover to the redundant resource.
Several operating-system vendors are touting resource reclamation. The concept is that if you shut down an application, the operating system automatically reclaims all of the application's allocated resources (as opposed to requiring the application to release all resources, which may be impossible if the application is corrupted). Reclaimed resources include memory, message queues, and buffers. Reclamation can employ several methods. An operating system can maintain a registry of resources allocated to an application. Another method is leasing. An application that leases a resource renews its lease before it expires; if the application fails, the lease expires, and the resource is free for another application to use. Having too short a lease means loading the system with lease renewals. Having too long a lease increases the time it takes to reclaim a resource upon failure. Some leasing schemes automatically renew leases whenever an application uses the resource, thus reducing system loading from renewals. (Use of the lease implies a desire to keep the lease active.)
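The leasing scheme, including renewal on use, can be sketched as follows. `LeaseTable`, its method names, and the five-second lease are assumptions made for illustration; a real operating system would tie this to its own resource registry.

```python
class LeaseTable:
    """Hypothetical registry of leased resources with expiry times."""

    def __init__(self, duration):
        self.duration = duration    # lease length in seconds
        self.leases = {}            # resource -> (owner, expiry time)

    def acquire(self, resource, owner, now):
        """Grant the lease unless an unexpired lease already exists."""
        holder = self.leases.get(resource)
        if holder is not None and holder[1] > now:
            return False            # still leased to someone
        self.leases[resource] = (owner, now + self.duration)
        return True

    def use(self, resource, owner, now):
        """Using a resource renews its lease, as in auto-renewing schemes."""
        holder = self.leases.get(resource)
        if holder is None or holder[0] != owner or holder[1] <= now:
            return False            # lease missing, not ours, or expired
        self.leases[resource] = (owner, now + self.duration)
        return True


table = LeaseTable(duration=5.0)
got = table.acquire("queue7", "appA", now=0.0)        # lease runs to t=5
renewed = table.use("queue7", "appA", now=4.0)        # lease now runs to t=9
# appA fails here and stops using the resource.
blocked = table.acquire("queue7", "appB", now=8.0)    # lease not yet expired
reclaimed = table.acquire("queue7", "appB", now=9.5)  # lease has lapsed
```

The gap between appA's failure and the moment appB can take the queue is bounded by the lease length, which is the trade-off the text describes: a shorter lease reclaims faster but costs more renewals.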
Another way to protect resources is to monitor them. Some operating systems offer hooks into the kernel to monitor system parameters, such as CPU usage, number of processes running, and available memory. A developer can set thresholds for each parameter. These thresholds trigger events when something exceeds the threshold. Checking a threshold has a cost, however: checking the memory threshold every time an allocation is made, for example, would significantly impact system performance, so the frequency of checking is adjustable. Typically, monitoring takes place as an independent task that queries the operating system at a set frequency to check the various thresholds in place. One problem with passive monitoring is that response time to a failure (from delayed discovery to delayed action taken) is longer than if an application signals the middleware itself.
When a threshold is exceeded, the monitoring task triggers an event in the kernel. Some operating systems support a publish-and-subscribe model that allows the various applications and tasks to subscribe to those events they need to know about. For example, if the system is running out of memory, an application can free up or scale back usage of memory. Thus, a single event could result in the sending of several alarms, allowing the system to handle the event on many levels.
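One pass of a threshold-checking task feeding a publish-and-subscribe dispatcher might look like this. The names `EventBus` and `check_thresholds`, the event-naming convention, and the limit values are all hypothetical; in a real system the statistics would come from kernel hooks rather than a dictionary.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical publish-and-subscribe dispatcher."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # event name -> callbacks

    def subscribe(self, event, callback):
        self.subscribers[event].append(callback)

    def publish(self, event, value):
        for callback in self.subscribers[event]:
            callback(event, value)


def check_thresholds(stats, limits, bus):
    """One pass of the monitoring task: publish an event per exceeded limit."""
    for name, limit in limits.items():
        if stats[name] > limit:
            bus.publish(name + "_high", stats[name])


log = []
bus = EventBus()
# Two independent subscribers react to the same event at different levels.
bus.subscribe("memory_used_high", lambda e, v: log.append(("app: shrink cache", v)))
bus.subscribe("memory_used_high", lambda e, v: log.append(("middleware: record", v)))

limits = {"memory_used": 0.90, "cpu_load": 0.95}
check_thresholds({"memory_used": 0.93, "cpu_load": 0.40}, limits, bus)
```

A single exceeded threshold produces two notifications: the application scales back its memory use while the middleware records the event for trend tracking.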
GETTING THE WORD OUT
An important function of an operating system is managing communication between applications, as well as with the operating system and middleware. With MMU support, message passing becomes more complex than simply writing in another application's memory space. Messages must pass through code (usually, the kernel running in supervisor mode) to cross MMU-enforced memory-space boundaries. The messaging mechanism is protocol-agnostic: it cares not what the message says, only that it is encapsulated in the proper format. Security of messages is often left to the applications.
Having a messaging infrastructure in place can enhance the reliability of fault-tolerance mechanisms. For example, writing directly between applications, especially ones running on different hardware, becomes a single point of failure if the link between the two applications goes down. A messaging infrastructure can provide and manage a redundant communication path, as well as confirm that messages are received. Messaging redundancy also improves the reliability of successful checkpoint collection and alarm escalation.
A messaging system can also abstract the messaging process from a physical-port to a logical-port basis. Thus, an application need not be aware of the several available communication ports over which it can send messages, or what path a message will take. Centralized systems need redundant point-to-point connections; decentralized systems need alternate paths across the system. You can set up redundant/alternate paths across different media (in-band, serial, InfiniBand, control bus, and others) to eliminate the communication channel as a single point of failure and ensure that messages get through. As you remove and add ports, the messaging system handles all rerouting and redundant path setup.
By passing all I/O through the messaging framework, the operating system can better monitor the health of an application. If an application is sending messages, then it is still active, and a heartbeat request to see whether the application is alive may be superfluous. Also, applications no longer have to handle the details of port status. Instead of an application's implementing a time-out scheme for a port, the messaging framework can attempt to repair the problem; reroute the message; pass an alarm back up to the application if the message cannot be sent, so the application isn't hung up waiting for a response that is never coming; or even initiate an alarm up to the middleware. Remote debugging and monitoring become easier as well; the messaging framework can log or mirror all messages sent to an application for later evaluation.
A framework that supports a subscription model allows any application to subscribe to events, such as a hot swap or a controlled failover, and receive appropriate messages. For example, a fault-tolerant-aware application may want to shut down cleanly and transfer tasks to another application running on different hardware before you remove the hardware or shut down the application. If the application cannot shut down cleanly, it can then escalate the hot-swap alarm it received up to the middleware with a description of the problem it is having. Another example is that the middleware may suspect that a resource is in danger of failure. It can then subscribe to all events relating to the resource to make a more detailed and direct evaluation of its own.
Note that the operating system rarely dictates fault-tolerance policies. The operating system provides a means of monitoring the system state and registering events or alarms to notify applications and middleware. In most cases, middleware manages and implements policies for each event or alarm. Applications that are fault-tolerant-aware can first attempt to solve their own problems before escalating the problem up to the middleware.
Getting all the components in a system to work together and create a compatible fault-tolerance infrastructure can be challenging. With proprietary systems being the norm, it can be difficult to get equipment from multiple vendors to work together. To help ease this integration problem, the Service Availability Forum is working on several interface standards. Until these standards are ready, however, you need to determine which middleware vendors to support. Note that, although the brochure might list many features that the middleware supports, the vendor may offer only a subset of those services for any platform. Also, if your operating system of choice doesn't support a feature, the middleware won't effectively support it either.
USERLAND
The biggest challenges to fault tolerance are the applications themselves. Applications are your biggest unknown, and even the operating system doesn't have control over them. Applications can run on a fault-tolerance framework, but they are then more like black boxes with little resolution of their actual health beyond "active" and "failed." Many effective fault-tolerant mechanisms require awareness of the mechanisms to use them. In other words, applications need to participate in the monitoring of system health.
Applications fail for a variety of reasons, including hardware failure. The simplest method for recovery is to execute a cold boot: Simply restart the application as if someone had just turned on the power. If the application restarts quickly and can again become active in the system, this method may be sufficient. For many applications, however, the loss of data and interruption of service make this strategy a last resort.
A more common approach is "state checkpointing," or "checkpointing," transparently tracking the state of an application by mirroring its memory resources and register state. When an application runs, it has a certain amount of stateful information, including buffers, processing in progress, tables built, and so on. An application manager regularly collects this information, but not so regularly as to overload the system. Depending upon the footprint, too frequently "setting" a checkpoint can create significant overhead. If the current application fails, a redundant application can resume operation from the last good checkpoint.
One problem with passive checkpointing is that it loses all processing and messages received between the last good checkpoint and the failure; checkpoints immediately become out of date. Users may also perceive a glitch in service as the redundant application resynchronizes with the rest of the system by responding to messages that it has already responded to or not responding to messages that were confirmed as received (but lost during the failover).
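A toy model makes the loss window concrete. The application class `TallyApp` and the helper functions are hypothetical; the checkpoint here is simply a deep copy of a state dictionary, whereas a real checkpoint would mirror memory and register state.

```python
import copy


class TallyApp:
    """Toy stateful application: sums the messages it receives."""

    def __init__(self):
        self.state = {"total": 0, "handled": 0}

    def handle(self, value):
        self.state["total"] += value
        self.state["handled"] += 1


def take_checkpoint(app):
    """Copy the application's state so later changes don't alter the copy."""
    return copy.deepcopy(app.state)


def restore(checkpoint):
    """Start a redundant instance from the last good checkpoint."""
    standby = TallyApp()
    standby.state = copy.deepcopy(checkpoint)
    return standby


primary = TallyApp()
for v in (1, 2, 3):
    primary.handle(v)
checkpoint = take_checkpoint(primary)   # state after three messages
for v in (4, 5):
    primary.handle(v)                   # processed after the checkpoint
# The primary fails here; the standby resumes from the checkpoint.
standby = restore(checkpoint)
lost = primary.state["handled"] - standby.state["handled"]
```

The standby comes up with the state as of the third message; the two messages handled afterward are gone unless something else, such as a message queue, can replay them.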
Figure 2 With replica-based redundancy, an application need not know that it has been replicated. All messages to the application are broadcast to a pool of replicas via a gateway that intercepts all of an application's messages. Each replica accordingly acts on messages in exactly the same way. The gateway passes outgoing messages only from the "current" replica. Failover is as simple as changing which replica is considered current. A thin engine between gateway and application can queue messages for delayed processing (courtesy Eternal Systems).
An application that is aware of fault-tolerant mechanisms can reduce the impact of failover by actively participating in the checkpoint process. For example, an application that is aware of checkpointing can manage its own mirroring, providing a map to that information it needs to re-create its state. This situation reduces the footprint of the checkpoint, as well as the overhead to capture it, enabling middleware to more frequently take the checkpoint. Additionally, the application could have a mechanism for improving failover response, such as sending a partial checkpoint to update the current checkpoint before shutting down, as well as a mode in which the application recognizes that it is trying to recover from an out-of-date checkpoint.
Currently, the highest level of off-the-shelf software redundancy available comes from replica-based frameworks that run just below the middleware layer. The system communicates with the application through the messaging system (Figure 2). All messages to the application are actually broadcast to a pool of replicas of the application via an application controller or gateway that intercepts all of an application's messages. Each replica runs on a different CPU, and each receives the incoming message and takes action in exactly the same way. For outgoing messages, each replica sends a message to the gateway. Because each message should be exactly the same with only a latency difference, the gateway then sends out one copy of the message to be processed. Failover is as simple as changing which replica is considered "current." Note that this type of redundancy protects applications against even random memory errors or undetected hardware failure. At some point, the error will manifest itself in a message to an external resource; this message differs from the messages other replicas send. You can then employ majority voting to resolve the error: The gateway compares the results from all replicas and terminates those replicas in the minority, along with their errors.
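Majority voting at the gateway can be sketched as below. The function `vote`, the replica names, and the message contents are hypothetical, and the sketch assumes replicas are deterministic so that healthy replicas emit byte-identical messages.

```python
from collections import Counter


def vote(outputs):
    """Return the majority output and the replicas that disagreed with it.

    `outputs` maps replica name -> outgoing message. Raises ValueError if
    no message has a strict majority.
    """
    tally = Counter(outputs.values())
    message, count = tally.most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    minority = sorted(r for r, m in outputs.items() if m != message)
    return message, minority


# Replica r3 has suffered an undetected memory error.
message, minority = vote({"r1": b"ACK 42", "r2": b"ACK 42", "r3": b"ACK 43"})
```

The gateway would forward `message` once and terminate the replicas listed in `minority`. With only two replicas that disagree, there is no majority and the function raises, which is why voting needs at least three replicas to mask a fault.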
One advantage of replication is that, because all interaction between the application and resources external to the application takes place through the messaging mechanism, the application need not know that it has been replicated, and the replicas need not know that they are the "current" active versions. One disadvantage is the framework may significantly affect the latency of an application, given that an application can no longer directly interact with resources but has to work through the messaging infrastructure and gateway. Additionally, replicas may not be colocated in the same physical location. (For maximum fault-tolerance, they shouldn't be.) This situation results in longer latencies, depending on the speed and reliability of the communication infrastructure over which messages pass. Also, consider that the replica framework needs to be redundant so that it is not a single point of failure.
To reduce latency, the replica framework can drop down to a less robust architecture in which the gateway manages a single "current" replica of the application for efficiency and queues incoming and outgoing messages for replicas for later processing and comparison. If the current replica fails, the system stalls until a redundant replica can resynchronize its state by processing its message queue. Compared with a standard checkpointing system, which can lose all messages and processing since the last checkpoint, this method loses no messages; neither the application nor the service is aware of the failover, only of a slight delay during resynchronization.
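The queued variant can be modeled with a backlog that the standby replays at failover. `Replica`, `Gateway`, and the message strings are hypothetical, and the sketch assumes processing is deterministic so that replaying the same messages reproduces the same state.

```python
from collections import deque


class Replica:
    """Toy deterministic application: logs each message in upper case."""

    def __init__(self):
        self.log = []

    def process(self, msg):
        self.log.append(msg.upper())


class Gateway:
    """Hypothetical gateway: one current replica, one queued standby."""

    def __init__(self):
        self.current = Replica()
        self.standby = Replica()
        self.backlog = deque()      # messages the standby has not yet seen

    def deliver(self, msg):
        self.current.process(msg)
        self.backlog.append(msg)    # queued for delayed processing

    def failover(self):
        """Replay the backlog so the standby catches up, then promote it."""
        while self.backlog:
            self.standby.process(self.backlog.popleft())
        # A fresh standby would need its own state transfer; omitted here.
        self.current, self.standby = self.standby, Replica()


gw = Gateway()
for m in ("open", "write", "close"):
    gw.deliver(m)
expected = list(gw.current.log)
gw.failover()                       # the stall lasts as long as the replay
```

After failover the promoted replica holds exactly the state the failed one had, so no messages are lost; the cost is the delay while the backlog drains.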
Applications can also assist with monitoring a system's health by opening hooks into themselves via an API to the middleware. Middleware provides a set of services to help enable applications to be fault-tolerant. But, in many cases, the application must be aware of the services to use them. At a basic level, the application can simply let management middleware know that it is alive and well through a heartbeat.
WHAT GOES WHERE?
The application can also be more active about failures. Although the operating system or middleware can handle many of the details of a failover, an application itself can best handle the details of failing over. When an application encounters a problem that it cannot resolve, it can escalate an alarm, along with detailed information about the problem, including what steps it has taken. On the flip side, the application can register for certain events, such as "warm-memory threshold reached" or "imminent shutdown," and take action that either mitigates the problem or reduces its impact.
Building a system using an existing fault-tolerant framework can help flesh out many of the issues you need to address to provide reliable fault tolerance. In the end, however, you still have to determine what events to watch for, write the code to recognize those events, and script the policies that define how to handle them. Although the middleware vendors may say that it's simple to create policy files, you still have to understand what policies to create.
For example, it's one thing to define a checkpoint and another to determine what you really need to do to make checkpointing effective. Additionally, an application may be corrupt for some time before the corruption manifests itself. But how do you define corruption? Even more important, how do you implement a test of that definition? A more robust application would occasionally perform a self-check to discover corruption before it manifests itself. How far do you need to go to keep your system up and running, and what level of service interruption can you tolerate?
Implementing fault tolerance means understanding all the potential single points of failure in a system: both hardware and software, and both physical and logical. The operating system provides statistics, and middleware collects them, but neither can tell you what the statistics mean. You still have to define what constitutes an event and what action to take. If you want to prevent failures, you have to be able to describe and recognize the first signs of potential failure. You also need to account for the dynamic aspects of your system: The signs of failure during peak usage differ vastly from the signs during slow times. Also consider the added dimension of how you handle errors (with the application, operating system, or middleware) and how you escalate alarms from one level to the next. And don't forget the need to devise recovery modes for all the types of failure you didn't think of but have to account for.
Again, the operating system and middleware cannot provide the specifics of your fault-tolerant events and policies; you can't get away from being an expert on the fault tolerance of your own design. Many vendors provide documentation and example frameworks that may give you ideas about ways to improve reliability. However, if the operating system is too rigid and gives you insufficient hooks to deeply monitor your system, the mechanisms you implement will have only limited effectiveness.
You can contact Technical Editor Nicholas Cravotta at (1) 510-558-8906, fax (1) 510-558-8914, e-mail ednnick@pacbell.net.