So you want 5 9’s?
An essay by Warren Montgomery (wamontgomery@ieee.org)
Many computer hardware, software, application, and integration companies are beginning to advertise solutions that offer very high availability. More than a few are beginning to claim “5 9’s” or 99.999% availability, the traditional benchmark of the core elements of the public telephone network. Be aware that actually achieving availability anywhere close to this level is a lot more complicated than simply buying the right server or software package, and many of the claims being made today are based on very limited models that ignore many things that could leave your applications out of service. This essay explores what 5 9’s really means and what is really required to achieve it.
99.999% availability has been used as a benchmark by the telecommunications industry in specifying high availability for many years. This translates to a system being unavailable for less than about 5 minutes out of every year, a truly high standard. In fact the telecom benchmark is actually better than this, less than 3 minutes per year of downtime, and specifications for new systems are now being written for availability exceeding “6 9’s”.
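The arithmetic behind these figures is easy to check. The following is a minimal sketch, assuming a year of 365.25 days; the function name is only illustrative. It converts a number of 9’s into the downtime allowed per year.

```python
# Minutes in a year, assuming 365.25 days (an assumption of this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` 9's."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR

# Print the yearly downtime budget for 2 through 6 nines.
for n in range(2, 7):
    print(f"{n} 9's: {downtime_minutes_per_year(n):10.2f} minutes/year")
```

For 5 9’s this gives a little over 5 minutes per year, and for 6 9’s roughly half a minute.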
In the case of communication systems, availability usually translates to the ability to initiate new calls and have them correctly routed to the desired endpoint. Thus an availability of 5 9’s means the network is functioning correctly 99.999% of the time. Networks and applications can be unavailable for many reasons. In general, faults (errors in hardware or software, environmental conditions like loss of power, or human error) cause failures, which leave the system unable to perform correctly. Failures are detected either automatically or by human intervention, and the system initiates recovery actions which restore service using spare or repaired equipment. The system is usually unavailable during the time between the failure and the completion of the recovery activity. Thus getting very high availability depends on many things:
If a system fails to do any of these things, the system availability will suffer.
A question that often arises is what kinds of faults and failures are counted against the availability of a system. For some systems, “planned” downtime may be acceptable, meaning that the system can be taken out of service for repair or upgrade and it will not be counted as unavailable. This is typical of systems like aircraft controls which do not need to function when the airplane is not in the air. For many telecommunications systems and defense applications this is not the case, and anything which interrupts service is counted as unavailability. Many system providers do not design for this kind of continuous availability requirement, considering it acceptable if the system is “5 9’s available” while it is in service, but requiring that the system be periodically taken out of service for repair or upgrade.
Another way to look at what is included is to consider what kinds of faults and failures occur. A computer system requiring external power will clearly be taken out of service if that power supply fails, and even if it has battery backup capability it can only survive for a limited time without an external power source. Likewise failures of cooling or ventilation systems, communication links, or even collapse of the building that houses a system can cause it to become unavailable. Are these kinds of failures within the scope of a claim of “5 9’s”? Not for the computer or software manufacturer, who cannot control these kinds of events, but if you are trying to support an application that is required to be continuously available, they are within the expectations of your customers.
The implication is that the availability of your complete system will often depend on many more factors than the high availability systems, software, and applications you install, and you have to analyze the complete picture, up to and including spare systems in another location to take over. (No high availability hardware or software systems within the World Trade Center survived the terrorist attack, yet most firms with offices there had foreseen this kind of problem and had backup capabilities at other locations that allowed them to maintain availability.) It also means that the time to detect and recover from these kinds of failures counts against your availability, and thus to achieve “5 9’s” overall, you may need to demand even higher availability of the underlying components.
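To see why the components must do better than the overall target, consider a simplified model in which the service is up only when every component is up and the components fail independently. Both conditions are assumptions of this sketch, and the availability figures in it are hypothetical rather than measured.

```python
from math import prod

def series_availability(component_availabilities):
    """Availability of a system that needs every component working,
    assuming the components fail independently of one another."""
    return prod(component_availabilities)

def required_component_availability(target, n_components):
    """Availability each of n identical, independent components must reach
    so that the whole chain meets `target`."""
    return target ** (1.0 / n_components)

# Hypothetical example: a 5 9's server behind 4 9's power and a 4 9's network link.
overall = series_availability([0.99999, 0.9999, 0.9999])
print(f"overall availability: {overall:.6f}")

# What each of four equal components needs for 5 9's overall.
print(f"per-component requirement: {required_component_availability(0.99999, 4):.7f}")
```

In this model the chain is always less available than its weakest member, so a 5 9’s server does not by itself give a 5 9’s service.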
First, be assured that systems do achieve it. Data on the telephone network in the United States reveals that the switching systems (dominantly made by Lucent Technologies and Nortel Networks) actually outperform their availability requirements. This was not an easy milestone and took time to achieve.
High availability systems will typically have a design limit for availability based on the fault detection capability and the spare capacity available, and come closer and closer to meeting that design target over time, as more underlying faults are discovered and repaired. This is a process that takes hundreds or even thousands of years of cumulative experience in the field, and a dedication to understanding and analyzing the failures and faults. To understand why, consider that for each kind of fault and resulting failure, the time to detect and repair will be different. It would be very convenient if the time to detect and repair were constant or subject to an upper limit, but unfortunately this is not the case. Some failures take either a long time to detect and diagnose or a long time to repair. Worse yet, you can examine the contribution of each type of failure to the total downtime of the system, and what you discover in the case of telecommunications (and, I expect, any other application area) is that the very rare failures with very long repair times contribute as much or more total downtime than the failures that are detected and repaired rapidly. The lore of telephone systems is full of horror stories about corrupted backup tapes, failed batteries, killer cable cuts, and other extremely rare events that caused outages lasting for days. Fortunately for us, most failures are detected and repaired much more quickly.
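The contribution of each kind of failure to total downtime is simply how often it happens multiplied by how long each occurrence lasts. The sketch below illustrates the effect with three failure classes whose rates and durations are invented for illustration; they are not field data from any telephone system.

```python
# Hypothetical failure classes:
# (name, events per system-year, minutes of outage per event).
# These numbers are made up for illustration only.
FAILURE_CLASSES = [
    ("hardware fault with automatic switchover", 2.0, 0.5),
    ("software fault cleared by restart", 0.5, 3.0),
    ("rare long outage (e.g. corrupted backup)", 0.01, 240.0),
]

def downtime_contributions(failure_classes):
    """Return expected downtime in minutes per year for each failure class."""
    return {name: rate * minutes for name, rate, minutes in failure_classes}

contributions = downtime_contributions(FAILURE_CLASSES)
total = sum(contributions.values())
for name, minutes in contributions.items():
    print(f"{name:45s} {minutes:5.2f} min/year ({minutes / total:4.0%} of total)")
```

With these made-up numbers, the class that occurs once per hundred system-years accounts for the largest share of the downtime, even though the other two classes occur far more often.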
Understanding how to detect and repair those rare failures that take out a system for an hour or more is difficult, mainly because they occur so infrequently. In a system achieving 5 9’s, a failure requiring an hour of unavailability will happen only once in twelve years of field experience. Much of making the system meet its goals is a matter of observing these kinds of failures and implementing new detection and repair procedures that reduce the outage interval from hours to minutes or seconds, but because they occur so infrequently you will not have that chance until the system has years of experience in operation.
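The figure of roughly a dozen years can be checked by asking how many years’ worth of downtime budget a single outage uses up. This is a small sketch, assuming a 365.25-day year and assuming the long outage is the only downtime the system suffers; the function name is illustrative.

```python
# Minutes in a year, assuming 365.25 days (an assumption of this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60

def years_of_budget_consumed(outage_minutes: float, nines: int) -> float:
    """Years of downtime budget used by one outage at the given number of 9's."""
    budget_per_year = 10.0 ** (-nines) * MINUTES_PER_YEAR
    return outage_minutes / budget_per_year

# One hour-long outage measured against a 5 9's budget.
print(f"{years_of_budget_consumed(60, 5):.1f} years")
```

A one-hour outage uses between eleven and twelve years of a 5 9’s budget, and any other downtime in the same period stretches the interval further.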
Another item worth noting is that repairing a failure often requires a greater disruption of service than the failure itself. In order to repair a telephone switch that is not accepting new calls, it may be necessary to tear down existing calls. In order to repair a burned out chip on a circuit card that is affecting only a few customers, the entire card (or even the entire system) may need to be taken down to make a swap. This leads to a fundamental tradeoff between “correct” operation and “continuous” operation. Telephone networks often opt to accept a certain amount of incorrectness (e.g. calls occasionally misrouted or blocked because of faulty data, or noisy connections because of a faulty transmission system), while maintaining availability for most customers. Financial systems make the opposite priority call: if they cannot guarantee that a transaction will be performed correctly, the system refuses it and becomes unavailable.
The implication here for anyone looking for high availability is that you should not expect to achieve those levels immediately. Instead you have to plan for “reliability growth”, a period during which the system unavailability will exceed its design limit and during which failures need to be observed, analyzed, and mitigated. It is also important to understand your priorities, and whether availability is more important than correctness when priority calls must be made.
5 9’s availability is a truly extraordinary target. By contrast most computer servers achieve only 2 9’s availability, limited by the reliability of commercial power and by operations practices that take them off line periodically for maintenance. “High availability” servers achieve 3 to 4 9’s through sparing, backup power, and maintenance operations that do not interrupt services, but few systems come close to 5 9’s.
As noted above, much of achieving high levels of availability is about detecting, diagnosing, and recovering from failures. Simply having the spare equipment isn’t enough unless you can figure out when and how to use it. This applies not only to the system software and hardware, but also to the applications. It does little good to ensure that the hardware is operating correctly if your operation is failing to serve your customers because its data has been corrupted. Most high availability computer systems provide ways to monitor the sanity of an application, but the application has to take advantage of them. Putting a buggy script-based web service on a high availability server will not make the resulting service highly available.
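One common form of application sanity monitoring is a heartbeat: the application reports at regular intervals that it is doing useful work, and a monitor treats prolonged silence as a failure. The sketch below is a generic illustration of that idea and not the interface of any particular high availability platform; the class and method names are hypothetical.

```python
import time

class HeartbeatMonitor:
    """Declares an application unhealthy if it has not reported in time."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self):
        """Called by the application after it completes a unit of real work."""
        self._last_beat = self._clock()

    def is_healthy(self):
        """True if the last heartbeat arrived within the timeout."""
        return (self._clock() - self._last_beat) <= self._timeout
```

The design point is where the application calls beat(). If it is called from a timer that runs regardless of whether requests are being served correctly, the monitor will report a corrupted or stuck application as healthy; calling it only after a successful end-to-end check is what lets the platform notice an application-level failure.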
Reaching 5 9’s requires an extraordinary dedication to anticipating, analyzing, and mitigating failures. A simple anecdote here is worth noting. In the mid 1990’s, AT&T and its then equipment manufacturing arm (now Lucent Technologies) replaced the central processor in AT&T’s 4ESS switches. The replacement was largely transparent to telephone customers, an exceptional feat by itself in that this was the control computer routing all of AT&T’s telephone calls, and each of the over 120 systems replaced had thousands of electrical connections to the switching equipment. The processor implements a unique custom instruction set designed in the 1970’s with very different technology. What illustrates the dedication to availability required, though, is a problem that developed during replacement. These systems have error correcting memory which reports the errors that occur, in order to understand whether there are patterns that show developing faults. AT&T had specified limits for how many of these corrected errors should occur in a system, and the new processor was exceeding them. The reason for the larger than expected number of corrected errors was traced to the discovery that the substrate in the memory chips used in the machines had a higher than specified level of contamination with uranium, and the particles being produced by normal decay of that uranium were causing “soft” errors (changing the state of a bit) that were caught and corrected. What is amazing is that a specification had been written during the design for uranium contamination in the chips, and the manufacturer had failed to meet it. The memory chips were replaced (at a significant cost), and the errors disappeared.
Note that the designers of this system anticipated the problem of memory errors due to radioactive decay and wrote a requirement into the materials to mitigate it, and that even though none of these corrected errors had any impact on service, engineers traced the problem and insisted on a replacement. High availability systems are full of tales like this of extraordinary thought in design followed up by analysis of every failure and no expense spared in mitigating it.
Anyone considering an application requiring extreme reliability should approach the design and implementation of the system with a great deal of care.
Start by understanding the whole system, what availability means for it, and what is going to be required.
With a good idea of what availability needs you have overall and a good idea of what the components need to deliver, you can make much better choices of services, applications, and platforms to meet them. Start looking at the hard questions
In addition to making sure your suppliers are ready to help you achieve 5 9’s, be sure you are ready for a long term commitment to achieving high availability. Systems must be designed to be constantly monitored, analyzed, and improved over time. Human operators need to be trained and retrained to operate them correctly. Designs need to be upgraded to reflect learning from the inevitable failures.