Redundant hardware is used to create highly reliable computer systems. The idea is that a single hardware component failure will not disrupt services. This is especially important for mission-critical or safety-critical systems. For example, NEC rates their Express5800/ft server at 99.999% (five nines) uptime, a downtime of about five minutes a year.
High Availability Versus Fault Tolerant Computers
The terms "high availability" (HA) and "fault tolerant" (FT) are both used to refer to high reliability computers, often interchangeably. However, some vendors use the terms in specific ways. For example, Stratus, VMware, NEC and others define fault tolerant computers as systems that can continue working without interruption (continuous availability) if one hardware component fails.
In contrast, high availability systems can continue working if one hardware component fails, but only after some disruption in service while the system automatically recovers. NEC says that, "A high availability system is characterised by an operational time above 99%." This works out to a downtime of less than about 3.65 days a year.
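The downtime figures quoted for 99.999% and 99% availability follow from simple arithmetic. The sketch below shows the conversion; the function name is illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year (no leap day)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Five nines: a little over five minutes a year.
five_nines_minutes = downtime_minutes_per_year(99.999)

# 99%: convert minutes to days, giving roughly 3.65 days a year.
two_nines_days = downtime_minutes_per_year(99.0) / (24 * 60)

print(f"99.999% -> {five_nines_minutes:.2f} minutes per year")
print(f"99%     -> {two_nines_days:.2f} days per year")
```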
The basic idea behind HA/FT systems is "no single point of failure." A few different "no single point of failure" architectures are examined below.
Enhanced Availability Computers
Standard servers can be purchased with the following configuration, at a cost of only a few thousand dollars:
- Redundant power supply.
- Redundant network cards.
- Redundant hard disks (RAID 1 or RAID 5).
- ECC RAM to detect and correct transient RAM errors.
These "enhanced availability" servers (there is no standard term) are more reliable than a standard PC that has no redundancy. However, the CPU and RAM are still single points of failure. If either fails, the server will crash. More sophisticated and expensive HA/FT servers can tolerate RAM, CPU and other failures.
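A rough way to see why the remaining single points of failure matter is to combine component availabilities. The sketch below assumes that components fail independently and that every component has the same availability; the 0.99 figure is a made-up illustration, not a vendor number. Components that must all work are multiplied together, while a redundant pair is down only if both copies are down.

```python
from math import prod


def redundant(availability: float, copies: int = 2) -> float:
    """Availability of a group that works if at least one copy works."""
    return 1 - (1 - availability) ** copies


def series(*parts: float) -> float:
    """Availability of a chain in which every part must work."""
    return prod(parts)


A = 0.99  # hypothetical availability of each individual component

# Standard PC: power supply, network card, disk, CPU, RAM, none duplicated.
plain = series(A, A, A, A, A)

# "Enhanced availability" server: power, network and disk are duplicated,
# but the CPU and RAM are still single components.
enhanced = series(redundant(A), redundant(A), redundant(A), A, A)

print(f"plain:    {plain:.4f}")
print(f"enhanced: {enhanced:.4f}")
```

Under these assumptions the enhanced server is better than the plain one, but its availability can never exceed that of the unduplicated CPU and RAM taken together.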
Failover Clustering High Availability Servers
Cluster servers use two separate servers for redundancy. If one server goes down, the other server takes over (failover). There is some downtime, but only seconds or minutes. They are also called "failover clusters" to differentiate them from high performance clusters (such as the Linux Beowulf). For example, Microsoft calls their solution Windows Server 2008 Failover Clustering.
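The failover behaviour can be sketched with a simulated heartbeat. This is a minimal illustration, not how any particular product works: the class names, the tick-based clock and the timeout value are all assumptions. The standby takes over once it has missed heartbeats from the active server for longer than the timeout, which is where the short downtime comes from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    active: bool = False
    last_heartbeat: int = 0  # tick at which this node last sent a heartbeat


class FailoverPair:
    """Two servers; the standby promotes itself when heartbeats stop."""

    def __init__(self, primary: Node, standby: Node, timeout_ticks: int = 3):
        self.primary = primary
        self.standby = standby
        self.timeout_ticks = timeout_ticks

    def heartbeat(self, tick: int) -> None:
        self.primary.last_heartbeat = tick

    def check(self, tick: int) -> bool:
        """Return True if a failover happened on this tick."""
        silent_for = tick - self.primary.last_heartbeat
        if self.primary.active and silent_for > self.timeout_ticks:
            self.primary.active = False
            self.standby.active = True
            return True
        return False


def simulate(crash_tick: int, total_ticks: int, timeout_ticks: int = 3) -> Optional[int]:
    """Return the tick at which the standby takes over, or None."""
    pair = FailoverPair(Node("A", active=True), Node("B"), timeout_ticks)
    for tick in range(1, total_ticks + 1):
        if tick < crash_tick:  # the primary only sends heartbeats while alive
            pair.heartbeat(tick)
        if pair.check(tick):
            return tick
    return None


failover_tick = simulate(crash_tick=6, total_ticks=20)
print("standby took over at tick", failover_tick)
```

In this sketch the gap between the crash and the takeover is the service interruption; a shorter timeout reduces it but makes false failovers more likely.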
Advantages
- Possible to use standard servers.
- Relatively cheap.
- Possible to detect and recover from operating system crashes.
- The failover controller function is distributed between both redundant servers, so there is no single point of failure for the controller.
Disadvantages
- Some downtime during the failover process.
- Special "cluster-aware" software is usually required.
- Depending on the design, it is possible for "split brain" error conditions to occur. This is when both redundant servers think that they are the active server. Both try to access the same storage (SAN or shared SCSI) and this causes data corruption.
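The split brain condition can be illustrated with a toy model. The function below is hypothetical and deliberately simplified. It also introduces a third-party "witness" that gives a single tie-breaking vote, which is one common mitigation but is an assumption added here, not something described above. Without the witness, a broken link between the servers leaves both of them active.

```python
from typing import Dict, List, Optional


def active_nodes(link_up: bool,
                 witness_reachable: Optional[Dict[str, bool]] = None) -> List[str]:
    """Return the servers that consider themselves active.

    link_up: whether servers A and B can still exchange heartbeats.
    witness_reachable: per-server reachability of a tie-breaking witness,
        or None if the cluster has no witness (hypothetical model).
    """
    if link_up:
        return ["A"]  # B hears A's heartbeats and stays on standby
    if witness_reachable is None:
        # A keeps running; B assumes A is dead and promotes itself.
        return ["A", "B"]  # split brain
    for name in ("A", "B"):
        if witness_reachable[name]:
            return [name]  # the witness gives its single vote to one server
    return []  # neither server has a majority, so both stand down


print(active_nodes(link_up=False))
print(active_nodes(link_up=False, witness_reachable={"A": False, "B": True}))
```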
Examples
- Microsoft Windows Server 2008 Failover Clustering.
- Sun Solaris Cluster.
- Red Hat Cluster Suite (GNU/Linux).
Lockstep CPU Fault Tolerant Servers
Lockstep CPU systems use two separate CPUs that run the exact same instruction stream, either in the same server or in separate servers. They are called "lockstep" because the two CPUs execute in lockstep. Stratus explains it as, "Lockstep technology. Replicated, fault-tolerant hardware components process the same instructions at the same time. In the event of a component malfunction, the partner component is an active spare that continues normal operation. There is no system downtime and no data loss."
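The idea of an active spare can be sketched in software, with the caveat that real lockstep systems do this in hardware. The sketch assumes that each replica can detect its own fault (simulated here with a `fail_at` step); the `Replica` class and `run_lockstep` function are illustrative names only. Because both replicas have already executed every instruction, the survivor continues without any recovery step.

```python
from typing import Callable, List, Optional

Instruction = Callable[[int], int]


class Replica:
    """One of two components executing the same instruction stream."""

    def __init__(self, name: str, fail_at: Optional[int] = None):
        self.name = name
        self.state = 0
        self.fail_at = fail_at  # step at which a simulated fault occurs
        self.healthy = True

    def step(self, index: int, instruction: Instruction) -> None:
        if index == self.fail_at:
            self.healthy = False  # simulated self-detected hardware fault
            return
        self.state = instruction(self.state)


def run_lockstep(program: List[Instruction], a: Replica, b: Replica) -> int:
    """Run the program on both replicas and return the surviving result."""
    live = [a, b]
    for index, instruction in enumerate(program):
        for replica in live:
            replica.step(index, instruction)
        # Drop a faulty replica; its partner carries on from the same state.
        live = [replica for replica in live if replica.healthy]
        if not live:
            raise RuntimeError("both replicas failed")
    return live[0].state


program = [lambda x: x + 1, lambda x: x * 10, lambda x: x - 3]

no_fault = run_lockstep(program, Replica("cpu0"), Replica("cpu1"))
one_fault = run_lockstep(program, Replica("cpu0", fail_at=1), Replica("cpu1"))
print(no_fault, one_fault)
```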
Advantages
- No downtime when switching from faulty to backup CPU.
- Can run standard software applications and operating systems, without modification.
Disadvantages
- Proprietary hardware required.
- Might not detect and automatically recover from software crashes.
Examples
- HP (Tandem) NonStop.
- Stratus ftServer.
- NEC Express5800/ft.
How to Choose the Right High Availability System
Aside from physical servers, virtual machine versions of failover cluster servers and lockstep CPU solutions are available, for example from VMware. Server load balancers can also be used as a high availability solution.
Choosing an HA/FT solution means looking into the details of software compatibility and the type and level of protection provided. Many solutions emphasize server reliability only. It is important to look at the entire system (network, power supply, software) for single points of failure.
Scott Schnoll has a good introduction to Microsoft Cluster Server. The various vendors have more information on their HA/FT products: HP NonStop Servers, Stratus ftServer, NEC Express5800/ft, Marathon, VMware etc.