Server High Availability/Fault Tolerant Solutions, Architecture

Fault Tolerant Computers Use Redundant Hardware - Sundeip Arora
Types of HA and FT. Failover clusters and lockstep CPUs. Features, specifications, technologies, compatibility, and how to choose.

Redundant hardware is used to create highly reliable computer systems. The idea is that a single hardware component failure will not disrupt services. This is especially important for mission-critical or safety-critical systems. For example, NEC rates their Express5800/ft server at 99.999% (five nines) uptime, a downtime of about five minutes a year.

High Availability Versus Fault Tolerant Computers

The terms "high availability" (HA) and "fault tolerant" (FT) are both used to refer to high reliability computers, often interchangeably. However, some vendors use the terms in specific ways. For example, Stratus, VMware, NEC and others define fault tolerant computers as systems that can continue working without interruption (continuous availability) if one hardware component fails.

In contrast, high availability systems can continue working if one hardware component fails, but only after some disruption in service while the system automatically recovers. NEC says that, "A high availability system is characterised by an operational time above 99%." This works out to a downtime of less than about 3.65 days a year.
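The downtime figures quoted for 99% and 99.999% availability can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year; the function name is chosen for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, allowed by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours / 24:.3f} days, or {hours * 60:.1f} minutes, per year")
```

At 99% the result is 87.6 hours (3.65 days) a year, and at 99.999% it is a little over five minutes.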

The basic idea behind HA/FT systems is "no single point of failure." A few different "no single point of failure" architectures are examined below.

Enhanced Availability Computers

Standard servers can be purchased with the following configuration, at a cost of only a few thousand dollars:

  • Redundant power supply.
  • Redundant network cards.
  • Redundant hard disks (RAID 1 or RAID 5).
  • ECC RAM to detect and correct transient RAM errors.

These "enhanced availability" servers (there is no standard term) are more reliable than a standard PC that has no redundancy. However, the CPU and RAM are still single points of failure. If either fails, the server will crash. More sophisticated and expensive HA/FT servers can tolerate RAM, CPU and other failures.
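The effect of duplicating some components while leaving others single can be estimated with the usual series and parallel availability formulas. This is a rough sketch that assumes component failures are independent, and the per-component availability figures are made up for illustration rather than taken from any vendor.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are required: the system is up only if every one is up."""
    return prod(availabilities)


def redundant(availability: float, copies: int = 2) -> float:
    """Any one copy suffices: the group is down only if every copy is down."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Hypothetical per-component availabilities (not vendor data).
    psu, nic, disk, cpu, ram = 0.999, 0.999, 0.995, 0.9995, 0.9995

    plain = series(psu, nic, disk, cpu, ram)
    enhanced = series(redundant(psu), redundant(nic), redundant(disk), cpu, ram)

    print(f"no redundancy:         {plain:.5f}")
    print(f"enhanced availability: {enhanced:.5f}")
```

In this model the duplicated parts contribute very little downtime, so the single CPU and RAM dominate what remains.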

Failover Clustering High Availability Servers

Cluster servers use two separate servers for redundancy. If one server goes down, the other server takes over (failover). There is some downtime, but only seconds or minutes. They are also called "failover clusters" to differentiate them from high performance clusters (such as the Linux Beowulf). For example, Microsoft calls their solution Windows Server 2008 Failover Clustering.
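The article does not say how a cluster decides that a server has gone down. One common approach, assumed here, is a heartbeat timeout: the standby server takes over when it has not heard from the active server for some period. The class name and timeout value below are hypothetical, and time is passed in explicitly so that the sketch is deterministic.

```python
from dataclasses import dataclass


@dataclass
class StandbyNode:
    """Standby server that promotes itself when the active peer goes silent."""

    timeout_s: float               # hypothetical heartbeat timeout, in seconds
    last_heartbeat_s: float = 0.0  # time the last heartbeat was received
    active: bool = False

    def on_heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the active server."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        """Promote this node if the peer has been silent longer than the timeout."""
        if not self.active and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = True  # failover: this node takes over the service
        return self.active


standby = StandbyNode(timeout_s=3.0)
for t in (1.0, 2.0, 3.0):     # the active server is healthy and sends heartbeats
    standby.on_heartbeat(t)
    standby.check(t)
print(standby.check(5.0))     # False: only 2 s of silence
print(standby.check(7.0))     # True: 4 s of silence, so the standby takes over
```

The timeout is the source of the failover downtime: the service is unavailable from the moment the active server fails until the standby notices and finishes taking over.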

Advantages

  • Possible to use standard servers.
  • Relatively cheap.
  • Possible to detect and recover from operating system crashes.
  • The failover controller function is distributed between both redundant servers, so there is no single point of failure for the controller.

Disadvantages

  • Some downtime during the failover process.
  • Special "cluster-aware" software is usually required.
  • Depending on the design, it is possible for "split brain" error conditions to occur. This is when both redundant servers think that they are the active server. Both try to access the same storage (SAN or shared SCSI), which causes data corruption.
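One widely used way to guard against split brain is a majority quorum: a server may run the service only while it can reach more than half of the cluster's votes. The sketch below assumes a two-server cluster with a third tie-breaker vote (often called a witness). This is an illustrative design, not a description of any specific product listed in this article.

```python
def may_become_active(votes_reachable: int, total_votes: int) -> bool:
    """A node may run the service only if it can see a strict majority of votes."""
    return votes_reachable > total_votes // 2


# Two servers plus one witness give three votes in total.
# Suppose the link between the two servers fails.
TOTAL_VOTES = 3
server_a_sees = 2  # its own vote plus the witness
server_b_sees = 1  # only its own vote

print(may_become_active(server_a_sees, TOTAL_VOTES))  # True
print(may_become_active(server_b_sees, TOTAL_VOTES))  # False
```

With only two votes and no witness, each isolated server would see one vote out of two and neither would be allowed to run, which is why an odd number of votes is the usual choice.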

Examples

  • Microsoft Windows Server 2008 Failover Clustering.
  • Sun Solaris Cluster.
  • Red Hat Cluster Suite (GNU/Linux).

Lockstep CPU Fault Tolerant Servers

Lockstep CPU systems use two separate CPUs that run the exact same instruction stream, either in the same server or in separate servers. They are called "lockstep" because the two CPUs run in lockstep. Stratus explains it as, "Lockstep technology. Replicated, fault-tolerant hardware components process the same instructions at the same time. In the event of a component malfunction, the partner component is an active spare that continues normal operation. There is no system downtime and no data loss."
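The idea of an "active spare" can be illustrated with a toy software model. Real lockstep systems do this in hardware, cycle by cycle; the sketch below only mimics the behaviour. It assumes each replica can detect its own fault (simulated here by a `fail_at` step), and all names are hypothetical.

```python
from typing import Callable, List, Optional

Instruction = Callable[[int], int]


class Replica:
    """One of the redundant processing units in the toy model."""

    def __init__(self, name: str, fail_at: Optional[int] = None) -> None:
        self.name = name
        self.fail_at = fail_at  # step at which a simulated fault is detected
        self.healthy = True
        self.state = 0

    def step(self, index: int, instruction: Instruction) -> None:
        """Execute one instruction, or drop out if the simulated fault occurs."""
        if index == self.fail_at:
            self.healthy = False
            return
        self.state = instruction(self.state)


def run_lockstep(program: List[Instruction], replicas: List[Replica]) -> int:
    """Feed every instruction to all healthy replicas and return a survivor's state."""
    for index, instruction in enumerate(program):
        for replica in replicas:
            if replica.healthy:
                replica.step(index, instruction)
        if not any(r.healthy for r in replicas):
            raise RuntimeError("all replicas failed")
    survivor = next(r for r in replicas if r.healthy)
    return survivor.state


program = [lambda x: x + 1, lambda x: x * 10, lambda x: x - 3]
result = run_lockstep(program, [Replica("A", fail_at=1), Replica("B")])
print(result)  # 7, the same answer as a run with no failure
```

Because replica B has executed every instruction alongside A, there is nothing to restart or replay when A drops out, which is why no downtime is involved.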

Advantages

  • No downtime when switching from faulty to backup CPU.
  • Can run standard software applications and operating systems, without modification.

Disadvantages

  • Proprietary hardware required.
  • Might not detect and automatically recover from software crashes.

Examples

  • HP (Tandem) NonStop.
  • Stratus ftServer.
  • NEC Express5800/ft.

How to Choose the Right High Availability System

Aside from physical servers, virtual machine versions of failover cluster servers and lockstep CPU solutions are available, for example from VMware. Server load balancers can also be used as a high availability solution.

Choosing an HA/FT solution means looking into the details of software compatibility and the type and level of protection provided. Many solutions emphasize server reliability only. It is important to look at the entire system (network, power supply, software) for single points of failure.
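A simple way to start such a review is to list every component the service depends on and flag anything that exists only once. The inventory below is entirely hypothetical; a real review would also need to consider shared dependencies, such as two servers plugged into the same power feed.

```python
from typing import Dict, List


def single_points_of_failure(inventory: Dict[str, int]) -> List[str]:
    """Return the components that have fewer than two independent units."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical inventory: component name -> number of independent units.
inventory = {
    "server": 2,
    "storage array": 2,
    "network switch": 1,
    "power feed": 1,
}

print(single_points_of_failure(inventory))  # ['network switch', 'power feed']
```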

Scott Schnoll has a good introduction to Microsoft Cluster Server. The various vendors have more information on their HA/FT products: HP NonStop Servers, Stratus ftServer, NEC Express5800/ft, Marathon, VMware, etc.


Yuen Kit Mun - Kit Mun is a self-confessed information junkie, reading an average of a book a week over the past two decades. His growing Internet ...


Comments

Jun 7, 2010 10:16 AM
Guest :
Nice job with the fault tolerance overview- it can be a hairy subject. I work at Stratus and keeping the terms straight in an ever-changing industry is hard.
I was wondering if you wanted to talk about calculating the cost of downtime for your readers, and how to determine the level of availability an IT admin would need for his mission critical applications.
Thanks for the article!