Reliability vs. “High Availability”
In the past couple of weeks, two servers that LSCR manages have had serious hardware problems. One was a new server located in the campus data center; the other was a very old server located in the basement of Campbell Hall. In both cases, users experienced functional problems and downtime over an extended period. Coincidentally, both problems were related to the affected machine’s disk controller.
Server-class computers are generally quite reliable. Disk drives are the most common source of hardware failure, and even those have an MTBF (Mean Time Between Failures) of 300,000 hours or more. That’s a pretty big number (about 34 years), although when you consider that a major server often has 12 or more disks, you can see that a typical server has a decent chance of seeing a disk failure during its normal lifetime. We build in redundancy for disk failures; using RAID (Redundant Array of Inexpensive Disks), we can install an array of disks and set them up so that no single disk failure will take the server down or cause data loss. Similarly, we use redundant power supplies on different circuits to avoid local power problems. The campus data center has both a UPS (Uninterruptible Power Supply) to handle short power grid outages and a diesel generator to handle longer outages; the generator can keep running as long as there are still trucks to deliver fuel.
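To put a rough number on that "decent chance," here is a minimal sketch of the arithmetic. It assumes disks fail independently at a constant rate (an exponential model, which real disks only approximate) and a five-year service life; both of those are illustrative assumptions, not figures from our own hardware. Only the 300,000-hour MTBF and the 12-disk count come from the text above.

```python
import math

MTBF_HOURS = 300_000      # per-disk MTBF quoted above
DISKS = 12                # disk count quoted above
LIFETIME_YEARS = 5        # assumed service life (hypothetical)
HOURS_PER_YEAR = 24 * 365


def p_any_disk_fails(disks, mtbf_hours, years):
    """Chance that at least one disk fails in the given period.

    Assumes independent disks with a constant failure rate.
    """
    expected_failures = disks * years * HOURS_PER_YEAR / mtbf_hours
    return 1 - math.exp(-expected_failures)


mtbf_years = MTBF_HOURS / HOURS_PER_YEAR
p_fail = p_any_disk_fails(DISKS, MTBF_HOURS, LIFETIME_YEARS)

print(f"MTBF per disk: {mtbf_years:.1f} years")
print(f"Chance of at least one disk failure in {LIFETIME_YEARS} years: {p_fail:.0%}")
```

Under these assumptions the chance of at least one disk failure over five years comes out to roughly 83%, which is why the RAID redundancy matters even though each individual disk looks very reliable.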
That level of redundancy brings us up to something like 99.9% uptime. 99.9% (referred to in the industry as “three nines”) sounds like a lot, but it’s equivalent to having your server down for about 44 minutes a month, or almost nine hours a year. When that downtime is planned, it’s not too bad, but when it’s unplanned, it can be a huge disruption to the departments using our servers.
“High Availability” is an industry term generally used to refer to systems designed for availability of “three nines” or above. Getting to “four nines” (99.99% uptime, a little under an hour of downtime per year) or “five nines” (99.999% uptime, about 5 minutes of downtime per year) requires a much larger investment in hardware. A typical configuration will include wholly redundant hardware, including spare servers that don’t do anything except wait for another server to fail. In front of that might sit a hardware load balancer, which makes the multiple machines look like one server to the outside world. Then you might have redundant network paths, with two or more different Ethernet connections going to two or more different routers, which have different fiber-optic connections to different service providers.
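The downtime allowed at each "nines" level is simple arithmetic on the number of minutes in a year. A small sketch, assuming a 365-day year and an even split across twelve months (the function name is just illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed at a given uptime percentage."""
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


levels = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]

for label, pct in levels:
    per_year = downtime_minutes_per_year(pct)
    per_month = per_year / 12
    print(f"{label} ({pct}%): {per_year:.1f} min/year, {per_month:.1f} min/month")
```

Three nines works out to about 526 minutes (just under nine hours) a year, four nines to about 53 minutes, and five nines to a little over 5 minutes. Each extra nine cuts the downtime budget by a factor of ten.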
With all this stuff, you have to evaluate how much it would cost relative to how much additional uptime you would gain. For our operation, we don’t really have the funding to go above three nines, and in most cases it’s not really necessary; there are campus services which provide higher availability for someone who needs four or five nines. This is why we can offer a free web hosting service to departments, when IST’s service costs $30/month; IST’s service has a more robust (and therefore more expensive) infrastructure that we can’t hope to replicate on the cheap.
We will continue to look for ways to make our servers more reliable, and to improve our disaster recovery procedures. In these cases, if we had migrated the files on the servers to our NetApp storage device, we could have relatively easily brought the services back up on a different piece of hardware. However, our NetApp itself isn’t designed for high availability: it has redundant disks, but the controller is a single point of failure. This is an example of the kind of thing you have to deal with to build a highly available system; not only do you have to build redundancy into all of your hardware, but you also have to build redundancy into everything it connects to. Otherwise you’re just moving the point of failure.
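The "moving the point of failure" effect can be illustrated with the standard series and parallel availability formulas. This is only a sketch: the availability numbers below are made up for illustration and are not measurements of our NetApp or of any real hardware, and the formulas assume components fail independently.

```python
def parallel(*availabilities):
    """Availability of redundant components: down only if all are down."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1 - a)
    return 1 - p_all_down


def series(*availabilities):
    """Availability of a chain: up only if every component is up."""
    p_all_up = 1.0
    for a in availabilities:
        p_all_up *= a
    return p_all_up


DISK = 0.99          # hypothetical availability of one disk
CONTROLLER = 0.999   # hypothetical availability of the single controller

redundant_disks = parallel(DISK, DISK)            # two disks, either one suffices
storage_system = series(redundant_disks, CONTROLLER)

print(f"Redundant disk pair: {redundant_disks:.4%}")
print(f"Disks plus single controller: {storage_system:.4%}")
```

With these made-up numbers, the redundant disk pair reaches 99.99%, but the whole system is held to just under 99.9% by the lone controller. A chain can never be more available than its least redundant link.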
Both servers are back up and running normally now. We’re accelerating our migration off the old one, and trying to improve our recovery procedures on the new one. Unfortunately, the fact that we’ve already had 8 hours of downtime this year doesn’t mean it can’t happen again; all we can do is learn from this history and try to plan for the next problem.
