## On Five Nines

### December 16, 2006

People discussing reliability often toss around the claim that their particular system meets the “five nines” criterion (99.999% availability), most often interpreted as “this system will experience only five minutes of downtime per year”. The use of this metric is an excellent example of the Pinto problem, in that it preys upon people’s weakness when it comes to probability.

Five nines is a useful sales metric. It is *not* a useful disaster planning metric because, even if the claim is accurate, it measures only the average behaviour of the system. Disaster recovery planning can be done well only if we understand *both* the average case *and* the worst case.

So, let’s assume that we have a system that meets the five nines criterion. If we let the random variable D denote that system’s downtime over a year, we can write a simple equation: E(D) = 5.26 minutes, which is 0.001% of the 525,600 minutes in a year. What does this equation tell us? Only that, if we have a large number of these systems, we should expect their average downtime to be about five minutes per year.
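For anyone who wants to check the arithmetic, here is a minimal sketch. The function name `downtime_budget_minutes` is my own invention, and the year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_budget_minutes(nines: int) -> float:
    """Expected annual downtime, in minutes, for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten, but the result is still only an average.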

What we really need to know, though, is what we should expect when a failure does occur. If F denotes the event that a system experiences a failure, we’d like to know E(D|F), which vendors rarely provide. In words: once a system has experienced a failure, how long will it take to recover? The chances are quite good that, in a five nines system, it’s going to be much longer than five minutes. Why? The reliability metric E(D), from which the five minutes claim is drawn, represents the *average* case, which includes many systems *in which no failure ever occurs.*

Of course, for good disaster planning, we still do need P(F), the probability that a failure occurs at all. Reading the five nines claim as a statement about P(F) is one (incorrect) way to interpret it. We would like to know, when selecting a system, which is most likely to experience a failure. But confusing P(F) with E(D), which is what the common interpretation of five nines does, leads to much unnecessary acrimony between customers and vendors, as well as poorly executed disaster planning.

Using five nines as a metric leaves a lot of room for variation in failure characteristics. If an average system fails ten times in a year, but each failure results in only thirty seconds of downtime (think path failover), then a vendor is still justified in claiming five nines. But the same metric can be claimed if, for every hundred systems sold, only one of them experiences a failure event that results in eight hours of downtime. Planning for ten path failovers is very different from planning for significant downtime.
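The sketch below compares two hypothetical fleets in the spirit of the scenarios above. The figures are illustrative assumptions: ten thirty-second failovers per system in the first fleet, one eight-hour outage per hundred systems in the second. The `Fleet` class and its field names are made up for this example.

```python
from dataclasses import dataclass

FIVE_NINES_BUDGET = 365 * 24 * 60 * 1e-5  # about 5.26 minutes per year


@dataclass
class Fleet:
    name: str
    p_failure: float           # P(F): chance that a system fails at all in a year
    downtime_if_failed: float  # E(D|F): annual downtime in minutes, given a failure

    def expected_downtime(self) -> float:
        # E(D) = P(F) * E(D|F); systems that never fail contribute zero downtime
        return self.p_failure * self.downtime_if_failed


fleets = [
    # Assumed: every system fails over ten times, thirty seconds each
    Fleet("path failover", p_failure=1.0, downtime_if_failed=10 * 0.5),
    # Assumed: one system in a hundred is down for eight hours
    Fleet("rare outage", p_failure=0.01, downtime_if_failed=8 * 60),
]

for fleet in fleets:
    e_d = fleet.expected_downtime()
    print(
        f"{fleet.name}: E(D) = {e_d:.2f} min, "
        f"E(D|F) = {fleet.downtime_if_failed:.0f} min, "
        f"within five nines: {e_d <= FIVE_NINES_BUDGET}"
    )
```

Both fleets average roughly five minutes of downtime per system, yet the downtime an affected customer actually sees differs by about two orders of magnitude.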

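I should also note that if we are given accurate numbers for both E(D) and E(D|F) (reliability and, roughly, mean time to recovery), then we should be able to calculate P(F). Systems that never fail contribute no downtime, so E(D) = P(F) × E(D|F), and therefore P(F) = E(D) / E(D|F). With all three numbers in hand, we can develop a much more robust plan than we can with either metric on its own.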
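As a rough sketch of that calculation, assume that systems with no failure contribute zero downtime, so that E(D) = P(F) × E(D|F). The function name and the eight-hour recovery figure below are hypothetical.

```python
def failure_probability(expected_downtime: float, downtime_given_failure: float) -> float:
    """Solve E(D) = P(F) * E(D|F) for P(F). Both arguments must use the same time unit."""
    if downtime_given_failure <= 0:
        raise ValueError("E(D|F) must be positive")
    p = expected_downtime / downtime_given_failure
    if p > 1:
        raise ValueError("E(D) cannot exceed E(D|F)")
    return p


# Hypothetical inputs: a five nines claim (5.256 minutes per year)
# and an eight-hour expected downtime once a failure has occurred.
p_f = failure_probability(5.256, 8 * 60)
print(f"P(F) = {p_f:.1%}")
```

With these inputs, about one system in ninety would be expected to fail in a given year, which is a far more useful planning figure than “five minutes”.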