Where I’ve Been

June 17, 2008

It lives!

One day a long time ago, during a discussion of data retention best practices in photography at The Online Photographer, the question came up, “Why does it always come down to money? Why and how should I put a price tag on my personal photos?” Generally speaking, that’s a common reaction. And although the response, “Money/value is simply a means of expressing the utility of data retention for a given piece of data,” is a good answer for the why, the question of the how has been eating away at me. At the risk of being labeled a capitalist Philistine, I’ve finally come up with a thought experiment that answers it.

Imagine yourself as a photographer. A shadowy stranger approaches you and offers to buy one of your photographs at a price to be negotiated, but there’s a catch: he wants to buy the entire photograph. Everything. The negative, any prints you’ve made, all digital files containing it. It will be as if you’d never taken that picture, except for the memory in your head. How much money would it take for you to be willing to completely forsake that picture?

Now the experiment takes a twist. You are that shadowy stranger, and you are really your own rich doppelgänger. Your one desire is to completely assume the life of your twin, and you are willing to expend quite a bit of time, energy, and money to do so. Those photographs represent memories and incidents that you need in order to become complete. So, for that photograph, how much are you willing to spend?

Arrived at through a fair and honest negotiation with yourself, the price that you and your doppelgänger agree on represents the personal value, in monetary terms, of that photograph. Iterate through your entire archive, and you’ve arrived at the “cost” of your data.

You might take a shortcut by dividing your photographs into categories, each category containing pictures of roughly equal value. You could then pick a representative of each category, negotiate for it, and multiply the cost by the size of that category.

The cost you arrive at gives you a rough value to plug into your personal utility tradeoff of time, money, and energy spent on data retention. If your pictures are worth $500, you probably shouldn’t even bother. If the value’s $10,000, maybe you should be running backups. At $100,000, I’d sure as heck be doing off-sites. And although my bias is towards digital data retention practices, the same thought experiment applies to film. It’s only the techniques that differ.
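If it helps to see the shortcut as arithmetic, here’s a minimal sketch in Python. The categories, prices, counts, and dollar thresholds are all hypothetical; substitute whatever numbers your own negotiation produces.

    # A minimal sketch of the category shortcut. Every number here is made up;
    # plug in your own results from the doppelgänger negotiation.
    categories = {
        # category name: (negotiated price for one representative photo, photos in category)
        "family snapshots": (40.00, 2_500),
        "travel":           (5.00, 4_000),
        "portfolio prints": (250.00, 120),
    }

    total_value = sum(price * count for price, count in categories.values())
    print(f"Estimated personal value of the archive: ${total_value:,.2f}")

    # Illustrative thresholds only, roughly matching the ranges above.
    if total_value < 10_000:
        print("Maybe not worth much effort.")
    elif total_value < 100_000:
        print("You should probably be running regular backups.")
    else:
        print("Time to be doing off-sites.")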

Boy, this post is much drier than I thought it would be. Any reactions to the experiment or thoughts on improving it would be appreciated.

Bugga-bugga-bugga

January 12, 2007

I was once sitting with a client in the offices of a major operating system development company, which shall remain unnamed, for now. The program manager for an “Advanced” new version of the OS was presenting information to try and convince us that the company had overcome its reputation for unreliable software. At one point, he put up a pie chart, representing data they’d collected on a broad sample of actual failures.

He pointed towards the largest slice of the pie. “As you can see, 85% of on-site, production failures were the result of user errors, so we can ignore those.”

My mind just flipped. I didn’t say anything, since I was supposed to be cooperating in a multi-company project, but I was gibbering up and down my mental halls.

85% of their failures were the result of user errors. And they didn’t see that as a problem.

All I’m going to say now is, I’m sure glad I use a Mac.

UPDATE: OK, it’s not really fair for me to use this example as a Mac promotion. All I’m really saying is that for the program manager of a major operating system to stand in a room full of high-powered disaster recovery professionals and state that 85% of their downtime could be ignored because it was caused by stupid users was astounding. It demonstrated a degree of cluelessness that was mind-boggling.

So a colleague says to me, “You know, every example you give to me of a second-order effect seems to me to be either political or social. Why don’t you just call it the ‘social value’ of the data?”

I sputtered, “No, no, they have real economic impact. Look at the Pinto…”

“Blah, blah, blah. So? Social problems cost money, too.”

The more I think about it, the more I like his term. I hold that in a disaster recovery planning exercise, there are two types of second-order effect:

  1. The social value of the asset, which is the part that most often, in my experience, does the real biting, and
  2. The expanded failure domain of the asset: the applications that depend on the applications that depend on the data. It’s worth noting here that, strictly speaking, there is no such thing as an “expanded” failure domain, since failure domain membership is transitive. But I still call it out explicitly, to make sure that it is accounted for; a small sketch of computing that closure follows this list.
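Since failure domain membership is transitive, the “expanded” domain is just everything that can reach the asset through the chain of dependencies. Here’s a minimal sketch of computing it; the application names and dependency map are entirely hypothetical.

    from collections import defaultdict

    # Hypothetical map: each application and the assets it directly depends on.
    depends_on = {
        "billing-app":   {"customer-db"},
        "reporting-app": {"billing-app", "warehouse-db"},
        "dashboard":     {"reporting-app"},
    }

    # Invert the edges: for each asset, who depends on it directly?
    dependents = defaultdict(set)
    for app, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(app)

    def failure_domain(asset):
        """Everything that fails, transitively, if the given asset fails."""
        domain, stack = set(), [asset]
        while stack:
            for dependent in dependents[stack.pop()]:
                if dependent not in domain:
                    domain.add(dependent)
                    stack.append(dependent)
        return domain

    print(sorted(failure_domain("customer-db")))
    # ['billing-app', 'dashboard', 'reporting-app']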

The first question that Schneier asks in his security framework is:

  • What assets are you trying to protect?

This question maps almost directly into a disaster recovery framework. It is the first step in building a solid understanding of the problems that you are trying to solve, and it is also the step that is most often glossed over. Of course, cataloguing all of the data that is being managed in a data center is not just daunting, but often nearly impossible. But it is very reasonable to expect that a data center administrator already has at least a broad idea of the amount and categories of data and applications that are in play. Arranging that information, and going to the additional trouble of tabulating some of its characteristics, will give a clearer picture of what a disaster recovery plan should address.

There are two very broad categories of assets that should be considered in a data disaster recovery framework: the data itself, and the availability of that data. It is not intuitively obvious to most people why the two should be considered separately, but doing so can result in significant opportunities for cost savings and simplifications. Two different data assets with the same criticality may have very different availability requirements, which may in turn lead to two different disaster recovery mechanisms. (Where do applications fit? My simplistic approach is to treat them as a type of data, with a specified availability requirement, and I haven’t run into a problem with that approach.)

I can’t give a one-size-fits-all answer to asset categorization. Every data center has a different set of users and requirements. I can say this, though: don’t start with the disaster recovery categories. Start with a “natural” categorization, by users, applications, or any other appropriate view. Clusters of these natural categories will lead to a better set of disaster recovery categories.

For each natural category, the following data should be gathered (a minimal record sketch follows this list):

  • Who owns this asset?
  • Who uses this asset?
  • What is the direct value of the data?

    By direct value here, I mean: is there a specific monetary value attached to this data? How much did it cost to get it? What would it cost to reconstruct it, if that’s possible? Does it represent actual money, e.g. a customer transaction?

  • What is the monetary penalty for loss of access to that data, stated as an hourly rate?

    An additional question here is: is there a period of time during which loss of access incurs no penalty? This question is not about when the loss occurs, since you can’t plan to ensure that failures only happen during off-hours, but about how often access is absolutely required, i.e. how long scheduled tasks can be deferred without penalty during an outage.

  • What are the workload requirements on this data?

    The question of how much bandwidth is required for a particular category is often missed in a first-pass review, and it is crucial for making decisions about the type of disaster recovery solution that is appropriate. Clearly, the more accurately the workload is specified, e.g. by average bandwidth, average update rate, burst bandwidth, burst updates, unique update rate, etc., the more accurately you can plan, but, again, diminishing returns. At a minimum, ballpark figures for average bandwidth and update rate are required.

  • What are the indirect values of this data?

    This question is the trickiest one to answer, as it is not always expressible in monetary terms. Consider, for example, student data at a university data center. There is clearly not much monetary value there, but the loss of either availability or data can have drastic impacts, both on individuals and on the data center. Lost theses, delayed homework assignments, frustrated students: all of these are part of the Pinto problem: failure is more enduring than success. The political implications for the data center should be weighed, in this case.

    As a different example, loss of customer transactions during an outage can be considered a penalty based on the direct value of an asset. But even a brief outage can result in a much higher indirect cost: loss of customers. Again, failure is more enduring than success.
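To make the bookkeeping concrete, here is one way the answers might be recorded, sketched in Python. The field names and figures are hypothetical, not a schema I’m prescribing; the only point is that every question above gets a slot, even if the honest answer is “unknown”.

    from dataclasses import dataclass, field

    @dataclass
    class AssetCategory:
        name: str
        owner: str
        users: list[str]
        direct_value_usd: float | None        # cost to acquire or reconstruct, None if unknown
        outage_penalty_usd_per_hour: float    # monetary penalty for loss of access
        penalty_free_window_hours: float      # how long access can lapse without penalty
        avg_bandwidth_mb_s: float             # ballpark workload figures
        avg_update_rate_per_s: float
        indirect_values: list[str] = field(default_factory=list)

    # A hypothetical record for the university example discussed above.
    student_data = AssetCategory(
        name="student home directories",
        owner="university IT",
        users=["students", "faculty"],
        direct_value_usd=None,
        outage_penalty_usd_per_hour=200.0,
        penalty_free_window_hours=8.0,
        avg_bandwidth_mb_s=30.0,
        avg_update_rate_per_s=50.0,
        indirect_values=["lost theses", "political fallout for the data center"],
    )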

If I go on and on about second-order effects and indirect values, it’s because that’s where I most often see real problems manifest, precisely because that’s where planners have failed to look. Addressing these kinds of problems will be the topic of a later post, but as a teaser: it doesn’t always have to involve significantly more cost, or even a different DR approach. Often, the right solution involves just publicly acknowledging the existence of a technical problem and publicly apologizing. Really. A little humility goes a long way towards erasing failure from the public mind.

If a data center administrator can answer these questions for each of the “natural” categories of assets under their control, they’ve gone a long way towards building a disaster recovery plan. The next step will be to use these natural categories to build a DR matrix, but that’s a subject for a later post.

On Diminishing Returns

January 10, 2007

One more note before I get into the DR framework itself:

I am a fan of comprehensive planning. I am also an advocate, although decidedly not a fan, of process for the control of plans, as long as that process allows for sufficient flexibility. But in any planning, and in any process, you must be aware of the law of diminishing returns.

Sure, it would be comprehensive to account for every type of file that every user might require. Sure, a process could require every user, or even every organization, to file a disaster recovery “flight plan” for every type of file that you have added to your catalog. Sure, it would be comprehensive to consider every eventuality, up to and including a nuclear strike on your data center.

And sometimes, I’m sure, as I write about this disaster recovery framework, it may sound like that’s exactly what I’m advocating. Let me assure you, though, that I am not. Planners, you see, have a tendency to forget to consider the cost of the planning itself, at least until the deadlines loom. And in an idealized world, where planning and cataloguing cost nothing, it makes sense to be as comprehensive and complete as possible.

What you should do, then, in applying this framework, is to weigh the planning costs against the benefits. Do as much planning as you can reasonably afford to. Get the users and organizations involved as much as they can afford to. But don’t overburden yourself or others by planning at too fine a grain, or for eventualities that are too far beyond the pale.

As a closing example: I had a client for whom I built a large extended cluster with remote storage mirroring. We spent a great deal of time and money designing a solution that could survive a site failure, and automatically fail over after thirty seconds. The solution was finally deployed in early 2001, at the World Trade Center in NY.

On September 11, the system worked flawlessly, failing over and restoring secondary site access after about thirty-five seconds. Or so we were told, five days later, when the system administrators finally got around to attempting to use the remote site.

Keep that in mind.

Update: actually, another important lesson to be learned from that example is that we were planning for the last disaster (the ’93 WTC bombing). Our technical solution worked for both, but the appropriateness of a near real-time failover was always pretty questionable.

The Perils of Chargeback

January 6, 2007

I shall soon post a discussion of the first question in Schneier’s framework and how it relates to disaster recovery, but as I thought about it, I realized both that I wanted to discuss chargeback and that the points I wanted to raise would inappropriately dominate that discussion. So I’m separating chargeback into this post.

When dealing with the problem of categorizing data from a broad set of organizational entities, the chargeback approach is quite tempting. Forcing your organizational customers to make their own value judgements about the availability and reliability of their data both eases your role and can lead to better assignments. After all, your customers are the ones who understand their data requirements best (hopefully). However, your role as disaster recovery expert should lead you to be cautious in relying solely on chargeback. You need to be the individual (or set of individuals) who is responsible for taking a broad, holistic view of disaster recovery planning.

In particular, there are two ways that chargeback can lead to poor disaster planning:

  1. Your organizational customers are not experts. They may make poor value judgements in an effort to spare their budgets, and
  2. Many organizations do not operate as profit centers, yet are crucial to the operation of the overall company.

When establishing a chargeback process, you must make sure that you are available for DR consultation, and proactive about communicating risks. Use the DR framework to establish guidelines, not just for first-order categorization, but for the process of examining second-order impacts. As an example, don’t just establish the basic three tiers (mission-critical, business-critical, non-critical) of data and ask each organization to wedge their data into them. Ask the businesses to run through the scenarios of what would happen in each failure case. What other systems would be affected? Who else would have visibility into the failure? How would schedules be affected? Would customers be affected?
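As a sketch of what such guideline material might look like (the tiers and questions below are only illustrative, not a canned process):

    # Illustrative only: the usual three tiers, plus the second-order questions
    # each organization should walk through before settling on a tier.
    TIERS = ["mission-critical", "business-critical", "non-critical"]

    SCENARIO_QUESTIONS = [
        "What other systems would be affected by a loss of this data?",
        "Who outside the owning group would have visibility into the failure?",
        "How would schedules slip during an outage?",
        "Would customers be affected, directly or indirectly?",
    ]

    def review_worksheet(asset_name: str) -> str:
        """Render a scenario worksheet for one asset; the owning organization fills in the answers."""
        lines = [f"Scenario review for: {asset_name}"]
        lines += [f"  - {question}" for question in SCENARIO_QUESTIONS]
        lines.append(f"  Proposed tier (one of: {', '.join(TIERS)}): ________")
        return "\n".join(lines)

    print(review_worksheet("customer order history"))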

You should be spending a great deal of time worrying about these questions: dreaming up failure cases, guessing at their likelihood, and running through scenarios. You should take the opportunity to share the pain. (Side note: I don’t know about you, but I find it actually depressing to spend too much time thinking about failures, and I also find that talking about them with other people cheers me up a little. Although it probably depresses them. I can be a real downer at parties.)

As far as budgets go, a chargeback system should be sensitive to the size of an organization’s budget relative to the importance of its data. An R&D group, for example, may not be given much budget, but the intellectual property contained within its systems may represent huge value for the company. If the budget masters of the company dole out funds without earmarking anything specific for disaster recovery, the temptation for managers is to direct their funds towards resources that may eventually pay off: e.g. research. Nobody I know, on getting their budget, says, “Great! Now I can finally afford to have my data remotely mirrored!” I’m in favor of discounting chargebacks based on whatever criteria you find appropriate, but other approaches can be used.
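For illustration only, here is one hypothetical way a discount could track the gap between an organization’s budget and the value of its data. The formula and the numbers are invented, not a recommendation; the point is just that the adjustment can be made mechanical rather than ad hoc.

    def discounted_dr_charge(base_charge: float, data_value: float, org_budget: float,
                             max_discount: float = 0.75) -> float:
        """Shrink the DR chargeback as the value at risk outgrows the owning organization's budget."""
        if org_budget <= 0:
            return base_charge * (1 - max_discount)
        ratio = data_value / org_budget
        # No discount while value <= budget; approach max_discount as the ratio grows.
        discount = max_discount * (1 - 1 / ratio) if ratio > 1 else 0.0
        return base_charge * (1 - discount)

    # A hypothetical R&D group: small budget, valuable intellectual property.
    print(discounted_dr_charge(base_charge=10_000, data_value=5_000_000, org_budget=250_000))
    # 2875.0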

A laissez faire chargeback policy can be a great cover-your-ass strategy. When disaster strikes, as long as the system that you’ve designed behaves as you’ve predicted and advertised, you can defer all responsibility for any unpleasant side effects. If an organization loses critical data, you can say, “They should have paid for better storage.” If the company is impacted by the losses of a single division, you can say, “They should have had a bigger budget.” But here’s where my squishy liberal side comes into play: the people who play the role of disaster recovery managers should consider themselves responsible for the welfare of the entire company. They should have the flexibility to set policy with the good of the business in mind, and they should be responsible for evangelizing comprehensive and rational disaster recovery planning.

Your job shouldn’t begin and end with the design of storage systems. You should be a shepherd, guiding your company through stormy nights and the occasional hurricane.

On Five Nines

December 16, 2006

People discussing reliability often toss around the claim that their particular system meets the “five nines” criterion: most often interpreted as “this system will experience only five minutes of downtime per year”. The use of this metric is an excellent example of the Pinto problem, in that it preys upon people’s weakness when it comes to probability.

Five nines is a useful sales metric. It is not a useful disaster planning metric, in that, even if the claim is accurate, it only measures the average behaviour of the system. Disaster recovery planning can only be done well if we understand both the average case and the worst case scenario.

So, let’s assume that we have a system that meets the five nines criterion. If we let D be a random variable representing that system’s downtime over a year, we can write a simple equation: E(D) = 5.26 minutes (the 0.001% of a year’s roughly 525,960 minutes left over by 99.999% uptime). What does this equation tell us? Only that, if we have a large number of these systems, we should expect their average downtime to be about five minutes.

What we really need to know, though, is what we should expect when a failure does occur. If F is a variable that denotes a failure event, we’d like to know E(D|F), which is rarely given by vendors. In words: once a system has experienced a failure, how long will it take to recover? The chances are quite good that, in a five nines system, it’s going to be much longer than five minutes. Why? The reliability metric E(D), from which the five minutes claim is drawn, represents the average case, which includes many systems in which no failure ever occurs.

Of course, for good disaster planning, we still do need P(F), which is one (incorrect) way to interpret the five nines claim. We would like to know, when selecting a system, which is most likely to experience a failure. But confusing P(F) with E(D), which is what the common interpretation of five nines does, leads to much unnecessary acrimony between customers and vendors, as well as poorly executed disaster planning.

Using five nines as a metric leaves a lot of room for variation in failure characteristics. If an average system fails ten times in a year, but each failure results in only thirty seconds of downtime (think path failover), then a vendor is still justified in claiming five nines. But the same metric can be claimed if, for every hundred systems sold, only one of them experienced a failure event that resulted in eight hours of downtime. Planning for ten path failovers is very different from planning for eight hours of downtime.
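A quick back-of-the-envelope check on those two scenarios, using the numbers above (a sketch only):

    MINUTES_PER_YEAR = 365.25 * 24 * 60
    FIVE_NINES_BUDGET = MINUTES_PER_YEAR * (1 - 0.99999)   # about 5.26 minutes

    # Scenario A: every system fails ten times a year, thirty seconds per failure.
    e_d_a = 10 * 0.5            # E(D): expected downtime per system, in minutes
    e_d_given_f_a = 0.5         # E(D|F): once a failure happens, thirty seconds

    # Scenario B: one system in a hundred has a single eight-hour outage.
    e_d_b = 0.01 * 8 * 60       # E(D): 4.8 minutes per system, on average
    e_d_given_f_b = 8 * 60      # E(D|F): once a failure happens, eight hours

    print(f"five nines budget: {FIVE_NINES_BUDGET:.2f} minutes/year")
    print(f"A: E(D) = {e_d_a:.2f} min, E(D|F) = {e_d_given_f_a:.1f} min")
    print(f"B: E(D) = {e_d_b:.2f} min, E(D|F) = {e_d_given_f_b:.1f} min")

Both fleets come in under the five nines budget on average, but the experience of an actual failure differs by roughly three orders of magnitude.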

One more post before I let the Nyquil kick in and drag me off to bed.

There are two primary reasons why I think the somewhat arbitrary distinction between security and disaster recovery is important.

  1. Organizational Boundaries: whether it is a good idea or not, there are often lines drawn in IT department org charts which boil down to: group A is responsible for security issues, while group B is responsible for storage issues. In a case like that, each group should be aware of the impact that their decisions and designs have on the other group. Of course, a good security team should always be monitoring the gestalt of the data center, but a disaster recovery team should have just as wide a set of responsibilities.
  2. Organizational Goals: to a large extent, security planning involves a starting point of trying to prevent breaches. Disaster recovery professionals need to start from the opposite viewpoint: assume that a disaster is inevitable, examine the impact of that disaster, and examine the costs of mitigating that impact. So for example, if a security risk can result in the corruption or loss of data, a security professional will start by examining ways to avoid that risk. A disaster recovery professional, on the other hand, will start by assuming that the system was breached, and data was potentially corrupted.

Of course, the line is still fuzzy. Security analysis should always assume that any engineered solution will have undiscovered flaws, and seek ways to mitigate the breach of any single component (“defense-in-depth”). Likewise, disaster recovery planning should always consider ways to prevent any particular negative event from escalating into a disaster: e.g. the simple step of adding a UPS to handle building power failures. But a large part of the distinction, as it exists in my mind, rests on the starting points. (Maybe it goes like this: security pros start from the optimistic viewpoint that they can fix the problems, while disaster pros are pessimists who just know that something bad is bound to happen…)

More fear

December 16, 2006

Regarding my last post: I just wanted to draw one more parallel that I think is very important. Schneier talks quite a bit about the politics of fear, and the push to develop “theatrical” security solutions. I see the same sort of dynamic in the disaster recovery field, with aggressive salespeople using fear to sell the latest, greatest, and most expensive solutions.

What is needed in disaster recovery planning is a sensible, sober approach. Schneier’s framework, appropriately tuned for the particular set of problems inherent to computer DR, provides the tools needed to dispassionately measure both the need for disaster recovery and the value of the proposed solutions. Lacking that, all you are left with is a bunch of salespeople shrieking, “But you might lose data!”