Building a disaster recovery framework: Step One

January 11, 2007

The first question that Schneier asks in his security framework is:

  • What assets are you trying to protect?

This question maps almost directly into a disaster recovery framework. It is the first step in building a solid understanding of the problems that you are trying to solve, and it is also the step that is most often glossed over. Of course, cataloguing all of the data that is being managed in a data center is not just daunting, but often nearly impossible. But it is very reasonable to expect that a data center administrator already has at least a broad idea of the amount and categories of data and applications that are in play. Arranging that information, and going to the additional trouble of tabulating some of its characteristics, will give a clearer picture of what a disaster recovery plan should address.

There are two very broad categories of assets that should be considered in a data disaster recovery framework: the data itself, and the availability of that data. It is not intuitively obvious to most people why the two should be considered separately, but doing so can result in significant opportunities for cost savings and simplifications. Two different data assets with the same criticality may have very different availability requirements, which may in turn lead to two different disaster recovery mechanisms. (Where do applications fit? My simplistic approach is to treat them as a type of data, with a specified availability requirement, and I haven’t run into a problem with that approach.)

I can’t give a one-size-fits-all answer to asset categorization. Every data center has a different set of users and requirements. I can say this, though: don’t start with the disaster recovery categories. Start with a “natural” categorization, by users, applications, or any other appropriate view. Clusters of these natural categories will lead to a better set of disaster recovery categories.

For each natural category, the following data should be gathered (a sketch of a catalog record capturing these questions follows the list):

  • Who owns this asset?
  • Who uses this asset?
  • What is the direct value of the data?

    By direct value here, I mean: is there a specific monetary value attached to this data? How much did it cost to get it? What would it cost to reconstruct it, if that’s possible? Does it represent actual money, e.g. a customer transaction?

  • What is the monetary penalty for loss of access to that data, stated as an hourly rate?

    An additional question here is: is there a period of time during which loss of access incurs no penalty? This question is not about when the loss occurs, since you can’t plan to ensure that failures only happen during off-hours, but about how often access is absolutely required, i.e. how long scheduled tasks can be deferred without penalty during an outage.

  • What are the workload requirements on this data?

    The question of how much bandwidth is required in a particular category is often missed in a first-pass review, and it is crucial for making decisions about the type of disaster recovery solution that is appropriate. Clearly, the more accurately the workload is specified, e.g. by average bandwidth, average update rate, burst bandwidth, burst updates, unique update rate, etc., the more accurately you can plan, but, again, there are diminishing returns. At a minimum, average bandwidth and update rate, specified as ballpark ranges, are required.

  • What are the indirect values of this data?

    This question is the trickiest one to answer, as it is not always expressible in monetary terms. Consider, for example, student data at a university data center. There is clearly not much monetary value there, but the loss of either availability or data can have drastic impacts, both on individuals and on the data center. Lost theses, delayed homework assignments, frustrated students: all of these are part of the Pinto problem: failure is more enduring than success. In this case, the political implications for the data center should be weighed.

    As a different example, loss of customer transactions during an outage can be considered a penalty based on the direct value of an asset. But even a brief outage can result in a much higher indirect cost: loss of customers. Again, failure is more enduring than success.
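
To make these questions concrete, here is a minimal sketch of what a catalog record for one natural category might look like, including the outage-penalty arithmetic from the access question above. It’s written in Python; the AssetRecord and outage_penalty names, and all of the field names, are my own invention for illustration, not part of any standard inventory tooling.

    from dataclasses import dataclass, field

    @dataclass
    class AssetRecord:
        """One "natural" category of data, as its owners and users see it.

        All field names here are illustrative; adapt them to whatever
        your own inventory tooling already speaks.
        """
        name: str                    # e.g. "student home directories"
        owner: str                   # who owns this asset
        users: list[str] = field(default_factory=list)  # who uses this asset

        # Direct value: acquisition or reconstruction cost, or the
        # money the data itself represents.
        direct_value_usd: float = 0.0

        # Loss-of-access penalty as an hourly rate, plus the grace
        # period during which an outage incurs no penalty.
        penalty_per_hour_usd: float = 0.0
        grace_hours: float = 0.0

        # Workload, in ballpark (low, high) ranges.
        avg_bandwidth_mbps: tuple[float, float] = (0.0, 0.0)
        avg_updates_per_sec: tuple[float, float] = (0.0, 0.0)

        # Indirect value rarely fits a number; keep the notes anyway.
        indirect_value_notes: str = ""

    def outage_penalty(record: AssetRecord, outage_hours: float) -> float:
        """Monetary penalty for an outage: the hourly rate applied only
        to the hours beyond the grace period."""
        billable_hours = max(0.0, outage_hours - record.grace_hours)
        return billable_hours * record.penalty_per_hour_usd

For example, a category whose users can tolerate an overnight outage might carry a grace period of eight hours:

    homedirs = AssetRecord(
        name="student home directories",
        owner="central IT",
        users=["students", "faculty"],
        penalty_per_hour_usd=500.0,
        grace_hours=8.0,  # overnight outages cost nothing
    )
    print(outage_penalty(homedirs, 12.0))  # 4 hours past grace -> 2000.0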

If I go on and on about second-order effects and indirect values, it’s because that’s where I most often see real problems manifest, precisely because that’s where planners have failed to look. Addressing these kinds of problems will be the topic of a later post, but as a teaser: it doesn’t always have to involve significantly more cost, or even a different DR approach. Often, the right solution involves just publicly acknowledging the existence of a technical problem and publicly apologizing. Really. A little humility goes a long way towards erasing failure from the public mind.

If a data center administrator can answer these questions for each of the “natural” categories of assets under their control, they’ve gone a long way towards building a disaster recovery plan. The next step will be to use these natural categories to build a DR matrix, but that’s a subject for a later post.
