The terrible devastation to homes and businesses in Southern Alberta from flooding is a potent reminder that putting off planning for the worst will eventually carry a price. Disaster Recovery and Business Continuity Planning are everywhere as phrases, and on everyone’s to-do list, but as practices they are often misunderstood.
There are a few areas that you should start to define when you are thinking about your Information Technology and Services Disaster Recovery Plan (DRP):
Firstly, you will likely want to set a scope for what your DRP will cover – there are a million “what-ifs” that could be explored but it’s typically wiser to plan for fire and flood before looking at what a Martian invasion could mean to your business. Once a scope and some severities are defined, you can move on to prioritizing which systems will be needed most and what you are able to survive with during an interim period. Finally, you will want to define how the restore will happen and then simulate a given disaster scenario. This process will be iterative and permanent. As you add new systems, new locations, and new people you must continually test and refine your DRP. These components of planning will bridge into a larger Business Continuity Plan as your DRP matures and the scope of the plan develops. As technology becomes more intertwined with business operations, it is increasingly difficult to separate being able to do business from making sure IT services are online when you need them.
Typical initial planning on a DRP will focus solely on the redundancy of the server and network systems that the business currently uses. This planning is extremely important and should definitely be a high priority but this is not, in itself, a Disaster Recovery Plan. A server or component in the IT environment failing is just one possible disaster. Having redundant servers, a backup device, and two Internet connections is little help when water rushes into the room that they are all stored in. This is where defining “disaster” begins. Flood, fire, theft (whether theft of physical IT components or theft of data), and electrical failure are common starting points and can serve as a mental exercise in beginning to understand what role IT plays in the business and what could potentially fail. Doing the work required to define a recovery from a fire will mirror a lot of the work needed for any disaster that destroys your primary server location.
Once you’ve picked your list of disasters you should try to assign a likelihood and severity to each. If your servers are running offsite at a location with fire suppression and secured access you might want to rate fire and theft as lower risk but if this same location has communications equipment in the basement, flood may be of a higher likelihood and severity. Verizon ran much of its fibre optic network in the basements of Manhattan towers and was severely impacted during Hurricane Sandy. This is an example of the thorough, creative type of thinking needed when considering disaster likelihoods. A business or server room on the fourth floor of a tower can still be exposed to flood risk whether that is because of water cutting off power below or because a water main bursts on the fifth floor above.
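One simple way to keep this rating exercise consistent is to score each disaster and sort the list. The sketch below is only an illustration: the 1-to-5 scales, the likelihood-times-severity score, and the sample ratings are all assumptions, not figures from any real assessment, and your own team should choose the scale and the numbers.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (expected)
    severity: int    # assumed scale: 1 (minor) to 5 (business-stopping)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by severity.
        return self.likelihood * self.severity


def rank_risks(risks):
    """Return the risks ordered from highest score to lowest."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical ratings for an offsite location with fire suppression and
# secured access, but with communications equipment in the basement.
risks = [
    Risk("fire", likelihood=1, severity=5),
    Risk("theft", likelihood=1, severity=3),
    Risk("flood", likelihood=3, severity=5),
    Risk("electrical failure", likelihood=3, severity=2),
]

for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.score}")
```

With these made-up ratings, flood comes out on top, which matches the intuition above about equipment in a basement. The value of the exercise is less in the exact numbers than in forcing a conversation about why each one was chosen.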
With a list of potential disasters and attached likelihoods and severities, you can start to prioritize the availability of your various IT services. Many businesses rely much more heavily on e-mail to function than they realize and will want to make this a key recovery deliverable within hours of a disaster. Being able to communicate with team members, vendors, and customers is even more critical during a disaster. This is a risk that can be mitigated in advance by using services that route mail through offsite servers that can be accessed using a web browser when your primary e-mail server is offline. After e-mail is restored, it is likely you will want to look at what is required to restore data and access to your central line-of-business application or your ERP system. Each firm will have unique requirements, but it is much better to agree across business functions in advance on which services are needed most than to have that argument in the chaos after a disaster, when human and financial capital will already be stretched.
Getting this far is a big step, but the most important thing of all is to continually test and refine your DRP. A simple fire simulation can expose key weaknesses in your plan and helps to prepare your key DRP stakeholders for the role they will play in any future recovery effort. Organizations have discovered some very basic but critical flaws during simulations: the DRP itself was a Word document stored on the servers that were destroyed, or all the cell phone numbers for recovery team members were stored on a mail server that was no longer online. It’s a lot easier to laugh at these mistakes when you find them in the aftermath of a simulated fire and not the real thing.
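The flaws described above share one pattern: something the recovery depends on was stored only in a place the disaster destroyed. A tabletop simulation can check for this before any real exercise. The following is a minimal sketch under stated assumptions: the asset names, the location names, and the function `unrecoverable_assets` are hypothetical and exist only for illustration.

```python
def unrecoverable_assets(asset_locations, destroyed):
    """Return the names of assets that have no copy outside the destroyed locations."""
    destroyed = set(destroyed)
    return sorted(
        name
        for name, locations in asset_locations.items()
        if not set(locations) - destroyed  # no surviving copy anywhere
    )


# Hypothetical inventory: where each recovery-critical item is stored.
assets = {
    "DRP document": ["primary file server"],
    "recovery team phone list": ["mail server"],
    "nightly backups": ["primary file server", "offsite backup provider"],
}

# Simulated scenario: a fire destroys everything in the server room.
fire_in_server_room = ["primary file server", "mail server"]

print(unrecoverable_assets(assets, fire_in_server_room))
```

In this made-up inventory the plan document and the phone list are both flagged, while the backups survive because one copy lives offsite. Running the same check against each disaster in your scope is a cheap way to find these gaps on paper.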
For any operation that doesn’t yet have a DRP, this can all seem like a lot of work, but the most important step in any plan is to just get started. Talk to your IT team or service provider, plan even just a single disaster scenario, and take action on some of the deliverables you discover. Offsite backups or failover cloud services can be small expenses when compared to the risk of your business being offline for days or weeks.