5 Disaster Recovery ‘hidden surprises’

Amy Hawthorne, VP, global marketing at iland, outlines five things you should be mindful of when it comes to ensuring your Disaster Recovery plan isn’t, well… a disaster.

In the world of Disaster-Recovery-as-a-Service (DRaaS), most of us evaluate providers and solutions on the factors we consider important, such as failover speed, cost-effectiveness and geographic diversity.

But there are other things you need to be mindful of – little surprises waiting in the wings that, if left unchecked, could cause your organisation significant disruption and hinder any effective business continuity strategy.

Surprise #1: Did someone forget the network engineer?

A lot of DR managers tend to focus on replication, but the big surprise often happens when they’re ready to actually deploy and the network engineer is brought along for the very first time.

Replication and hypervisor selection are fairly automated processes, but the DR staff really need insight from the network engineering team to make sure they are ready.

Specific aspects to consider are:

How are end-users going to communicate with the DR site?
Will they need to set up a site-to-site VPN tunnel?
Is there the potential to drop an additional leg or MPLS cloud off the DR site? Or will it just be internet-facing services, such as a website and email server, that are open to the internet at the DR site when they fail over?

This last scenario concerns only public services, and it needs a planned DNS record update. Around 50% of organisations run their DNS services through a third party, which means they have a website they go to and update. Nice and simple.

However, organisations that run their own DNS still need the help of an engineer to work out whether they should fail over those DNS servers, go to their domain registrar to make the updates, or have their DR provider host DNS for them.
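As a rough illustration of the DNS step, the check below confirms that each public hostname resolves to the address expected at the DR site after failover. It is a minimal sketch, not a provider tool: the hostnames, the addresses and the function name `find_stale_records` are all hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without touching real DNS.

```python
import socket
from typing import Callable, Dict, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def find_stale_records(
    expected_dr_ips: Dict[str, str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> Dict[str, Set[str]]:
    """Return hostnames that do not yet resolve to their DR address.

    expected_dr_ips maps each public hostname to the address it should
    have once failover is complete (hypothetical values in this sketch).
    """
    stale = {}
    for hostname, dr_ip in expected_dr_ips.items():
        try:
            answers = resolver(hostname)
        except OSError:
            answers = set()  # Treat a lookup failure as "not pointing at DR".
        if dr_ip not in answers:
            stale[hostname] = answers
    return stale
```

Cached answers can linger until a record's TTL expires, so a clean result from one resolver does not prove that every end-user already sees the DR address.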

Things can get more complex, however, if you’re planning a partial failover rather than a full failover. By a full failover we mean a DR plan that says, ‘in the event of a problem in production, fail everything over to the DR site’.

In theory, that’s a straightforward DR plan, but customers are increasingly looking for a solution where they can fail over just one application instead of failing over the whole data centre. They may have mission-critical applications and want to fail over just that component while still enabling their end-users to communicate.

This is when the network engineer becomes critical. Working with your DR provider will help to make sure there’s a solution in place so that communication flows seamlessly to the DR site once that application has failed over.

In the scenario where the end-user has failed over one application, the DR provider would work through the plan with their networking team to make sure that production continues to operate normally, and the application that has moved over can communicate with production and with the end-users.
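One way to prepare for that conversation with the networking team is to list which application-to-application connections would cross between sites after a partial failover. The sketch below is illustrative only: the application names, the dependency map and the function `cross_site_flows` are hypothetical, and a real plan would also have to cover end-user access, ports and firewall rules.

```python
from typing import Dict, List, Set, Tuple


def cross_site_flows(
    dependencies: Dict[str, Set[str]],
    failed_over: Set[str],
) -> List[Tuple[str, str]]:
    """List (source, destination) pairs that span production and DR.

    dependencies maps each application to the applications it calls.
    failed_over is the set of applications now running at the DR site.
    """
    flows = []
    for source, targets in dependencies.items():
        for target in targets:
            # A flow crosses sites when exactly one end has failed over.
            if (source in failed_over) != (target in failed_over):
                flows.append((source, target))
    return sorted(flows)


if __name__ == "__main__":
    # Hypothetical dependency map for illustration.
    deps = {
        "web": {"orders", "auth"},
        "orders": {"database"},
        "auth": {"database"},
    }
    for source, target in cross_site_flows(deps, failed_over={"orders"}):
        print(f"{source} -> {target} must traverse the production-DR link")
```

With a full failover the list is empty, because every application sits on the same side; it is the partial case that creates traffic the network has to carry between the two sites.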

The main advantage of a partial failover is that you’re not failing over all your data centre but only the one application that’s been impacted. And when you’re ready to fail it back to the production site, you can stage, plan and test the fail back to make sure it’s ready to come back nice and neat during a maintenance window where nobody is impacted. So instead of taking two downtime windows, or two downtime events, you only have one and then you have a graceful failback.

Surprise #2: Wait…what’s that box in the corner?

Gartner claims most organisations are at least 75% virtualised but what’s happening with the remainder? Not being 100% virtualised means there is a physical system somewhere in your organisation.

Most of the replication that people talk about in disaster recovery services is hypervisor-based and covers virtual systems, but there are usually a handful of physical systems in the corner, and for good reason.

Firstly, some systems can’t be virtualised without massive upgrades or rewrites, and organisations often aren’t comfortable investing in these or cannot prioritise them.

Secondly, these systems are sometimes bound by a licensing problem that keeps the technology on a physical server. We’ve seen that over and over again with the largest corporate vendors, where they refuse to license their database or other software to operate in a virtual environment. As a result, those workloads stay in the physical space for a while.

And finally, these systems are too important to simply be shut down. Most organisations leave these systems on – but don’t enjoy managing half a dozen old machines.

There’s usually a good reason for this: they are used to run critical systems. The irony is, when you ignore physical systems in your DR plan, you’re ignoring some of those critical systems in your environment which should be included in your DR plan.

So, you will need to make sure the network on the DR side is able to bring these physical and virtual systems together. The last thing you want when you’re failing over on the DR side is a situation where you have production set up one way but, on the DR side, everything is slightly different.

Especially in the case of a natural disaster, you probably won’t have all your notes and diagrams, and you may not have the binder that holds your DR plan and everything about your environment. So when you bring physical workloads into the DR plan, it’s critical to build it so that everything is similar, if not identical, to production – and that includes the network.
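A simple way to catch ‘slightly different’ before a disaster does is to compare a description of the production network with the same description of the DR network. The following is a minimal sketch under assumed inputs: the setting names and values are invented for illustration, and `network_drift` is a hypothetical helper rather than part of any DR product.

```python
from typing import Dict, Tuple

MISSING = "<missing>"


def network_drift(
    production: Dict[str, str],
    dr: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Return settings whose values differ between production and DR.

    Each result maps a setting name to (production value, DR value);
    a setting absent on one side is reported as MISSING.
    """
    drift = {}
    for key in sorted(set(production) | set(dr)):
        prod_value = production.get(key, MISSING)
        dr_value = dr.get(key, MISSING)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift
```

Running a comparison like this after each DR test, with the physical machines included in the inventory, helps keep the two sides from diverging unnoticed.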

Surprise #3: The auditors are coming! The auditors are coming!

The auditors are a remarkably diligent group of people who will want to know what happens to your workloads in the cloud.

If for any reason you fail over your systems to the cloud, once or multiple times, your systems will be hanging out in that cloud for a while. That period of time is not a get-out-of-audit-free pass; it is actually an opportunity to provide additional paperwork!

Your auditor is going to want to ensure that all the security technologies that you have carefully put in place on premises were equally carefully put in place in the cloud. So how will the auditors know you’ve done it? They won’t just take your word for it; they’ll need documentation to prove that your security tools were in place and running.

Auditors are paid by the hour, so if they come knocking at your door, you won’t want your cloud vendor to take 48 hours to research and supply the documentation while the auditors sit there on the clock, costing you money. To prove that everything was up and running for that operation, you need easy access to all the relevant reports.

Surprise #4: Where did you put that panic button?

The thing with emergencies is that they don’t tend to happen on a schedule. It’s very straightforward to test your DR system any time you want, but actual disasters, whether large or small in scale, are usually unpredictable.

So while you may have 12 hours to make a decision on a hurricane, you might not have that time when a faulty patch goes through. One of the problems, particularly with large-scale disasters, is that you find yourself handling both your professional and personal responsibilities.

If there’s going to be a major weather event or earthquake, you and most of your staff are probably worried about your families, your own personal house and your pets… and probably second to that, you’re worried about the IT systems under your care and that’s certainly the right set of priorities. However, what that means is that you find yourself needing to press that big red panic button but not being anywhere near it.

There’s no guarantee that everybody involved in this failover is actually going to be anywhere near a professional environment, and in fact, you wouldn’t want them to be! In this situation, there should always be a backup plan.

(Pleasant) Surprise #5: What else could I use this for?

The nice surprise is that, by managing your environment from your mobile, you can also set up alerts around billing, performance and also security, so you can have the same level of manageability you would have on-prem, and not be hit with any nasty surprises.

Believe me, if you get to grips with these hidden surprises, you will have a handle on the critical aspects of a DR scenario and be confident when you have to fail over your systems.