With the festive season officially set to begin next week, Chris Wellfair, projects director at Secure I.T. Environments, offers his top tips for ensuring a smooth-running holiday season.
Christmas and New Year are fast approaching, and whether you are on call or not, we all want a quiet time for the data centre, and to avoid any unnecessary drama. So, how do you make that happen and what are the biggest risks?
Over Christmas, when many are away from work, we want to know that the data centre, and everything it represents for the business, is safe. Moreover, if a problem does occur, we want to find out about it quickly, so we can act accordingly.
To achieve this, we need monitoring systems, processes and people. You may think all of this is in place, but when did you last test them, and how many levels of human redundancy do you have? Who has the codes and spare keys, and what is the rota? Are everybody’s contact details correct, with several numbers for those on call? How will you know if the alert systems fail, and how will you manage each disaster that could befall the data centre?
This is the time of year to confirm not just that the processes are written down and accessible, but that they are known to everyone who has a part to play in protecting the data centre. It is time to test.
Just as the military or an airline pilot will go through simulations and pre-flight checklists, so should the IT staff who will be on call over the holiday season. They should know the steps to take for each incident that could occur and, just as importantly, understand the path of escalation if the problem is worse than initially thought or deteriorates beyond their skillset.
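As an illustration only, an escalation path and its contact details can be written down in a form that is easy to check during a drill. The sketch below is a minimal example, not a recommended tool: the roles, the phone numbers (taken from the UK range reserved for fictional use) and the rule of at least two numbers per person are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    numbers: list[str]  # several numbers per person, in order of preference


# Hypothetical escalation chain: first responder first, most senior last.
ESCALATION_CHAIN = [
    Contact("On-call engineer", ["07700 900001", "07700 900002"]),
    Contact("Infrastructure lead", ["07700 900003"]),
    Contact("IT director", ["07700 900004", "07700 900005"]),
]


def next_contact(chain: list[Contact], current_level: int) -> Optional[Contact]:
    """Return the next person to escalate to, or None if the chain is exhausted."""
    next_level = current_level + 1
    if next_level < len(chain):
        return chain[next_level]
    return None


def chain_gaps(chain: list[Contact], minimum_numbers: int = 2) -> list[str]:
    """List the people who have fewer contact numbers than the drill requires."""
    return [c.name for c in chain if len(c.numbers) < minimum_numbers]


if __name__ == "__main__":
    print("Escalate from level 0 to:", next_contact(ESCALATION_CHAIN, 0).name)
    print("Needs more contact numbers:", chain_gaps(ESCALATION_CHAIN))
```

Running a check like this before the break shows at a glance who is only reachable on a single number and where the chain runs out.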
You no doubt test the failover between individual servers and clusters already to ensure the data centre continues to fulfil all services in the event of a crash or hardware failure, but what about when a power failure occurs? Is everything that should happen when the data centre loses power happening? Is it switching over to generators, informing staff and shutting down anything but the most critical of servers to conserve power? These kinds of tests should be happening regularly and should also be run for connectivity.
Maintenance
Maintenance checks should be running on the data centre throughout the year, at a frequency that depends on the type of equipment, the manufacturers’ recommendations and your own policies. If in place, this regime will reduce the chance of unexpected problems, but it is still wise to run a set of pre-holiday maintenance tasks. Some companies simply stagger standard checks in such a way that a set happens in late November, for example. But if, by December 15, it has been three months since any maintenance tasks took place, then you are leaving the door wide open for the Gremlins to walk in and get to work. And if you’ve not seen the movie – you should!
Tasks should include checking for rodents and potential leaks, and inspecting cabling, generator fuel and oil, UPS batteries, coolant and filters. There will no doubt be other tasks depending on the nature of your infrastructure – the important thing is to assess what they should be and get them done in good time.
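To make the "three months since any maintenance" risk concrete, here is a small sketch that flags checks last completed too long before the holiday begins. Everything in it is hypothetical: the dates, the task names (borrowed from the list above) and the 30-day limit are assumptions for the example, and your own policies should set the real values.

```python
from datetime import date, timedelta

# Hypothetical record of when each check was last completed.
LAST_COMPLETED = {
    "rodent inspection": date(2023, 11, 28),
    "cabling": date(2023, 9, 15),
    "leak check": date(2023, 12, 1),
    "generator fuel and oil": date(2023, 11, 20),
    "UPS batteries": date(2023, 9, 15),
    "coolant": date(2023, 12, 1),
    "filters": date(2023, 10, 30),
}


def overdue_tasks(last_completed, holiday_start, max_age_days=30):
    """Return the tasks last done more than max_age_days before holiday_start."""
    cutoff = holiday_start - timedelta(days=max_age_days)
    return sorted(task for task, done in last_completed.items() if done < cutoff)


if __name__ == "__main__":
    for task in overdue_tasks(LAST_COMPLETED, holiday_start=date(2023, 12, 15)):
        print("Needs a pre-holiday check:", task)
```

With the sample data, the checks last done in September and October are the ones flagged, which is exactly the gap that staggered schedules can leave.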
Monitoring – and not just the bad news
Think about the monitoring systems you use, and more specifically how they are configured. Now is not the time to put a new system in, but it is important to check what it is monitoring and the conditions and parameters that will trigger automatic actions or alerts to staff. Are they tight enough, or in place at all? Use this time to fully assess them against your processes and IT ‘red list’ of problems. Check alerts are going to the right people – it is more common than most would like to admit that someone who left the company two years ago is still in the monitoring software.
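One way to catch the former colleague who is still on the list is to compare alert recipients against the current staff list. The sketch below assumes you can export both as simple lists; the alert names and the example.com addresses are invented for illustration, and no particular monitoring product's export format is implied.

```python
# Hypothetical export of alert recipients from the monitoring software.
ALERT_RECIPIENTS = {
    "high temperature": ["alice@example.com", "bob@example.com"],
    "UPS on battery": ["carol@example.com"],
    "water leak": ["Bob@example.com"],
}

# Hypothetical list of current staff addresses.
CURRENT_STAFF = ["alice@example.com", "carol@example.com"]


def stale_recipients(alert_recipients, current_staff):
    """Map each alert to the recipients who are not on the current staff list."""
    current = {address.lower() for address in current_staff}
    stale = {}
    for alert, recipients in alert_recipients.items():
        missing = [r for r in recipients if r.lower() not in current]
        if missing:
            stale[alert] = missing
    return stale


def unattended_alerts(alert_recipients, current_staff):
    """Return the alerts that would reach no current member of staff at all."""
    current = {address.lower() for address in current_staff}
    return sorted(
        alert
        for alert, recipients in alert_recipients.items()
        if not any(r.lower() in current for r in recipients)
    )


if __name__ == "__main__":
    print("Stale recipients:", stale_recipients(ALERT_RECIPIENTS, CURRENT_STAFF))
    print("Alerts nobody will see:", unattended_alerts(ALERT_RECIPIENTS, CURRENT_STAFF))
```

The second check matters most: an alert with one stale address among several is untidy, but an alert that reaches nobody is a silent failure.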
Alternatively, sign up to an alert service centre, which is a cost-effective way of knowing exactly what is happening 24 hours a day, seven days a week, 365 days of the year. Alarm conditions from your monitoring devices can be relayed by email to the alert centre servers, which then automatically invoke the alert notification escalation procedure that has been implemented, relaying a voice message to say that an alarm condition has occurred.
Finally, get your monitoring software to bring you good news too. It is better to get a daily report and know all is well than to be left wondering because your system is only configured to send alerts with bad news. Silence spreads fear, and you’ll just worry about whether the data centre has disappeared down a sinkhole.
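A rough sketch of the "good news" idea follows: build a short daily report even when nothing is wrong, and treat a missing report as a warning in its own right. The reading names, the limits and the 25-hour allowance are assumptions chosen for the example, not recommended values, and how the report is delivered is left to whatever your monitoring system already supports.

```python
from datetime import datetime, timedelta


def build_daily_report(readings, limits, now):
    """Summarise every reading, good or bad, so that silence never means 'unknown'."""
    lines = [f"Data centre daily report, {now:%Y-%m-%d %H:%M}"]
    problems = 0
    for name, value in sorted(readings.items()):
        low, high = limits[name]
        in_range = low <= value <= high
        if not in_range:
            problems += 1
        lines.append(f"{name}: {value} ({'OK' if in_range else 'OUT OF RANGE'})")
    if problems == 0:
        lines.append("Status: all clear")
    else:
        lines.append(f"Status: {problems} reading(s) need attention")
    return "\n".join(lines)


def report_is_overdue(last_report_at, now, max_gap=timedelta(hours=25)):
    """True if the daily report has not arrived within the allowed gap."""
    return now - last_report_at > max_gap


if __name__ == "__main__":
    # Hypothetical readings and limits.
    readings = {"inlet_temp_c": 22.5, "humidity_pct": 45}
    limits = {"inlet_temp_c": (18, 27), "humidity_pct": (20, 80)}
    print(build_daily_report(readings, limits, datetime.now()))
```

The overdue check belongs on the receiving side, away from the system being monitored, so that a failed alert system shows up as a missing report.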
Have a great break over the holidays and, if you are on call or working, I hope that you get no unwelcome surprises.