Ross Hamilton, Principal SAP Technical Consultant at Absoft, explains the importance of having robust disaster recovery strategies.
In IT and cloud computing, system crashes and outages can have profound impacts on businesses of all sizes, all across the world. Systems that usually go unnoticed, working behind the scenes, can be brought down at the click of a button, bringing entire industries to their knees.
The recent outage involving CrowdStrike, a leading cybersecurity firm, highlighted these vulnerabilities when an update to its Microsoft Windows antivirus software caused widespread system failures at multiple companies. The airline Delta alone cancelled 6,000 flights, grounding 500,000 passengers, with the whole event costing the airline $500 million. As the dust settles on this outage and businesses begin to recover, it is essential to glean key lessons and best practices for IT outage management and disaster recovery.
The outage and initial response
The CrowdStrike outage was triggered by an update to its antivirus software. The update made affected machines unstable, causing them to crash with what is commonly known as the ‘blue screen of death’ and then enter a restart loop. This disruption affected numerous systems, including SAP, posing a significant challenge for IT teams worldwide. Adding to the problem, the global and simultaneous rollout of the update made downtime unavoidable.
Specialist IT teams played a crucial role in mitigating the outage’s impact. For those affected, most monitoring systems detected the issue quickly, prompting a swift response, but not before it had the chance to wreak havoc in areas such as airports and healthcare. Recovery, however, was not an easy task: each system required manual intervention, which proved challenging, especially for hidden or inaccessible machines.
Following this, rigorous checks were conducted to ensure stability and functionality. This proactive approach highlights the importance of readiness, testing updates in an offline environment, effective monitoring, and the ability to respond swiftly to unexpected challenges.
Best practices and lessons learned
One critical takeaway from this outage is the importance of pre-deployment testing. Updates, whether security-related or application-specific, should first be rolled out in a dedicated test environment. Bespoke solutions can ensure that operating system (OS) security updates are initially deployed in test environments, which helps to prevent similar issues from affecting business production environments. Organisations must also ensure that their testing processes are comprehensive and that test environments mirror their production systems as closely as possible. Where they do not, testing will be inadequate and unforeseen issues can surface during live rollouts.
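As an illustration of the test-first principle, the sketch below shows one way a rollout could be gated so that an update only reaches production after the test environment has passed its checks. It is a minimal example under stated assumptions, not a description of how any vendor or tool actually deploys updates: the `Ring` class, the `staged_rollout` function, the host names, and the `apply_update` and `is_healthy` callbacks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Ring:
    """A hypothetical deployment ring: hosts that receive an update together."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: Iterable[Ring],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply an update ring by ring, stopping at the first ring that fails.

    Returns the names of the rings that were updated and passed their checks.
    """
    completed: List[str] = []
    for ring in rings:
        for host in ring.hosts:
            apply_update(host)
        # Gate: every host in this ring must pass its health check
        # before the update is allowed to reach the next ring.
        if not all(is_healthy(host) for host in ring.hosts):
            break
        completed.append(ring.name)
    return completed


# Example: the test ring passes, the pilot ring fails, production is never touched.
rings = [
    Ring("test", ["test-01"]),
    Ring("pilot", ["app-01"]),
    Ring("production", ["app-02", "app-03"]),
]
updated_hosts: List[str] = []
passed = staged_rollout(rings, updated_hosts.append, lambda host: host != "app-01")
print(passed, updated_hosts)
```

In this example the failure in the pilot ring halts the rollout, so the production hosts never receive the faulty update. In practice the health check would be whatever the organisation already trusts to confirm a system is stable, such as a successful reboot and a passing monitoring probe.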
A robust disaster recovery (DR) plan is also essential for any business relying on IT systems for automated daily operations. Most large companies have such plans, complete with procedures, checks, and regular audits. While a DR plan may not have been directly applicable in the CrowdStrike scenario, it is crucial for overall preparedness. Companies should review and test their DR plans at least annually, and especially after significant infrastructure or application changes. Regular testing and updates to DR plans also help to ensure that businesses are prepared for various scenarios, including those that may not be anticipated.
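The review rule described above, at least annually and again after any significant change, is simple enough to automate as a reminder. The following is a small sketch of that rule only. The function name `dr_review_due`, its parameters, and the 365-day interval used to represent "annually" are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "at least annually" is modelled as a 365-day interval.
REVIEW_INTERVAL = timedelta(days=365)


def dr_review_due(
    last_review: date,
    today: date,
    last_significant_change: Optional[date] = None,
) -> bool:
    """Return True if the DR plan should be reviewed and tested again."""
    # A significant infrastructure or application change made after the
    # last review triggers a new review regardless of the calendar.
    if last_significant_change is not None and last_significant_change > last_review:
        return True
    # Otherwise the plan is due once a full review interval has elapsed.
    return today - last_review >= REVIEW_INTERVAL
```

For instance, a plan last reviewed in January would not be due in June of the same year, but would become due immediately if a major migration took place in March.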
The CrowdStrike outage also underscored the need for multi-layered security. Organisations should avoid relying solely on a single security mechanism. A multi-layered approach can significantly mitigate the risks associated with routine updates and ensure greater system stability. Redundancy and diversity in security measures can prevent a single point of failure from causing widespread disruption.
The value of expert IT support teams
The broader implications of the CrowdStrike outage for the IT industry are significant, with some estimates suggesting it could end up costing around $1 billion in damages and lost revenue. The outage reinforces the necessity of accurate testing environments and rigorous testing protocols. Continuous improvement in industry standards and practices is also essential to better mitigate the risks associated with routine updates.
What is more, companies that were affected by the CrowdStrike outage included those diligently updating their security systems, demonstrating the complexity of maintaining IT security effectively. The outage suggests a need for a more cautious approach, even when implementing updates designed to enhance security.
Expert support teams play a vital role in managing and recovering from IT outages. A proactive support team can swiftly identify affected machines and perform necessary fixes, minimising user disruption. The efficiency and readiness of support teams are critical in such scenarios. Companies relying heavily on offshoring, for example, may face longer recovery times due to logistical challenges, underscoring the need for localised, hands-on support. Having a support team that can respond quickly and efficiently, regardless of geographic location, is crucial.
Selecting the right support partner is also critical for effective IT disruption management. Businesses should seek partners who know their business well, are proactive and security-focused, and are willing to critically evaluate and improve their clients’ processes. Such partners ensure efficient outage handling and help prevent future occurrences. The right partner will not only provide immediate assistance during crises but also work proactively to enhance the overall IT infrastructure and processes across the board.
Be prepared
The CrowdStrike outage offers valuable lessons for IT professionals and organisations. Thorough pre-deployment testing, robust disaster recovery planning, and the presence of expert support teams are critical components of effective IT management. By incorporating these best practices, companies can enhance their resilience against similar outages in the future, ensuring greater stability and reliability in their IT operations.
Continuous improvement and vigilance are also key to maintaining robust and secure IT systems. By learning from the CrowdStrike outage as well as other such disruptive incidents, organisations can better prepare for and mitigate the impacts of future events, whatever they may be.