The average person reading a description of a High Reliability Organisation (HRO) will probably conclude that a data centre does not quite fit this description.
One of the earliest definitions of an HRO proposed that these organisations, “…operate potentially hazardous technical systems under very demanding conditions, while maintaining a level of performance and safety far above what might be expected …”
Over time, the definition evolved, and now most variations include the consequences of failure. For example, “…enterprises that perform missions involving processes that require extraordinary measures to maintain low risk in the presence of disruptions that could result in catastrophic events (e.g., radiation or well leaks leading to long-term environmental damage) or fatalities (e.g., epidemics, air traffic control).”
This description and its variations are still widely in use, but it came into existence well before the rise of the internet and cloud computing. As cloud computing has evolved and the attendant innovations have unfolded, the mass transfer of the world’s data and data-driven services to the cloud now poses a significant risk.
According to IBM in 2016, approximately 2.5 quintillion bytes (2.5 exabytes) of data are created daily, and 90% of the world’s data were created in the last two years! Data centre owners now have a special responsibility to protect the world’s information. Given what is at risk in the event of a disruption, regardless of severity, data centre organisations should, in our view, join the ranks of HROs such as hazardous chemical processing organisations, nuclear power producers, air traffic controllers, health care providers and so forth.
We believe the catastrophic events that define an HRO should be expanded to include information or, more specifically, data. Using the above definition, a simple modification such as “disruptions that could result in catastrophic events, catastrophic data loss, or fatalities” would suffice.
The ‘so what’ of this change would be a focused effort on the part of the industry to ‘up their game’ and in doing so embrace operational readiness and organisational reliability principles.
Ultimately, HROs are focused on reliable operations, which encompass the capacity to maintain performance during complex, uncertain, and unexpected situations. The cultural norms of HROs are to identify early warning signs and respond immediately to maintain or restore the system; to keep a questioning attitude in order to predict and avoid errors; and to measure performance against targets often grounded in safety criteria.
Data centre operators do all of the above. However, that has not prevented a number of very public incidents on their networks, with cloud-based service providers experiencing publicly reported and reputationally damaging outages – so much so that outage tracking webpages are maintained, including that of Cisco-backed ThousandEyes.
The causes of outages are varied, but typically fall into these broad categories: software, IT network, power, and human error. While some outages reach mainstream news feeds, such as those experienced by Amazon and Facebook in the latter part of 2021, other incidents go unnoticed but are nonetheless impactful and costly.
Use of, and dependence on, data centres to facilitate modern living is not diminishing. The Uptime Institute notes in the summary of their annual outage analysis report that, “The level of investment in new data centres, in an ever-increasing amount of IT capacity, and in new IT services in recent years has dwarfed that of all previous decades.”
Given the data centre growth trajectory (estimated to grow from $59.3 billion in 2020 to $143.7 billion by the end of 2027), ever-increasing demands are being placed on service providers to bring facilities online in ever-more aggressive timeframes. It is little wonder that outages happen.
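To put that trajectory in annual terms, the short sketch below works out the compound annual growth rate implied by the two figures quoted above. It is an illustration only: the function name `cagr` is ours, and treating 2020 to 2027 as seven annual steps is an assumption, since the source of the estimate may count the period differently.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Market size figures quoted in the text, in billions of dollars.
market_2020 = 59.3
market_2027 = 143.7

# Assumption: 2020 to 2027 is treated as seven annual compounding steps.
years = 7

rate = cagr(market_2020, market_2027, years)
print(f"Implied growth: {rate:.1%} per year")
```

On these assumptions the market would need to grow by somewhere between 13% and 14% every year for seven years running, which gives a sense of the pace being asked of the supply chain.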
Interestingly, though, publicly reported outages as published by the Uptime Institute actually fell from a peak of 163 in 2019 to 119 in 2020. At the time of writing, official figures for 2021 were not available, but ‘one swallow does not make a summer’: additional data points are needed to understand whether the dip in 2020 is an anomaly or a trend.
However, one thing is certain: placing ever-increasing demands on a system or industry to consistently deliver good outcomes does not imbue that system with 100% availability, even if the system is ‘five nines’ certified. Therefore, if data centres are to be recognised as HROs, and we feel they should be, all stakeholders need to ensure robust measures are in place to deal with incidents when they do occur.
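The gap between ‘five nines’ and 100% is easy to quantify. The sketch below converts an availability percentage into the downtime it still permits over a year. It is a back-of-envelope illustration: the helper name `allowed_downtime_minutes` is ours, a 365-day year is assumed, and real service agreements may measure availability over different windows.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a system may be down while still meeting the target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


targets = [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]
for label, percent in targets:
    minutes = allowed_downtime_minutes(percent)
    print(f"{label} ({percent}%): about {minutes:.1f} minutes of downtime per year")
```

Even at ‘five nines’ the target leaves room for a little over five minutes of outage a year, so the question is never whether incidents will occur but how well prepared the organisation is when they do.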
The data centre industry has done much to ‘design-in’ excellent levels of availability and, provided the design intent is achieved, those facilities should operate quite successfully. However, let us not forget the impact of the human in any realised design.
While much is made of our collective prowess to engineer our way through particular incidents, research findings point to the vagaries of human behaviour and human error. It is in this context that we feel operational readiness and organisational reliability are key weapons in our arsenal, enabling the successful delivery of the forecast data centre growth through 2027.
But lest we forget, delivering the capacity is but a small fraction of the whole facility lifecycle. Getting to the operational phase is the easy part (mind you, I imagine some in the wider supply chain would argue this point…) whereas keeping a facility in an operationally ready state is where the fun really begins.
We pose one final question here: can the supply chains and wider economy feeding the data centre industry actually deliver the equipment and construction intensity required to achieve the growth and capacity demanded by the burgeoning collective of data ‘creators and users’?
Time will tell…