Matt Salter, Data Centre Director at Onnec, outlines how sudden demand surges, thermal events, and component lead times force operators to prove resilience in real time, not on paper.
Constructing AI-ready infrastructure is only the first milestone in delivering AI compute. The real test begins once the facility is operational, servers are installed and workloads go live.
Day One focuses on planning and construction: blueprints, power distribution, cooling systems, connectivity and redundancy. These are all measurable elements that make a facility ‘AI-capable’ on paper.
Day Two, however, introduces complexity and unpredictability. Thermal spikes, workload surges, equipment failures and supply chain delays quickly expose the gap between design assumptions and operational reality.
Day Two is when resilience moves from theory to practice. AI workloads are inherently volatile, and stress conditions often emerge only once systems are live. How well a data centre adapts, responds and maintains performance under pressure separates designs that succeed from those that falter.
AI workloads and infrastructure stress
AI workloads behave very differently from traditional enterprise or cloud computing. Dense GPU clusters generate concentrated heat and draw power in sudden surges, sometimes changing markedly within seconds. Industry commentary has increasingly highlighted how these dynamics can strain transformers and upstream electrical infrastructure, creating fluctuations that older data centres were never designed to handle.
Networking interconnects can also become saturated by unpredictable east-west traffic, while even small inefficiencies in cabling, containment or floor layout are amplified under load – creating hotspots and airflow bottlenecks that compromise performance.
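To make that volatility concrete, the short sketch below polls per-GPU power draw and temperature and flags sudden jumps between samples. It is a minimal illustration only, assuming NVIDIA GPUs exposed through the NVML Python bindings (pynvml); the thresholds and poll interval are placeholders for illustration, not operational recommendations.

```python
# A minimal polling sketch, assuming NVIDIA GPUs and the NVML Python bindings
# (pynvml / nvidia-ml-py). Thresholds and poll interval are illustrative only.
import time
import pynvml

POWER_SURGE_W = 150    # sample-to-sample power jump treated as a surge (illustrative)
HOTSPOT_TEMP_C = 85    # temperature at which a hotspot is flagged (illustrative)

def poll_gpus(samples=60, interval_s=1.0):
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        last_power = [None] * count
        for _ in range(samples):
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
                temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
                if last_power[i] is not None and power_w - last_power[i] > POWER_SURGE_W:
                    print(f"GPU {i}: power surged {last_power[i]:.0f}W -> {power_w:.0f}W")
                if tem_ := temp_c >= HOTSPOT_TEMP_C:
                    print(f"GPU {i}: running hot at {temp_c}C")
                last_power[i] = power_w
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    poll_gpus()
```

In practice this kind of telemetry feeds a DCIM or observability platform rather than a console, but the principle is the same: second-by-second visibility of power and thermal behaviour, not periodic manual checks.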
Operating under these conditions is a far greater challenge than building the facility. Thermal events can arise abruptly, and misaligned cooling, power distribution or interconnect capacity can quickly lead to performance degradation or downtime.
Older facilities, designed for lower-density racks and slower-growing workloads, are particularly vulnerable. Even where redundancy exists, the intensity and volatility of AI workloads demand rapid, continuous response, leaving traditional monitoring and manual intervention insufficient.
Legacy infrastructure compounds these risks: many centres can’t support modern interconnect technologies such as InfiniBand, and industry incident analyses frequently link outages to preventable issues in cabling and cooling practices.
In AI-scale environments, engineering decisions on airflow, rack density and cabling quality directly influence whether a facility can maintain performance under sustained, high-intensity workloads.
Supply chains, maintenance and skilled operations
Infrastructure stress is only part of the picture. Supply chain constraints further complicate operations. Critical components such as GPUs, optical modules and cabling often have long lead times, and replacement can take weeks rather than days.
Even minor interruptions can escalate into significant operational issues if spare capacity, inventory management and contingency planning are not in place. According to the Data Centre Cost Index, 80% of operators report delays in manufacturing or delivery of essential equipment.
Shortages extend beyond GPUs; advanced fibre, switches and cabling are all in high demand, with multiple operators competing for the same scarce stock. Without timely access to the right components, even carefully designed facilities can struggle to maintain performance and execute planned upgrades.
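One simple way to turn contingency planning into something measurable is a reorder point for each critical spare, derived from expected failure rate and supplier lead time. The sketch below illustrates the arithmetic; the part names, failure rates and lead times are placeholder assumptions, not real inventory data.

```python
# Illustrative reorder-point calculation for critical spares.
# All figures below are placeholder assumptions, not real data.

def reorder_point(weekly_failure_rate, lead_time_weeks, safety_stock):
    """Stock level at which a replenishment order should be raised."""
    return weekly_failure_rate * lead_time_weeks + safety_stock

spares = {
    # part: (expected failures per week, supplier lead time in weeks, safety stock)
    "optical_module": (4.0, 8, 6),
    "dac_cable":      (2.0, 6, 4),
    "gpu_tray":       (0.5, 16, 2),
}

for part, (rate, lead, safety) in spares.items():
    print(f"{part}: reorder when stock falls to {reorder_point(rate, lead, safety):.0f} units")
```

The longer the lead time, the more stock has to sit on the shelf simply to ride out a normal failure rate, which is why multi-week delivery delays translate directly into operational risk.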
Design choices and long-term resilience
Skills and process only go so far if the design limits operational options. Data centres must be engineered to be resilient and modular from the outset, because early design decisions often determine how effectively teams can deploy, monitor and maintain systems under real-world pressures.
Decisions made during design and construction have lasting operational consequences. Structured cabling, modular mechanical systems, spare power and cooling capacity, and flexible interconnect architectures all reduce the need for costly retrofits. Forward-looking design supports change without unnecessary disruption.
Starting early is vital, particularly where external constraints bear on resilience. Labour shortages, regulatory changes, ESG compliance requirements and regional supply chain bottlenecks can all undermine performance if they are not accounted for at the design stage.
In AI data centres, infrastructure and operations are inseparable: monitoring depth, operational runbooks and proactive planning are as important as the hardware itself. Facilities that embed these principles are better equipped to manage volatility, reduce downtime and maintain reliable performance even under extreme conditions.
Day Two defines long-term success
Building an AI-ready data centre is an achievement; operating one reliably under high-density, dynamic workloads is the true test. Day Two challenges assumptions about power, cooling, networking and staffing, revealing whether a facility can sustain AI workloads continuously.
Success is not measured by capacity on paper but by the ability to maintain uptime, handle surges and adapt in real time.
Where on-site coverage is limited, some operators use third-party on-site support (‘smart hands’) under tightly defined runbooks to execute urgent maintenance and fault isolation. The goal is speed and consistency: shorten time-to-diagnosis, reduce time-to-repair and keep changes controlled when conditions are already stressed.
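Those two measures are easy to track once incidents are logged consistently. The fragment below shows one way to derive mean time-to-diagnosis and time-to-repair from timestamped records; the incident data and log format are hypothetical, included only to show the calculation.

```python
# Sketch of deriving time-to-diagnosis and time-to-repair from an incident log.
# The incident records and timestamp format are hypothetical placeholders.
from datetime import datetime
from statistics import mean

incidents = [
    # (detected, diagnosed, repaired)
    ("2025-01-10T02:14", "2025-01-10T02:41", "2025-01-10T04:05"),
    ("2025-01-18T13:02", "2025-01-18T13:20", "2025-01-18T14:10"),
]

def minutes_between(start, end):
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

ttd = [minutes_between(detected, diagnosed) for detected, diagnosed, _ in incidents]
ttr = [minutes_between(detected, repaired) for detected, _, repaired in incidents]
print(f"mean time-to-diagnosis: {mean(ttd):.0f} min, mean time-to-repair: {mean(ttr):.0f} min")
```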
As AI workloads expand across industries, Day Two operations will determine which facilities can scale, perform and remain resilient. The data centres of the future will integrate infrastructure, monitoring and operational strategy seamlessly, with proactive response embedded into everyday practice.
In the era of accelerated compute, the real test begins once the build is complete; it is on Day Two that long-term reliability is earned.

