How to build a maintenance strategy that protects uptime

With power, cooling and operational complexity putting pressure on data centre teams, Mike Slevin, Director of EMEA Market at Fluke Corporation, explains why maintenance needs to become a strategic discipline.

Data centre operators are caught in a tightening vice: uptime commitments are climbing, infrastructure is growing more complex, and resourcing remains lean. In that environment, teams cannot afford to treat maintenance as a loose collection of routine checks.

The conversation has shifted. Maintenance strategy has moved from a back-office technical function to a core part of facility resilience. For modern operators, the real test is whether maintenance effort is aimed in the right places and backed by clear operational control across the site.

If strategy dictates outcomes, the first operational decision a team must make is also one of the most important: out of thousands of components, what actually deserves the most attention?

Start with asset criticality

The first step is to identify the assets that matter most. A maintenance programme only becomes coherent when it reflects the consequence of failure and the asset’s role in service continuity. NFPA 70B, the US standard for electrical equipment maintenance, now makes that link explicit by tying maintenance frequency to equipment criticality and condition.

In a data centre, that usually pushes electrical distribution, UPS-related infrastructure, switchgear and the cooling chain to the top of the hierarchy, along with the pumps and motors that keep that thermal environment stable. These assets do not all carry the same consequence when performance starts to slip. Some faults will stay local for a time. Others can escalate quickly into a wider operational problem. That difference should shape how much maintenance attention each asset receives.
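One way to make that hierarchy explicit is a simple scoring pass over the asset register. The sketch below is illustrative only: the two 1 to 5 scales, the product score and the example assets are assumptions made for the sake of the example, not values drawn from NFPA 70B or any other standard.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    consequence: int  # hypothetical scale: 1 = fault stays local, 5 = site-wide outage
    escalation: int   # hypothetical scale: 1 = slow to spread, 5 = escalates quickly

    @property
    def criticality(self) -> int:
        # Simple product score; a real programme may weight these factors differently.
        return self.consequence * self.escalation


def rank_assets(assets):
    """Return assets ordered from most to least critical."""
    return sorted(assets, key=lambda a: a.criticality, reverse=True)


# Example register entries (invented for illustration).
assets = [
    Asset("Office AHU", consequence=1, escalation=1),
    Asset("Chilled-water pump 2", consequence=4, escalation=3),
    Asset("UPS string A", consequence=5, escalation=5),
]

for asset in rank_assets(assets):
    print(f"{asset.name}: {asset.criticality}")
```

The value of a ranking like this lies less in the numbers than in forcing the team to agree, asset by asset, how far a fault is likely to spread.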

The stakes are clear. Uptime Institute’s 2025 outage analysis reports that power issues remain the most common cause of serious and severe data centre outages, and that more than half of operators said their most recent significant outage cost more than $100,000.

Once the maintenance hierarchy is established, the next question becomes practical: which assets still justify fixed intervals, and which need a condition-based approach instead?

Match the method to the asset

How should different asset classes be maintained? A mature programme does not rely on a single method across the estate. Some tasks still sit within a preventive maintenance regime, with work carried out at fixed time or runtime intervals. Others are better suited to condition-based or predictive maintenance, especially where deterioration produces signals that can be tracked before failure.

That distinction matters because assets fail differently in service. Electrical equipment may justify routine inspection and testing at set intervals, particularly where compliance and safety sit in the background. Rotating assets, thermally stressed systems and equipment in harder-to-access areas often benefit from closer condition tracking because changes in vibration, temperature or load can provide earlier warning that performance is drifting. As a recent ARPA-E briefing notes, predictive maintenance can serve as a supplement to, or alternative to, redundancy by giving teams forewarning of when maintenance is needed.

What matters is matching the maintenance method to the asset and the failure mode, with some sense of how much warning you are likely to get. That only works, though, if commissioning, baseline data and test intervals were set properly from the start.
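The matching logic above can be written down as a small decision rule. This is a minimal sketch under stated assumptions: the function name, its parameters and the default margin of two inspections per warning window are all hypothetical choices, not an industry rule.

```python
def choose_method(has_trackable_signal: bool,
                  warning_days: float,
                  inspection_interval_days: float,
                  compliance_interval_required: bool = False,
                  margin: float = 2.0) -> str:
    """Suggest a maintenance method for one asset and failure mode.

    margin is an assumed safety factor: the expected warning window must
    cover at least `margin` inspection intervals before condition-based
    maintenance is suggested.
    """
    if compliance_interval_required:
        # Compliance or safety testing keeps its fixed schedule regardless.
        return "fixed-interval"
    if has_trackable_signal and warning_days >= margin * inspection_interval_days:
        return "condition-based"
    # No usable signal, or too little warning to catch drift between checks.
    return "fixed-interval"


# A pump whose vibration trend gives roughly 60 days of warning, checked every 14 days.
print(choose_method(True, warning_days=60, inspection_interval_days=14))
```

With only 10 days of expected warning and the same 14-day route, the rule falls back to fixed intervals, because a check could easily miss the whole warning window.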

Start with a sound baseline

Maintenance begins with what gets verified at commissioning and how that information is recorded and handed over. Operators need more than a functional sign-off; they need usable baseline data, verified records and a clear picture of how key systems are meant to perform.

That handover becomes critical later, when teams must decide whether an asset is behaving as expected or starting to move away from baseline. ASHRAE Guideline 1.4 addresses this directly, advising that a proper systems manual should bring together testing and training documentation, operational requirements, maintenance schedules, verified record drawings, and sequences of operation. Crucially, the manual should also provide this data in a format ready for insertion into a CMMS.
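To show why the commissioning baseline matters, here is a hedged sketch of the comparison a team might run on each new reading. The 10% and 25% thresholds and the labels "normal", "watch" and "act" are assumptions for illustration; real limits depend on the asset and on the manufacturer's guidance.

```python
def drift_percent(baseline: float, reading: float) -> float:
    """Signed deviation of a reading from its commissioning baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (reading - baseline) / abs(baseline) * 100.0


def classify(baseline: float, reading: float,
             watch_pct: float = 10.0, act_pct: float = 25.0) -> str:
    """Map a reading to an assumed three-level status based on drift from baseline."""
    drift = abs(drift_percent(baseline, reading))
    if drift >= act_pct:
        return "act"
    if drift >= watch_pct:
        return "watch"
    return "normal"


# Example: motor vibration recorded as 2.0 mm/s at commissioning (invented figure).
for reading in (2.1, 2.3, 2.6):
    print(reading, classify(2.0, reading))
```

Without a verified baseline value to pass in, the same readings cannot be classified at all, which is the practical cost of a weak handover.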

If those foundations are weak, later interval-setting, trending and fault interpretation all rest on unreliable ground. But once a solid baseline is in place, the next question is how to inspect critical assets safely and consistently enough to keep those signals useful.

Make inspection practical enough to repeat

Safe inspection matters most where the plant is carrying a meaningful share of the site’s operational load. A March 2026 Parliamentary Office of Science and Technology note states that data centres use about 2% of UK electricity, with around a third of that energy going to cooling. It adds that AI servers typically draw more power and fluctuate more than general computing workloads. In practical terms, that makes the cooling chain harder to treat as secondary infrastructure. Pumps, motors and associated support equipment need inspection methods that can show drift early and be repeated without excessive disruption.

The method matters as much as the interval. Depending on the asset, that may mean route-based thermal checks, fixed sensors or remote monitoring on the pumps and motors that support cooling performance. The point is to choose an approach that fits the equipment and the access conditions, while still producing information teams can act on. Good inspection design reduces risk and makes critical checks easier to repeat. It also helps teams identify developing issues earlier and keeps condition data useful over time.

But what does best practice look like when a developing issue is found?

Set the rules for intervention early

Turning detection into action means deciding who owns the call and when intervention is required, then planning the work before the issue hardens into an outage. It also means ensuring the practical side is in place. As the 2025 study “An Empirical Data Model for Spare Parts Management” notes, downtime often ends up being dictated by parts lead time. The same study also points to a familiar problem: maintenance, procurement, logistics and inventory data are often split across different systems and teams, which makes spare-parts decisions harder than they should be.

That is why condition data needs to feed a controlled workflow rather than a growing queue of alerts. Someone needs to decide what happens next, and the issue needs to move quickly enough to avoid becoming harder to contain. That means setting escalation rules and linking findings to work orders.
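A rough sketch of what such a workflow rule might look like is shown below. The severity labels, owners and response windows in the rule table are hypothetical, and a real site would hold them in its CMMS rather than in code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical escalation rules: severity -> (owner, time allowed to plan the work).
ESCALATION_RULES = {
    "watch": ("maintenance planner", timedelta(days=7)),
    "act": ("site operations manager", timedelta(hours=24)),
}


@dataclass
class WorkOrder:
    asset: str
    finding: str
    owner: str
    due: datetime


def raise_work_order(asset: str, finding: str, severity: str, found_at: datetime):
    """Turn a finding into an owned, dated work order, or None if no rule applies."""
    rule = ESCALATION_RULES.get(severity)
    if rule is None:
        return None  # no matching rule: the finding is logged but not escalated
    owner, window = rule
    return WorkOrder(asset, finding, owner, found_at + window)


order = raise_work_order("Chilled-water pump 2", "vibration rising",
                         "act", datetime(2025, 1, 1, 8, 0))
print(order)
```

The point of the rule table is that every escalated finding leaves with a named owner and a deadline, instead of joining a queue of alerts.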

It also means thinking through likely spares and shutdown windows before an asset reaches a harder failure point. For lean teams, the value lies less in generating more signals than in making sure the important ones lead to timely, workable decisions.
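The arithmetic behind that planning can be sketched in a few lines. The function below is an illustration, not a planning tool: its name and parameters are hypothetical, and it assumes that ordering parts and waiting for a shutdown window can run in parallel, so the longer of the two sets the time needed.

```python
def planning_gap_days(warning_days: float,
                      parts_lead_days: float,
                      shutdown_wait_days: float = 0.0,
                      spare_in_stock: bool = False) -> float:
    """Days by which the time needed to intervene exceeds the warning.

    A result of zero or below means the warning is long enough to act on.
    """
    parts_wait = 0.0 if spare_in_stock else parts_lead_days
    # Assumption: parts ordering and shutdown scheduling proceed in parallel.
    time_needed = max(parts_wait, shutdown_wait_days)
    return time_needed - warning_days


# A two-week warning against a two-month parts lead time, with no spare held.
print(planning_gap_days(warning_days=14, parts_lead_days=60))
# The same warning when the spare is already on the shelf.
print(planning_gap_days(warning_days=14, parts_lead_days=60, spare_in_stock=True))
```

A positive result means the asset is likely to reach failure before the repair can start, however early the signal was detected.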

The sites that handle this well treat monitoring as one input into a maintenance system that can absorb a warning and respond in time. That is the difference between better visibility and better resilience.

The discipline behind real resilience

The data centre industry can be drawn to the promise of predictive analytics, continuous monitoring and the idea of near-perfect visibility. But visibility without capability offers limited value. Even advanced diagnostic tools can only support resilience if the organisation has the operational discipline to act on the data they provide.

A mature maintenance programme requires a willingness to do the unglamorous work upfront: getting the priorities and baselines right, then building governance strong enough to turn a faint warning into a funded work order. It requires the slightly sobering realisation that a two-week early warning is worth far less if the required spare part has a two-month lead time and no one holds the authority to schedule a shutdown.

Ultimately, facility reliability is not a product that can be purchased and bolted onto a weak foundation. It is a deliberate, ongoing practice. In an industry that spends billions eliminating single points of failure in power and cooling, the most dangerous vulnerability of all is a team that sees a problem coming but lacks the strategy to stop it.