Resilience starts with the network – and ends with operational discipline

Ramtin Rampour
Principal Solution Architect at Opengear

Ramtin Rampour, Principal Solution Architect at Opengear, argues that true resilience is not just about redundancy on paper, but about building recoverability, access, and process maturity into day-to-day operations.

Resilience depends on networks that stay manageable under pressure. Access and visibility must remain available when normal connectivity paths fail. Across the UK, resilience has become a board-level concern, as essential services rely on robust digital infrastructure. In September 2024, the UK Government confirmed that data centres would be designated as Critical National Infrastructure, putting them alongside sectors like energy and healthcare.

Operators know this well. Their estates stretch across private facilities, cloud regions, and edge sites, often with minimal local support. The network connects them all, and when it fails, teams lose the tools and access needed to restore service.

That said, resilience is about more than just avoiding failure. Failures are inevitable, whether caused by human error, misfiring automation, cyberattacks, system faults, or external events such as power cuts. True resilience means maintaining control and protecting uptime when failure happens.

The real cost of network failures

The network is how services are delivered, monitored, and secured for almost every organisation today. It is also the route engineers use to reach infrastructure during an incident. If that route disappears, recovery becomes slower and riskier. That is a growing concern, given the prevalence of network downtime. In Opengear’s own research, over half of organisations polled reported a 10% to 24% increase in network outages over a two-year period.

Costs are rising too. In its Global Data Center Survey 2024, Uptime Institute reported that, for the second consecutive year, 54% of respondents said their most recent significant outage cost more than $100,000, alongside an increase in outages costing more than $1 million.

Extended downtime affects far more than the balance sheet. Trust is damaged, service commitments are breached, and teams are pulled away from planned work for weeks on end. The message for infrastructure leaders is clear: the network is central to operations and business continuity. It must remain resilient not only when conditions are normal, but when they are not.

Why resilience is harder now

Resilience is being squeezed from two directions at once. Complexity is rising, while operational headroom, or the capacity to respond to unexpected events, is shrinking. Hybrid estates now incorporate colocation, private data centres, public cloud, and edge environments. Each introduces its own security protocols and operational practices, raising the risk of visibility gaps, inconsistent controls, and configuration drift.

This complexity is where day-to-day operational risk often hides. For example, Opengear research found that device configuration changes were cited by 27% of network engineers as a leading cause of outages. Routine tweaks can introduce vulnerabilities or unintended consequences that affect availability. This risk is amplified when teams are stretched thin or working across multiple environments.

Operational headroom is further strained by a skills shortage. Critical skills in network management, automation, and cyber security are harder to find and retain. Security skills are particularly scarce. ISC2 reported a global cyber security workforce gap of 4.8 million in 2024. Network and infrastructure teams feel the same squeeze, especially when they are expected to keep legacy platforms stable while adopting automation and meeting more demanding security requirements.

Security itself can also become a source of fragility when treated as a standalone set of controls rather than something embedded into operational design. Threat actors target management planes to gain control, while routine work such as patching or credential rotation can introduce downtime if handled poorly.

The result is a more complex operating environment, with more moving parts, more threats, and less spare capacity. Resilience, therefore, has to be embedded in how teams work, not just left to redundancy diagrams.

Building resilience into operations

Resilience improves when organisations treat it as a discipline that spans architecture, people, and process. On the technical side, it starts with recoverability. Teams need a dependable way to regain access when the primary network is down.

When a cyber incident or infrastructure fault disrupts connectivity, out-of-band access can enable remote monitoring, troubleshooting, and remediation without relying on compromised systems.

Faulty equipment can be rebooted, isolated, or reconfigured immediately, without waiting for onsite access or partial network restoration. In practice, that can make the difference between a brief disruption and a prolonged outage.

Resilience also means staying ahead. Networks must adapt in real time to shifting workloads, AI-driven processes, and high-density compute demands. Software-defined networking can support a more flexible, programmable control plane to help organisations scale and respond more effectively as conditions change.

This also matters in incident response. When a breach is suspected, containment decisions must happen quickly. If teams cannot access infrastructure because remote access depends on compromised identity systems or unstable links, response slows just when speed matters most.

Automation is equally important to resilience. It reduces human error in repeatable tasks, improving recovery speed and consistency. Strong configuration management, controlled templates, and pre-deployment validation help prevent drift. Telemetry and alerting must also be tuned to service impact, so teams are not flooded with irrelevant data during incidents.
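To make the idea of pre-deployment validation concrete, the checks described above can be sketched in a few lines. This is a minimal, illustrative example: the rule set, config format, and function names are assumptions for the sketch, not any vendor's actual tooling.

```python
# Hypothetical pre-deployment validation sketch: check a candidate device
# config against required and forbidden statements, and flag drift from a
# "golden" template. Config lines here are illustrative, not vendor syntax.

REQUIRED_LINES = {
    "service password-encryption",
    "logging host 10.0.0.5",
}

FORBIDDEN_PREFIXES = ("no logging", "telnet")


def validate(candidate: str) -> list[str]:
    """Return a list of violations; an empty list means the change may proceed."""
    lines = {l.strip() for l in candidate.splitlines() if l.strip()}
    problems = [f"missing required line: {r}" for r in REQUIRED_LINES - lines]
    problems += [
        f"forbidden statement: {l}"
        for l in lines
        if l.startswith(FORBIDDEN_PREFIXES)
    ]
    return problems


def detect_drift(running: str, golden: str) -> set[str]:
    """Lines present in the running config but absent from the golden template."""
    run = {l.strip() for l in running.splitlines() if l.strip()}
    gold = {l.strip() for l in golden.splitlines() if l.strip()}
    return run - gold
```

Real deployments would pair checks like these with version-controlled templates and a change-approval gate, so a risky change is caught before it reaches a device rather than diagnosed after an outage.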

Process maturity matters just as much. Many major outages follow a familiar pattern: a rushed change, unclear monitoring signals, slow escalation, and access bottlenecks. Clear runbooks, rehearsed incident response, and well-understood escalation paths can reduce downtime and confusion significantly.

The human factor

Resilience also depends on people and culture. Long-term continuity requires resilient organisations, not just advanced technology. That means investing in skills, documenting procedures that reflect real conditions, and running simulations that test access, decision-making, and communications. It also means building a culture of trust, where teams can report near misses, challenge risky changes, and learn from incidents without fear of blame.

Resilience is vital to modern IT because businesses depend on uninterrupted data centre and network operations. There is a reason demand for Site Reliability Engineers has grown year on year. Organisations that can maintain continuity of service will be better placed to protect essential systems, support critical national infrastructure, and sustain business operations in the face of disruption.
