Incident response gains with AIOps

Frank Emery, Principal Product Manager at PagerDuty, outlines some of the major pain points software engineering teams face with managing system data, and where AIOps shows value.

ITOps teams have more than enough data. What they have far less of is actionable information to resolve mission-critical problems with – and there’s no end in sight for this problem, as event data continues to grow and context becomes harder to find in the deluge.

PagerDuty’s platform data shows a 70% year-over-year increase in event data. According to a global survey we conducted of 700 IT professionals, 69% of operations teams struggle to make sense of this influx of data. To make matters worse, as data volumes rise, resources stay the same – or are reduced.

ITOps teams need a way to gain context immediately and reduce toil during an incident to preserve more of their time for strategic initiatives that drive the business forward. Leveraging AIOps is the solution for many of our customers. According to Gartner’s definition, “AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.”

A March 2023 EMA Research Report by Valerie O’Connell, Research Director, indicates that AIOps is top of mind for IT leaders. The top initiatives that the surveyed IT leaders are investing in include AI and automation for predictive/proactive operations (58%) and AI and automation for self-service and unattended actions (51%).


Graphic: Automation, AI, and the Rise of ServiceOps, March 2023 EMA Research Report – page 12

Strategic initiatives such as adopting AI/ML and automation empower ITOps teams to take back their time, reduce MTTR, and improve customer experience, with benefits across the organisation. But proving that value requires a carefully planned implementation.

AIOps improves the incident response process

AIOps is a key strategy for ITOps teams that want actionable data whilst cutting through the noise. It is a practical route to solving these issues, creating a sustainable workload and empowering the team.

Firstly, it can immediately reduce the volume of noise generated by business systems. Noise reduction has two facets. The first is alert grouping, in which alerts that are similar enough are grouped into the same incident, so that responders don’t receive notification upon notification for the same issue. The second is eliminating transient alerts: alerts for problems that resolve themselves within a matter of minutes. Removing these low-value interruptions from the equation adds up to more focus time and fewer incidents.
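As a rough illustration of alert grouping, the sketch below merges alerts that share a service and a normalised summary, provided they arrive within a time window of each other. The `Alert` and `Incident` classes, the five-minute window and the exact-match rule are all assumptions made for this example. Real AIOps tools typically rely on learned similarity rather than a simple key.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    summary: str
    timestamp: float  # seconds since epoch


@dataclass
class Incident:
    key: tuple
    alerts: list = field(default_factory=list)


def grouping_key(alert: Alert) -> tuple:
    """Normalise the summary so near-identical alerts share a key."""
    # Replace digits (host numbers, counts) with a placeholder.
    normalised = re.sub(r"\d+", "#", alert.summary.lower()).strip()
    return (alert.service, normalised)


def group_alerts(alerts, window_seconds=300):
    """Attach each alert to the open incident with the same key,
    if that incident's latest alert is within the window."""
    incidents = []
    open_by_key = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = grouping_key(alert)
        incident = open_by_key.get(key)
        if (incident is not None
                and alert.timestamp - incident.alerts[-1].timestamp <= window_seconds):
            incident.alerts.append(alert)
        else:
            # No recent match: start a new incident for this key.
            incident = Incident(key=key, alerts=[alert])
            incidents.append(incident)
            open_by_key[key] = incident
    return incidents
```

With this rule, "High latency on host 12" and "High latency on host 7" for the same service land in one incident, so the responder is notified once rather than twice.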

Additionally, enhancing triage is a key gain for technical teams. ML-aggregated contextual data helps jumpstart the response process, arming teams with information such as historical context, system status, change correlation and more.

Add in end-to-end event-driven automation, and data becomes actionable not just by humans, but by machines as well. With data that’s normalised across the system, everyone is on the same page during response. And with automation, teams can even run auto-diagnostics to understand the current state, or leave the remediation entirely up to machines.
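To make "normalised data" concrete, here is a minimal sketch that maps payloads from different monitoring sources onto one common schema. The field names (`app`, `message`, `priority` and so on) and the severity mapping are hypothetical and do not reflect any particular tool's format.

```python
# Hypothetical mapping from source-specific labels to three shared levels.
SEVERITY_MAP = {
    "crit": "critical", "critical": "critical", "p1": "critical",
    "warn": "warning", "warning": "warning", "p3": "warning",
    "info": "info", "p5": "info",
}


def normalise_event(raw: dict) -> dict:
    """Map a raw payload onto one common schema (field names are assumptions)."""
    service = raw.get("service") or raw.get("app") or raw.get("component") or "unknown"
    summary = raw.get("summary") or raw.get("message") or raw.get("title") or ""
    severity_raw = str(raw.get("severity") or raw.get("priority") or "info").lower()
    return {
        "service": service.lower(),
        "summary": summary.strip(),
        # Unrecognised severities fall back to "warning" so they are not hidden.
        "severity": SEVERITY_MAP.get(severity_raw, "warning"),
    }
```

Once every event has the same shape, downstream automation such as grouping, routing or diagnostics only needs to understand one format.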

Finally, the overall benefit is, of course, better customer and team experiences. Letting developers and engineers do meaningful work rather than tedious, repetitive tasks leads to happier, more productive colleagues who can focus on solving problems, which is always preferable to fighting fires or ticking boxes.

Realising the benefit

When considering AIOps projects, do take into account the operational maturity of the business, its processes, and the skills of the team. Look to solve a clear business problem. Score the importance of key areas such as:

  • Creating more time for developers to ship value-add features
  • Catching issues before they lead to bad customer experiences
  • Reducing MTTR with added context and intelligent incident routing

Make sure to factor a few critical criteria into the business decision, including:

  • Ease of implementation: How long does it take to see value? How many hours will the team spend and what’s the trade-off?
  • Ease of maintenance: If a technology is not manageable by current teams, what resources will you add? Is that reasonable at this time?
  • Continuous learning: Can a solution adapt and improve, growing more valuable over time?
  • Orient to action: Are the insights high enough quality to drive better actions and incident resolution?
  • Flexibility: A platform doesn’t fulfil its promise if it’s only usable by a select few teams. Who can see value in your new solution?
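One simple way to compare candidate solutions against the criteria above is a weighted score. The sketch below assumes each criterion is rated from 1 to 5. The weights are placeholders chosen for illustration and should be set by the team doing the evaluation.

```python
# Placeholder weights; adjust to reflect what matters most to your team.
CRITERIA_WEIGHTS = {
    "ease_of_implementation": 0.25,
    "ease_of_maintenance": 0.25,
    "continuous_learning": 0.15,
    "orient_to_action": 0.20,
    "flexibility": 0.15,
}


def weighted_score(ratings: dict, weights: dict = CRITERIA_WEIGHTS) -> float:
    """Return the weighted average of 1 to 5 ratings; every criterion must be rated."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight
```

Scoring each option the same way keeps the comparison explicit and makes it easier to explain the eventual decision to stakeholders.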

Any AIOps solution should be configurable to your own team’s working practices, with an adaptable AI that meets the team’s needs – not the other way around.

As with most IT projects, it’s sensible to pick just one area of business pain, set a KPI on what success looks like, and direct the first AIOps use case there. Implementing and testing in a phased approach offers the best chance of getting it right with the least potential disruption.
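If MTTR is the chosen KPI, it helps to agree up front how it is measured. The sketch below computes mean time to resolve from incident records and checks it against a target reduction. The record shape (`created_at`, `resolved_at`) and the 20% target in the usage note are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents) -> timedelta:
    """Average of (resolved_at - created_at) over resolved incidents only."""
    durations = [
        i["resolved_at"] - i["created_at"]
        for i in incidents
        if i.get("resolved_at") is not None
    ]
    if not durations:
        raise ValueError("no resolved incidents to measure")
    return sum(durations, timedelta()) / len(durations)


def kpi_met(baseline: timedelta, current: timedelta, target_reduction: float) -> bool:
    """True if current MTTR is at least target_reduction (0.2 means 20%) below baseline."""
    return current <= baseline * (1 - target_reduction)
```

For example, with a 60-minute baseline and a 20% target, the KPI is met once MTTR for the pilot area falls to 48 minutes or less.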

The crawl, walk, run approach offers a guide to expected results over the adoption phase.

Crawl: Suppress non-actionable alerts and reduce transient alerts. Non-actionable alerts are low-value informational alerts that distract responders. Suppress them to limit interruptions. Look for alerts that commonly resolve themselves after a short time. Create automation to only notify a responder about the transient alert if it doesn’t resolve by itself. This gives your responders more time back for value-add work.
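The crawl stage can be sketched as a single notification rule: suppress non-actionable alerts, and page a responder for an actionable alert only if it is still open after a grace period. The `PendingAlert` class and the five-minute default are hypothetical. The right grace period depends on how quickly your transient alerts usually clear.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAlert:
    alert_id: str
    triggered_at: float                  # seconds since epoch
    resolved_at: Optional[float] = None  # set when the alert clears
    actionable: bool = True              # False for informational alerts


def should_notify(alert: PendingAlert, now: float, grace_seconds: float = 300) -> bool:
    """Decide whether a responder should be paged for this alert at time `now`."""
    if not alert.actionable:
        return False  # suppress non-actionable alerts outright
    if alert.resolved_at is not None:
        return False  # it resolved itself, so nobody needs to be interrupted
    # Still open: page only once the grace period has elapsed.
    return now - alert.triggered_at >= grace_seconds
```

The trade-off is a short delay before genuine problems page someone, so a team might exempt its highest-severity alerts from the grace period.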

Walk: Enrich your events, alerts and incidents. Once you’ve reduced your noise and eliminated transient alerts, the next stage is to make sure that the events, alerts and incidents your teams do interact with are as informative as possible. By creating automation that adds context to the response process, you will see reduced MTTR.
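A minimal sketch of enrichment, assuming alerts and change records are plain dictionaries: it attaches recent changes to the same service (a simple form of change correlation) and a runbook reference. The one-hour lookback, the field names and the `runbooks` mapping are illustrative assumptions.

```python
def enrich_alert(alert: dict, recent_changes: list, runbooks: dict,
                 lookback_seconds: float = 3600) -> dict:
    """Return a copy of the alert with change correlation and a runbook attached."""
    related = [
        change for change in recent_changes
        if change["service"] == alert["service"]
        # Keep only changes made before the alert, within the lookback window.
        and 0 <= alert["timestamp"] - change["timestamp"] <= lookback_seconds
    ]
    enriched = dict(alert)  # leave the original alert untouched
    # Most recent change first, since it is the likeliest suspect.
    enriched["recent_changes"] = sorted(
        related, key=lambda change: change["timestamp"], reverse=True
    )
    enriched["runbook"] = runbooks.get(alert["service"])  # None if not registered
    return enriched
```

A responder who opens the incident then sees straight away that, say, a deployment went out a few minutes before the alert fired.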

Run: Craft auto-remediation. For well-understood and documented incidents, you can create automation that resolves the incident before a human ever gets involved. This benefits not only responders, who preserve their time for deep work, but also customers, who see fewer and shorter interruptions to their service.
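One possible shape for the run stage is a registry that maps well-understood incident types to automated fixes and escalates everything else to a human. The incident types and the `clear_temp_files` action below are placeholders. A real action would call your own tooling.

```python
REMEDIATIONS = {}


def remediation(incident_type: str):
    """Register a function as the automated fix for one incident type."""
    def register(func):
        REMEDIATIONS[incident_type] = func
        return func
    return register


@remediation("disk_full")
def clear_temp_files(incident: dict) -> bool:
    # Placeholder: a real action would run your own clean-up tooling here
    # and return True only if the fix was verified.
    return True


def handle_incident(incident: dict) -> str:
    """Try the registered fix; escalate if none exists or it fails."""
    action = REMEDIATIONS.get(incident["type"])
    if action is None:
        return "escalated"
    try:
        fixed = action(incident)
    except Exception:
        fixed = False  # a failed automation must never hide the incident
    return "auto_resolved" if fixed else "escalated"
```

Defaulting to escalation keeps the automation safe: only incident types that someone has explicitly registered are ever resolved without a human.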

A perfect fit

Incorporating AIOps, empowering the ITOps teams, and gaining sustainable business benefits all require the careful approach outlined above. Be realistic and aim for little victories on the pathway to full adoption. Starting with modest goals, then testing and improving, is a winning approach. Once a first use case has proven its value, adoption can snowball.

Overall, remember: if it doesn’t help the team drive down MTTR, protect customer trust, and mitigate risk, then it isn’t serving the business and you should reevaluate. At the pace of modern business, ITOps teams deserve a helping hand keeping profitable services online and the business on track.
