Wendy Zhao, senior director at Alibaba Cloud Intelligence, takes a look at AI and automation within the data centre from the perspective of the public cloud, exploring where the technology stands today and where it is heading.
The era of automation and AI is fast approaching, and promises to change many industries and aspects of daily life.
The data centre – a space where relatively few people work with many layers of technology at vast scale – is a leading candidate for using such advanced technologies to monitor and intervene in operations, maximising utilisation, optimising efficiency and guaranteeing reliability.
Autonomous, self-operating connected facilities housing the physical infrastructure of public clouds – where fully automated monitoring systems instruct AI-powered robots to carry out physical maintenance and repair without any human involvement – are an attractive vision, but not yet today's reality.
But AI technologies and automation are already helping to run hyperscale data centres and to solve seemingly traditional data centre operations problems. All data centres want improved reliability, increased availability and lower capex and opex.
As public cloud infrastructure scales beyond 100,000 servers in a single facility, with hundreds of petabytes or even exabytes of storage, and with networking IO speeds where 200Gbps is normal and 400Gbps is the target, any improvement in uptime and utilisation clearly has major implications for both the operator and its customers.
Managing failure at scale
For example, in terms of operational efficiency: at constant hardware quality (i.e. a fixed per-unit failure rate), the number of hardware failures grows linearly with the number of servers. This demands new modes of operation to handle failure detection, root cause analysis and repair efficiently.
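A back-of-envelope calculation shows why this linear scaling matters. The fleet sizes and annualised failure rate (AFR) below are illustrative assumptions, not vendor figures:

```python
# Sketch: expected daily hardware failures at scale.
# Fleet size and AFR here are illustrative assumptions, not vendor data.
def expected_daily_failures(units: int, afr: float) -> float:
    """Expected failures per day for `units` components with an
    annualised failure rate `afr` (fraction failing per year)."""
    return units * afr / 365.0

# Hypothetical fleet: 100,000 servers, 10 hard drives each, 1.5% AFR.
drives = 100_000 * 10
print(f"{expected_daily_failures(drives, 0.015):.0f} drive failures/day")  # → 41
```

At this scale, dozens of drive replacements a day become routine, which is why detection and repair workflows have to be systematised rather than handled ad hoc.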
Cost at scale becomes a critical factor for the business. The physical shell, data halls, the electrical/cooling infrastructure, servers and networks all need to be co-designed and co-optimised during operation to achieve the best performance at the most effective cost.
And as important as cost is, reliability is not simply a cost factor. Given the critical nature of the customer workloads, data and applications hosted on public cloud IaaS or PaaS infrastructure, any failure results in real customer dissatisfaction and can hurt many businesses in significant ways. Failure prevention and recovery through prediction, anomaly detection and self-healing become necessary in the cloud era.
Self-driving data centre
Can a self-driving data centre become a reality? One where monitoring of every integrated piece of technology from batteries to UPSs to cable infrastructure to servers and hard disks means failures are predicted and automatically fixed before they occur? Where whenever there is an outage of any component, the infrastructure itself is self-healing? The answer today is no. But the journey has started.
Automation and AI are the clear direction of travel for data centre operations. Technologies for monitoring and failure detection are constantly being developed and introduced to deliver far shorter scheduled downtime, better MTBF (mean time between failures), predictive maintenance, and reductions in failure duration and MTTR (mean time to repair).
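The standard relationship between these two metrics makes the payoff concrete: steady-state availability is MTBF / (MTBF + MTTR), so cutting repair time improves availability directly. A minimal sketch with illustrative numbers:

```python
# Sketch: how MTBF and MTTR translate into availability.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: halving MTTR (e.g. via automated repair
# dispatch) removes half the remaining unavailability.
print(f"{availability(10_000, 4):.5f}")  # → 0.99960
print(f"{availability(10_000, 2):.5f}")  # → 0.99980
```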
AI technologies such as deep learning, machine learning, statistical learning, and optimisation algorithms are used to model complex components and operational models inside the data centres. Together with sensing technologies and automation systems, the aim is to achieve more efficient and reliable data centre infrastructure with lower cost.
Adding sensors or data collection mechanisms to the equipment and critical components inside the data centre means data can be collected at the frequency required to capture fast transient events, and transmitted to the AI-based monitoring system over high-bandwidth networks without interruption.
It is happening at every level. Data on failure rates in normal operations is collected constantly – whether for technology that degrades with relatively high failure rates, such as hard drives and SSDs, or for low-probability events such as battery defects – and the resulting predictions allow the operator to inform and prepare customers in advance of potential risks. This requires experimenting with different techniques, with the goal of creating a closed-loop system in which the patterns and insights gained from massive amounts of data guide operations, so that all decisions and actions are truly data-driven.
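As one illustration of the anomaly detection mentioned above, a simple statistical approach flags telemetry readings that deviate sharply from their recent history. This is a deliberately minimal sketch – the signal name, thresholds and data are all hypothetical, and a production system would use learned models over many correlated signals:

```python
import statistics

# Illustrative sketch: z-score anomaly detection on a single component
# telemetry stream (e.g. drive temperature). All values are made up.
def anomalies(readings, window=20, z_threshold=3.0):
    """Flag indices whose reading deviates by more than z_threshold
    standard deviations from the trailing window of readings."""
    flagged = []
    for i in range(window, len(readings)):
        hist = readings[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.stdev(hist)
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Hypothetical temperature trace with one injected transient spike.
trace = [40.0 + 0.1 * (i % 5) for i in range(40)]
trace[30] = 55.0  # injected fault
print(anomalies(trace))  # → [30]
```

In a closed-loop system, a flagged index like this would feed a repair or migration workflow rather than just a dashboard alert.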
Just around the corner
So why aren’t data centre operations fully automated and overseen by AI today? It is partly cultural, partly technological and partly timing. Public clouds are built on trust at a business and technology level. When moving applications to the public cloud (as opposed to building one’s own data centre IT infrastructure), customers can benefit from better performance, reliability, and lower cost. But they must trust the infrastructure. And, in the cautious world of data centre operations, entrusting your critical infrastructure and workloads to an AI-run operation would today be considered an audacious move.
Yet it is likely that many of the AI and automation advances will happen within the public cloud providers who face the much more prominent scaling challenge, and who can leverage economies of scale to innovate and find new solutions to evolve the data centre industry more rapidly. They will provide value through gradual introduction of new concepts and technologies, especially after validating them first with their own applications. The open sharing of advances to build industry standards will be vital to success.
In terms of timing, AI models require large data sets in order to learn. Before anyone switches over an entire public cloud data centre to AI, trust in system accuracy has to be built to a point where operational safety can be guaranteed. Currently, human domain expertise provides the oversight and has intervention and final decision-making powers. As the AI becomes better and better, more automation comes into play – though human intervention will always exist as the last resort.
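The pattern described above – automation where confidence is high, human intervention as the backstop – can be sketched as a simple decision gate. The function, prediction labels and threshold here are all hypothetical:

```python
# Sketch of the human-in-the-loop pattern: act automatically only when
# model confidence clears a threshold; otherwise escalate to a human
# operator, who keeps final decision-making power. Names are illustrative.
def dispatch(prediction: str, confidence: float, threshold: float = 0.95):
    """Return the action to take for a model prediction."""
    if confidence >= threshold:
        return ("automate", prediction)  # e.g. schedule a drive swap
    return ("escalate", prediction)      # human review and final say

print(dispatch("drive_failure_imminent", 0.99))  # automated action
print(dispatch("battery_defect", 0.70))          # escalated to a human
```

As trust in model accuracy grows, the threshold can be lowered gradually, shifting more of the routine workload to automation without ever removing the human escape hatch.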
New expertise in traditional domains
Traditionally, data centre operation is very much process and procedure based, heavily relying on human inspection and response. The new skills required in the data centre are in data mining techniques and in advanced AI technologies such as machine learning and deep learning.
There is also a need for applied engineering expertise of data centres, networks and servers to help build more accurate models, achieve faster convergence and to make sense of the patterns and insights discovered from the data.
Simulation techniques and mathematical models are also being explored to cover new scenarios for which there is not enough historical data to work with. Applying AI and automation to the data centre world opens up many new opportunities for professionals and public cloud operators. This is just the beginning of the journey, and an exciting time to explore the space and help reinvent the industry.