Why hybrid cooling is the way forward

When it comes to cooling, it’s time to go hybrid to survive, says Venessa Moffat, Channel Partner Manager, EMEA at EkkoSense, and DCA Advisory Board Member.

There’s no doubt that the growing demands of processing GPU-intensive AI workloads are placing enormous pressure on current data centre infrastructure and operations. However, things are set to become even more intense, with Elon Musk recently describing the pace of AI compute growth as being like “Moore’s Law on steroids.”

With high-density workloads now typically running at over 30 kW per rack – and some even reaching 70 to 100 kW per rack – it’s clear that the standard 5 to 10 kW per rack data centre, supported by traditional air cooling, is starting to look like infrastructure on borrowed time.

While operations teams need to think hard about anticipated AI workloads, their current data centre infrastructure, and how it will need to change in terms of cooling, I suspect it’s unlikely that there will be a wholesale shift towards immersion cooling to cope with the inevitable extra heat generated. Indeed, I would suggest that air cooling and other forms of liquid-based cooling will remain an important part of the data centre cooling mix – most likely as part of an evolving hybrid cooling approach.

Let’s consider the likely technical scenarios that data centre operations teams face when assessing evolving cooling requirements. First, it’s worth noting that this whole cooling debate is nothing new. Liquid cooling has been around since the Cray X-MP supercomputers of the early 1980s, with the immersion-cooled Cray-2 that followed earning the nickname ‘Bubbles’, while a second wave of liquid cooling followed to support the introduction of blade servers by vendors such as HP some 15 years ago. So what are the options now?

  • Traditional air cooling: Most standard data centres have been running at 5-10 kW per rack, supported by traditional air cooling. With only incremental workload increases, it might make sense to stick with air cooling, but that’s simply not realistic given anticipated AI compute requirements.
  • Enhanced air cooling: As workloads head towards 15-30 kW per rack, existing data centre infrastructure inevitably gets stretched unless it is very well managed. There will be an increasing requirement for an enhanced air cooling approach using in-row cooling, rear-door heat exchangers or high-volume fan walls.
  • Hybrid cooling: With the wider deployment of ultra-high-density AI racks – the largest of which can potentially require up to 100 kW per rack – air cooling alone isn’t enough. This is the hybrid environment in which existing air cooling systems become supplemented by Direct Liquid Cooling (DLC). The sketch after this list shows how these density thresholds map to each tier.
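
To make the tiers above concrete, here is a minimal sketch that maps a rack’s power density to one of these cooling approaches. The thresholds are taken from the approximate figures quoted in this article; the function name and the example densities are purely illustrative.

```python
# Illustrative only: maps a rack's power density (kW) to the cooling tiers
# described above, using the approximate thresholds quoted in this article.

def suggest_cooling_tier(rack_kw: float) -> str:
    """Return a rough cooling-tier suggestion for a given rack density in kW."""
    if rack_kw <= 10:
        return "Traditional air cooling (typical 5-10 kW racks)"
    if rack_kw <= 30:
        return "Enhanced air cooling (in-row, rear-door or fan-wall assistance)"
    # Beyond roughly 30 kW per rack, air alone struggles; DLC supplements it.
    return "Hybrid cooling (air plus Direct Liquid Cooling)"

if __name__ == "__main__":
    for density in (8, 22, 45, 100):
        print(f"{density:>3} kW/rack -> {suggest_cooling_tier(density)}")
```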

For data centre operations teams currently considering the right cooling approach, there are clearly a range of factors to weigh up. There’s been a general assumption that Direct Liquid Cooling (DLC) is simply going to take over from air cooling, but there are a number of very practical reasons why that’s not likely to happen.

From a technical and engineering perspective, immersion cooling can deliver great performance, but there are still potential concerns around oil spillage, the difficulty of making fibre connections, and issues with the liquid interfering with the optical interface. Some components and PCBs degrade in the liquid cooling medium, and there are practical concerns around the difficulty of replacing equipment, the need to replace the oil, and the need to change out fans, heat sinks and the thermal paste on chips – all of which may invalidate warranties.

Data centre operations teams are also finding it challenging to manage the supply issues associated with the massive demand for processors and the associated liquid cooling equipment. With increasing numbers of GenAI application deployments looming, sourcing and deploying these technologies on time will become difficult, and many data centres will need upgrading to accommodate them.

If a company goes for a fully liquid-cooled approach, there will still be a requirement for some level of room cooling using circulating air, since direct liquid cooling technologies are not 100% efficient and there will still be heat-generating elements in the room, such as lighting, fibre switches, legacy disk storage and network switches.
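
A rough heat-budget sketch illustrates why some air cooling always remains. The DLC capture fraction and the ancillary load figure below are illustrative assumptions for the sake of the example, not figures from this article or from any particular vendor.

```python
# Rough heat-budget sketch: even with DLC, some heat still reaches the room air.
# The capture fraction and ancillary loads below are illustrative assumptions.

def residual_air_load_kw(it_load_kw: float,
                         dlc_capture_fraction: float = 0.75,
                         ancillary_load_kw: float = 15.0) -> float:
    """Estimate the heat (kW) that room air cooling must still remove.

    it_load_kw: total IT load served by direct liquid cooling
    dlc_capture_fraction: share of IT heat captured by the liquid loop (assumed)
    ancillary_load_kw: lighting, network switches, legacy storage etc. (assumed)
    """
    uncaptured_it_heat = it_load_kw * (1.0 - dlc_capture_fraction)
    return uncaptured_it_heat + ancillary_load_kw

if __name__ == "__main__":
    # Example: a 500 kW liquid-cooled AI hall still leaves a meaningful air load.
    print(f"Residual air load: {residual_air_load_kw(500):.0f} kW")
```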

Lastly, the external heat rejection equipment required to remove the heat generated by the IT equipment is often forgotten, particularly when it comes to immersion cooling. This also needs to be planned and costed into any DLC upgrade project.

Adjusting to AI’s new engineering realities

So if DLC alone is difficult, is air cooling still the answer? While we’ve seen air cooling support rack densities of up to around 30 kW, you can sense it’s starting to hit the limits of what’s achievable. CIOs and their operations teams know that AI’s remodelling of the data centre is well under way and shows no signs of slowing down. There’s a real need now to adjust to AI’s new engineering realities.

Maximising air cooling performance is obviously important, but it’s getting harder to overlook the actual impact of full-intensity air cooling. Given that many existing data centres can be more than 20 years old, the reality is that fan noise can easily top 100 dB, and the airflow velocity and its associated pressure make for a difficult working environment. Within these environments, more focus on health and safety will be required moving forward.

What’s the answer? Go hybrid to survive

Data centre teams know that the infrastructure decisions they take now have the potential to constrain their AI plans if they get locked into a particular approach. They really need to be prepared for what’s likely to happen from an infrastructure and engineering perspective when they launch their AI services – and that requires absolute real-time white space visibility. So how will data centre cooling evolve over the next 18 months?

Firstly, air cooling isn’t going away. Data centres are still going to need their current air cooling infrastructure to support their extensive existing low-density workload commitments. However, they will also need to take the time to optimise their current thermal and cooling performance if they’re to unlock capacity for additional IT loads.
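
To make the idea of “unlocking capacity” concrete, here is a minimal cooling-headroom sketch. The installed capacity, current load and safety margin used in the example are illustrative assumptions rather than measurements from any real site.

```python
# Minimal cooling-headroom sketch: how much extra IT load could existing air
# cooling absorb once it is running efficiently? All figures are illustrative.

def cooling_headroom_kw(installed_cooling_kw: float,
                        current_it_load_kw: float,
                        safety_margin: float = 0.10) -> float:
    """Estimate spare air-cooling capacity (kW) after holding back a safety margin."""
    usable_capacity = installed_cooling_kw * (1.0 - safety_margin)
    return max(0.0, usable_capacity - current_it_load_kw)

if __name__ == "__main__":
    # Example: 1.2 MW of installed air cooling carrying an 800 kW IT load.
    print(f"Headroom: {cooling_headroom_kw(1200, 800):.0f} kW")
```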

Next, it’s important to note that liquid-cooling environments do have limitations. It isn’t practical to run a completely liquid-cooled data centre, and there’s probably not enough time, experience or underlying necessity across our industry for everyone to jump into immersion cooling just now. Also, prior to a move to DLC, the impact on the external heat rejection plant needs to be considered, as this may well need to be modified. Getting heat out of the servers is one step, but getting the heat out of the building is frequently overlooked in the marketing material promoting DLC.

The answer is to combine both air and DLC cooling in a hybrid approach. Key questions to consider here include the exact blend of air and liquid cooling technologies you’ll need, and a clear view of how you plan to accommodate higher-density AI racks, with their greater power and infrastructure requirements, alongside more traditional power density workloads.

Data centre management teams need to first ensure that their air cooling performance is fully optimised to support current loads – and then bring in liquid cooling as necessary. This may take a few months, but it’s achievable. Once liquid cooling is deployed, you need to ramp it up and run it at an optimum temperature to maximise energy efficiency, and then backfill with air cooling to create the best, most efficient hybrid model currently possible.
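
As a planning aid, here is a deliberately simple sketch of how a mixed rack population might be split between air-cooled capacity and DLC. It assumes a per-rack air-cooling ceiling of roughly 30 kW, based on the figures quoted earlier in this article; the ceiling value, the function and the example estate mix are all assumptions for illustration, and real hybrid designs will apportion heat differently per rack.

```python
# Illustrative hybrid-planning sketch: split a mixed rack population between
# air-cooled and liquid-assisted capacity, assuming a per-rack air-cooling
# ceiling of roughly 30 kW (an assumption based on figures in this article).

AIR_COOLING_CEILING_KW = 30.0  # assumed practical per-rack limit for air alone

def split_cooling_load(rack_densities_kw: list[float]) -> dict[str, float]:
    """Return the total kW handled by air cooling versus DLC for a rack list."""
    air_kw = 0.0
    dlc_kw = 0.0
    for rack in rack_densities_kw:
        if rack <= AIR_COOLING_CEILING_KW:
            air_kw += rack                           # low-density rack: air only
        else:
            air_kw += AIR_COOLING_CEILING_KW         # air handles its share...
            dlc_kw += rack - AIR_COOLING_CEILING_KW  # ...DLC takes the rest
    return {"air_kw": air_kw, "dlc_kw": dlc_kw}

if __name__ == "__main__":
    estate = [8] * 40 + [70] * 6   # 40 legacy racks plus 6 AI racks (example mix)
    print(split_cooling_load(estate))
```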

Taking this cooling model forward, you’ll also need to make sure your hybrid cooling environment remains fully optimised, particularly as workloads continue to scale upwards. Applying best-practice optimisation at a granular level, and deploying AI-powered optimisation technologies to support your AI workloads, will become increasingly critical.
