Data centre cooling has gained a fair amount of traction lately, largely because newly released technologies demand ever more processing power, substantially increasing the heat generated inside modern data centres, environments that were not necessarily designed to handle these growing power demands.
Most data centres have relied on air cooling to maintain optimum server, storage and networking conditions since the concept of the data centre came about. The basic principle of air cooling is that cold air is blown across a single rack, or a number of racks, exchanging the warmer air for much cooler air. It is not a complicated practice, but it is not the most energy-efficient one either.
In theory, a single rack inside a data centre should be able to handle solutions of around 20kW, but the market is rapidly heading towards demands of 40-50kW at the very least, with some highly dense solutions requiring even more power per rack.
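To put those rack figures in context, the airflow needed to carry a given heat load away can be estimated from the sensible-heat relation (power = mass flow × specific heat × temperature rise). The air properties and the 15°C inlet-to-exhaust rise below are assumed, illustrative values, not measurements from any particular facility:

```python
# Estimate the airflow needed to remove a given heat load with air.
# Uses the sensible-heat relation: power = mass_flow * cp * delta_T.
# Air properties and the 15 C temperature rise are assumed values.

AIR_DENSITY = 1.2   # kg/m^3, roughly sea level at 20 C
AIR_CP = 1005.0     # J/(kg*K), specific heat of air

def airflow_m3_per_s(heat_w, delta_t_c=15.0):
    """Volumetric airflow (m^3/s) needed to carry heat_w watts
    away with a delta_t_c rise between inlet and exhaust air."""
    mass_flow = heat_w / (AIR_CP * delta_t_c)   # kg/s
    return mass_flow / AIR_DENSITY              # m^3/s

# Doubling rack power doubles the required airflow through the
# same physical rack, which is where fans start to hit limits.
for load_kw in (10, 20, 40):
    flow = airflow_m3_per_s(load_kw * 1000)
    print(f"{load_kw} kW rack: {flow:.2f} m^3/s ({flow * 2118.88:.0f} CFM)")
```

The pattern the loop prints out is the core of the problem: a 40kW rack needs four times the airflow of a 10kW rack, and fan power, noise and floor-tile airflow budgets do not scale that gracefully.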
The market landscape
Both AMD and Intel have already released the finer details of their next-generation CPUs: the EPYC Genoa (based on the Zen 4 architecture) and the Xeon Sapphire Rapids, respectively. The much-talked-about flagship Genoa CPU packs a punch, delivering 96 cores and 192 threads, but it also carries a staggeringly high TDP of 400W, and the Intel competitor is likely to be very similar in that respect.
There has also been a sharp increase in demand with the latest generation of NVIDIA offerings: the NVIDIA H100 PCIe Gen 5 card is rated at a Thermal Design Power (TDP) of 350W, while the alternative SXM form factor has a TDP of 700W. With both CPUs and GPUs becoming more power-hungry over time, the limits of air cooling for data centre solutions may be approaching quickly, especially for those hoping to run predominantly dense configurations.
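Taking the TDP figures above at face value, a rough rack-level estimate shows how quickly dense configurations outgrow a 20kW rack budget. The server counts and the per-server overhead allowance below are illustrative assumptions, not vendor specifications:

```python
# Rough, illustrative rack power estimate. Per-component wattages
# follow the TDP figures quoted in the text; server counts and the
# overhead allowance are assumptions made for the arithmetic.

def rack_power_kw(servers_per_rack, cpus_per_server, cpu_tdp_w,
                  gpus_per_server=0, gpu_tdp_w=0, overhead_w=500):
    """Estimate total rack power in kW.

    overhead_w approximates memory, storage, NICs, fans and PSU
    losses per server (an assumed round number).
    """
    per_server = (cpus_per_server * cpu_tdp_w
                  + gpus_per_server * gpu_tdp_w
                  + overhead_w)
    return servers_per_rack * per_server / 1000

# A full rack of dual-socket 400 W CPU servers already exceeds 20 kW...
cpu_only = rack_power_kw(servers_per_rack=20, cpus_per_server=2,
                         cpu_tdp_w=400)

# ...and just eight GPU nodes with four 700 W SXM modules each lands
# squarely in the 40-50 kW territory the text describes.
gpu_dense = rack_power_kw(servers_per_rack=8, cpus_per_server=2,
                          cpu_tdp_w=400, gpus_per_server=4,
                          gpu_tdp_w=700, overhead_w=1000)

print(f"CPU-only rack:  {cpu_only:.1f} kW")   # 26.0 kW
print(f"GPU-dense rack: {gpu_dense:.1f} kW")  # 36.8 kW
```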
If this is the case, what feasible alternatives are data centres likely to adopt in the near future?
Liquid cooling is a technique that has been maturing for some time and shows promise in tackling many of the challenges posed by the limits of air cooling. While there are various methods of liquid cooling, three stand out as the most frequently used: immersion, Direct to Chip and Rear Door Heat Exchanger (RDHx).
At a very high level, 'immersion' cooling is a method where entire systems are submerged in a non-conductive (dielectric) liquid. Normally, the systems sit inside a sealed container filled with the dielectric fluid, with the heat they generate removed through heat exchangers. Immersion cooling is regarded by some as the most efficient way of cooling modern HPC servers.
Another technique is 'Direct to Chip' cooling, in which chilled liquid is pumped through small tubes into a server chassis and onto cold plates installed over the CPU, GPU or similar components. The warmed liquid is then pumped on to a second cold plate (in dual-socket systems) or straight back to a unit that chills it and repeats the cycle. Direct to Chip cooling is one of the most prevalent methods of liquid cooling used in data centres today.
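As a back-of-the-envelope comparison, the coolant flow a cold plate needs is tiny relative to the airflow required for the same heat load, because water carries far more heat per unit volume. The water properties and the 10°C coolant rise below are assumed, illustrative values:

```python
# Coolant flow needed by a Direct to Chip cold plate, from the same
# sensible-heat relation used for air cooling. Water properties and
# the 10 C coolant temperature rise are assumed, illustrative values.

WATER_CP = 4186.0       # J/(kg*K), specific heat of water
WATER_DENSITY = 1000.0  # kg/m^3

def coolant_lpm(heat_w, delta_t_c=10.0):
    """Litres per minute of water needed to absorb heat_w watts
    with a delta_t_c rise across the cold plate."""
    mass_flow = heat_w / (WATER_CP * delta_t_c)     # kg/s
    return mass_flow / WATER_DENSITY * 1000 * 60    # L/min

# A 700 W GPU cold plate needs only about one litre per minute of
# water, orders of magnitude less volume flow than the equivalent air.
print(f"700 W cold plate: {coolant_lpm(700):.2f} L/min")
```

That volume advantage is why small tubes and cold plates can handle component powers that would demand impractical fan capacity if cooled by air alone.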
Lastly, RDHx is another practice that can be adopted in the data centre, although RDHx solutions are widely regarded as a better form of air cooling rather than true liquid cooling. Here, the rear door of an entire rack contains both a liquid coolant loop and fans, which combine to exhaust and cool the hot air generated by the components within the rack.
Ultimately, whichever cooling solution is favoured will need to be assessed on a case-by-case basis, but it's safe to say that adopting a liquid cooling approach will go a long way towards keeping the latest generation of CPUs and GPUs within their operating temperature limits, as well as future-proofing data centre environments for years to come.