Supercomputer Cooling: Why the ‘Sub-Zero’ Rooms Matter


Supercomputer cooling is not a curiosity—it’s the practical limit on how fast and densely we can pack compute. Modern high‑performance systems throw off megawatts of heat in a few square metres, and the way that heat is removed changes energy costs, reliability and whether waste heat can be reused. This article shows how air, direct‑to‑chip liquid and immersion approaches differ, what operators measure (for example PUE and FLOPS/W), and why “sub‑zero” cooling rooms have become a strategic part of running the world’s most powerful computers.

Introduction

When a supercomputer is doing heavy work—climate models, genome assemblies, or training AI models—most of the electricity it consumes becomes heat. A single rack can produce tens of kilowatts, and whole sites can draw several megawatts. The physical challenge is straightforward: remove that heat reliably without wasting large amounts of additional energy. The technical and economic answers vary, but the trend is clear: as compute density grows, the cost and complexity of cooling become a central part of any facility decision.

This introduction separates three practical concerns: raw heat density (how many kilowatts per rack), energy accounting (how cooling shows up in metrics such as power usage effectiveness, PUE) and hardware choices (air vs liquid vs immersion). Each choice affects how much energy a system uses for the same amount of computation and whether the site’s waste heat can be captured for heating nearby buildings or industrial processes.

Why supercomputers need extreme cooling

Supercomputers need extreme cooling because their components work at very high power density. A CPU or GPU in a desktop might use a few tens to a few hundred watts; a rack in an HPC installation can concentrate 20–40 kW or more into a cabinet roughly the size of a wardrobe. That creates two immediate issues: first, components must stay within safe temperature and voltage ranges to avoid errors and shorten lifetimes; second, removing heat efficiently at that scale challenges traditional air‑based approaches.

Two standard metrics help compare solutions. PUE, power usage effectiveness, is the ratio of total facility power to IT equipment power: a PUE of 1.0 would mean no facility overhead at all, while typical modern data centres run between roughly 1.05 and 1.4 depending on cooling technology and climate. FLOPS/W (floating‑point operations per watt) is the efficiency measure used in rankings such as the Green500; it relates delivered compute to the power the system itself draws, and so complements PUE, which captures the facility overhead on top of that. Both matter because even a small change in cooling energy at megawatt scale can shift operating costs by hundreds of thousands of euros per year.
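To make that arithmetic tangible, here is a minimal sketch in Python. The IT load, electricity price and PUE values are illustrative assumptions, not figures from any particular site; the point is only how directly PUE overhead turns into annual cost at megawatt scale.

```python
# Back-of-envelope: how PUE overhead turns into annual energy cost.
# All inputs are illustrative assumptions, not data from a real facility.

IT_POWER_MW = 5.0            # assumed average IT load
PRICE_EUR_PER_MWH = 100.0    # assumed electricity price
HOURS_PER_YEAR = 8760

def annual_cost_eur(pue: float) -> float:
    """Total facility energy cost per year at the given PUE."""
    return IT_POWER_MW * pue * HOURS_PER_YEAR * PRICE_EUR_PER_MWH

for pue in (1.4, 1.2, 1.05):
    print(f"PUE {pue:.2f}: ~{annual_cost_eur(pue):,.0f} EUR/year")

# Even a 0.1 improvement in PUE is worth hundreds of thousands of euros
# per year at this scale.
delta = annual_cost_eur(1.3) - annual_cost_eur(1.2)
print(f"Saving from PUE 1.3 -> 1.2: ~{delta:,.0f} EUR/year")
```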

For very dense systems, cooling is not an add‑on. It becomes part of the system design and often determines whether an installation is economically viable.

The main technical options today are:

– Air cooling: fans move ambient air across heat sinks. It is simple and cheap at low density but struggles as rack power rises.

– Direct‑to‑chip liquid cooling (D2C): cold plates or channels remove heat directly from CPUs/GPUs. Liquid has far higher heat capacity than air so it scales better with density.

– Immersion cooling: servers or components are submerged in a dielectric fluid. Two‑phase immersion lets the fluid boil and condense, carrying large heat loads with minimal pumping.

To make the comparison concrete, the table below summarises typical trade‑offs.

Feature | Description | Practical note
Air cooling | Fans and CRAC units move and cool air | Best for lower density; PUE impact rises with rack kW
Direct‑to‑chip liquid | Cold plates or piping extract heat at the source | Enables 20–40+ kW/rack, lowers fan energy
Immersion (single/two‑phase) | Components immersed in dielectric fluid; two‑phase uses boiling/condensing | High density and very low fan energy; specialised maintenance
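
The physical reason liquid scales better is visible in a simple heat balance. The sketch below compares the coolant flow needed to carry the same rack load with air and with water; the rack load and temperature rises are assumptions, and the material properties are rounded textbook values.

```python
# Why liquid scales with density: heat removed per unit of coolant flow.
# Q = rho * V_dot * c_p * dT  (volumetric-flow form of the heat balance)
# The rack load and temperature rises are assumptions for illustration;
# material properties are rounded textbook values.

RACK_LOAD_KW = 30.0

air   = {"rho": 1.2,   "cp": 1.005, "dT": 15.0}   # kg/m^3, kJ/(kg*K), K
water = {"rho": 998.0, "cp": 4.18,  "dT": 10.0}

def required_flow_m3_per_s(load_kw: float, fluid: dict) -> float:
    """Volumetric flow needed to carry the rack load at the given dT."""
    return load_kw / (fluid["rho"] * fluid["cp"] * fluid["dT"])

air_flow = required_flow_m3_per_s(RACK_LOAD_KW, air)
water_flow = required_flow_m3_per_s(RACK_LOAD_KW, water)

print(f"Air:   {air_flow:.3f} m^3/s  (~{air_flow * 3600:.0f} m^3/h)")
print(f"Water: {water_flow * 1000:.3f} L/s (~{water_flow * 3600 * 1000:.0f} L/h)")
print(f"Air needs roughly {air_flow / water_flow:,.0f}x the volumetric flow")
```

The three orders of magnitude between the two flows is, in essence, why fans and air handlers come to dominate facility overhead once rack power climbs.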

Manufacturer reports and independent studies indicate that liquid approaches can significantly reduce the share of site power spent on cooling. Vendor documents from recent years, for example, describe large reductions in fan and chiller loads when moving from air to direct liquid systems. Two‑phase immersion prototypes have reported PUEs close to 1.02 in research settings, well below typical air‑cooled sites, though those results come from controlled installations and still need field validation before wide adoption.

How cooling shapes real data‑center practice

Cooling choices determine more than component temperature: they influence layout, power distribution, operational costs and even whether waste heat can be reused. When a centre installs liquid cooling, rack spacing, leak detection, power delivery and maintenance procedures must be redesigned. Those are one‑time and recurring costs that factor into total cost of ownership (TCO).

A concrete example: at high density, air systems need powerful fans and chillers that together add to facility power draw. That pushes PUE up and reduces net FLOPS/W. Direct‑to‑chip and immersion options remove much of the airflow burden and often permit warmer return temperatures for heat exchangers. Warmer return temperatures are valuable because they allow cheaper heat rejection systems and increase the options to use the waste heat for district heating or industrial processes.
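
A rough way to see why return temperature matters for reuse is to estimate how much of the IT load can actually be exported as useful heat. The load and capture fractions below are assumptions chosen for illustration; real fractions depend on coolant temperatures and on what the local heat network can accept.

```python
# Rough estimate of annually recoverable waste heat.
# IT load and capture fractions are assumptions for illustration; real
# values depend on coolant return temperature and the heat network's needs.

IT_POWER_MW = 5.0
HOURS_PER_YEAR = 8760

# Assumed fractions of IT heat that can be exported at a useful temperature:
capture_fraction = {
    "air-cooled (low return temp)": 0.2,
    "direct-to-chip (warm water)": 0.7,
}

for scenario, fraction in capture_fraction.items():
    reusable_mwh = IT_POWER_MW * HOURS_PER_YEAR * fraction
    print(f"{scenario}: ~{reusable_mwh:,.0f} MWh of heat per year")
```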

In practice, operators run pilots before large rollouts. Pilots measure sustained throughput and system‑level metrics such as Green500‑style FLOPS/W and facility PUE under representative workloads. Those measurements help separate short bursts from sustained performance: a chip may post impressive peak throughput (quoted as peak FLOPS, or TOPS for AI accelerators) yet throttle under continuous load if cooling or power delivery is insufficient. Independent benchmarks and Green500 entries therefore remain an important cross‑check when comparing cooling approaches and hardware claims.
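
A pilot log makes the peak-versus-sustained distinction concrete. The sketch below uses synthetic throughput and power samples; in a real pilot these would come from the facility's power meters and the benchmark's own reporting.

```python
# Separating peak from sustained efficiency in a pilot run.
# The samples below are synthetic; real pilots read them from per-rack
# power meters and the benchmark's reported throughput.

# (throughput in TFLOPS, rack power in kW), sampled once per minute
samples = [
    (920.0, 41.0),   # warm-up burst near peak
    (905.0, 40.8),
    (760.0, 39.5),   # thermal throttling sets in
    (748.0, 39.3),
    (745.0, 39.2),
]

peak_tflops = max(t for t, _ in samples)
avg_tflops = sum(t for t, _ in samples) / len(samples)
avg_power_kw = sum(p for _, p in samples) / len(samples)

# GFLOPS per watt, the Green500-style figure of merit
sustained_efficiency = (avg_tflops * 1000) / (avg_power_kw * 1000)

print(f"Peak:      {peak_tflops:.1f} TFLOPS")
print(f"Sustained: {avg_tflops:.1f} TFLOPS at {avg_power_kw:.1f} kW")
print(f"Sustained efficiency: {sustained_efficiency:.1f} GFLOPS/W")
```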

Another operational angle is repairability and downtime. Air‑cooled systems are familiar to most technicians; immersion and direct‑to‑chip systems require new skills and spare‑parts planning. Some operators accept that trade‑off because the energy savings and space density deliver clear benefits at scale. Others keep a hybrid mix: liquid cooling for the densest, mission‑critical racks and air for less dense workloads.

For a reader interested in device‑level trends—how processor and system design interacts with facility choices—our coverage of recent client and edge hardware shows similar patterns: hardware shifts require software and operating practices to catch up. See TechZeitGeist’s coverage of hardware trends for how silicon and facility choices link up with real applications: AI PCs at CES and the Hardware & Gadgets coverage.

Opportunities, risks and practical tensions

Liquid cooling and immersion offer energy and density benefits, but they introduce new operational risks and supply‑chain questions. One risk is chemical: some fluids used in two‑phase immersion systems are fluorinated and raise environmental concerns. Selecting a fluid therefore requires assessing its global warming potential and end‑of‑life handling. Independent life‑cycle analyses remain scarce in the public literature, so operators must ask vendors for transparent chemistry and disposal plans.

Reliability and serviceability are another tension. Immersion reduces fan and thermal cycling stress, which can increase component lifetime in some models, but it also changes failure modes—repairing a submerged board or replacing an immersed module requires specialized processes. Direct‑to‑chip systems lower airborne particulate risk and can improve mean time between failures for hot components, yet they demand rigorous leak management and trained staff.

Economics are subtle. Manufacturer presentations since 2024 have reported large reductions in cooling power for liquid solutions; however, those figures often depend on the baseline configuration and assumed workloads. Independent research on immersion prototypes shows large PUE improvements, and modelling suggests possible TCO benefits once space savings and energy reductions are accounted for. Still, the full picture must include CAPEX for new racks, chillers or tank systems and OPEX for maintenance, fluid replacement and staff training.
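
A deliberately simple multi-year comparison shows how these pieces interact. Every figure below (retrofit CAPEX, PUE values, maintenance costs, electricity price) is an assumption for illustration; a serious TCO model would also include space, staffing, fluid replacement and any revenue from heat reuse.

```python
# Simple multi-year cost comparison: air baseline vs direct-to-chip retrofit.
# Every number here is an assumption chosen for illustration only.

YEARS = 5
IT_POWER_MW = 5.0
PRICE_EUR_PER_MWH = 100.0
HOURS_PER_YEAR = 8760

scenarios = {
    #                  extra CAPEX,  PUE,  extra annual maintenance
    "air baseline":    (0.0,         1.35, 0.0),
    "direct-to-chip":  (3_000_000,   1.10, 150_000),
}

for name, (capex, pue, maintenance) in scenarios.items():
    annual_energy_cost = IT_POWER_MW * pue * HOURS_PER_YEAR * PRICE_EUR_PER_MWH
    total = capex + YEARS * (annual_energy_cost + maintenance)
    print(f"{name}: ~{total / 1e6:.1f} M EUR over {YEARS} years")
```

Under these assumed numbers the retrofit pays for itself within the five-year window, but a different baseline PUE or electricity price can flip the result, which is exactly why pilots and site-specific modelling matter.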

Finally, governance and standards matter. If each vendor ships a proprietary data format or update mechanism for its embedded control systems, reproducibility and auditability become harder. For public‑interest workloads such as scientific or climate modelling, operators and funders should prioritise transparent measurement protocols (for example, Green500‑style FLOPS/W runs and full PUE disclosure) so comparisons are reproducible across sites.

Trends ahead and what operators test

The next few years will likely see three parallel developments. First, hybrid cooling: many facilities will mix direct‑to‑chip racks for the densest workloads with air‑cooled racks for general tasks. Second, better reuse of waste heat: as return temperatures climb in liquid systems, more operators will pilot heat‑capture projects tied to local heating networks. Third, measurement standardisation: purchasers will demand sustained throughput and third‑party verification of energy claims rather than peak numbers alone.

Practical tests to watch for include: side‑by‑side racks under identical workloads (air vs D2C vs immersion) measuring per‑rack power, facility PUE, and sustainable FLOPS/W; leak and maintenance incident rates over multi‑month periods; and life‑cycle environmental assessments for the chosen fluids and materials. These tests reveal how theoretical gains translate into real savings and risks.

For universities, national labs and companies planning upgrades, the short checklist is straightforward: run a pilot, measure sustained metrics under representative loads, include third‑party validation and model the CAPEX/OPEX balance over several years. Where possible, design for modular change: if density or software needs shift, it should be possible to convert or repurpose racks without full rebuilds.

Conclusion

Cooling is a decisive factor for supercomputer performance, costs and sustainability. As compute density rises, traditional air systems reach practical and economic limits; direct‑to‑chip liquid and immersion cooling move those limits farther out by carrying heat away more efficiently and offering new options for heat reuse. However, the benefits depend on careful measurement of sustained throughput, transparent environmental and maintenance practices, and realistic accounting of CAPEX and OPEX. Operators who pilot and measure under real workloads will be best placed to judge whether a move to extreme cooling is worthwhile.

In short: the “sub‑zero” rooms matter because they are where compute, physics and economics meet—how we remove heat today shapes what we can run tomorrow.


We welcome your experiences and questions about cooling choices—share this article and join the discussion.

