Using Machine Learning to Efficiently Cool Data Centers
30–40% of power costs in a data center go to cooling. This README describes a data-driven methodology to model, simulate, and automatically optimize cooling policies to reduce PUE while honoring operational constraints.
Github Repo
Table of Contents
Problem Statement
30–40% of power costs in a data center go to cooling. Average PUE of data centers in India is over 1.7. Though data centers have state-of-the-art cooling systems from leading vendors, they can be managed more efficiently due to:
- Hard-to-model environments: There’s no safe place to experiment with new settings (e.g., setpoints).
- Coarse, localized policies: Often tuned for peak IT load and ignore cross-system interactions.
- Reactive/static control: E.g., lowering AC temperature only after a zone heats up.
Introduction
Past sensor and control data can be used to model data centers. These models can accurately predict the impact of changing setpoints and can be used to test or even auto-generate new policies.
Background
Google successfully deployed machine-learning-based cooling control in their data centers, reportedly saving ~40% of cooling energy and achieving very low PUE (≈ 1.06). See:
Our Methodology

Using data to model and then control
Modelling
- Most DC equipment (via Modbus/SNMP) feeds data to DCIM or similar systems (e.g., rack temps, IT load, AHU fan speeds).
- We train deep neural networks on this time-series data to learn a function mapping setpoints + sensor readings → cooling power usage.
- Objective: accurate aggregate prediction of energy usage in the next 1 hour.
Model as proxy
- Direct experimentation is risky.
- The trained model acts as a virtual testbed to estimate PUE impacts of changes (e.g., +1 °C PAC setpoint) without touching production.
Model testing
- Deployed in a containerized environment (cloud or on-prem).
- DC personnel can safely experiment with policies using the model. This is a risk-free stage.
Generating policies
- We use reinforcement learning (RL) to derive control strategies for specific scenarios.
- The RL agent is rewarded for reducing PUE while respecting constraints.
- Generated policies are pushed to DCIM for review or direct deployment.
Experiments
Strategy Simulations
Consider 4 racks in a cold-aisle/hot-aisle layout. Each rack intakes from the cold aisle and exhausts to the hot aisle.
- Floor tiles are controllable, with a tile temperature setpoint of 0.1 (10% of max allowable).
- All racks start at 0.9 (90% of max allowable).
- The model is trained for a day to bring rack temperatures down to 0.8, minimizing time and energy.
What does the model learn?
-
After 10,000 iterations (video):
- Cools some servers in Rack 2.
- Leverages cool air in the middle hot aisle to cool Rack 3.
- Chooses some suboptimal tiles and keeps them on.
-
After 20,000 iterations (
):
- Learns to switch tiles on/off.
- Optimizes tile damping.
EnergyPlus Simulations
We use EnergyPlus to simulate building energy consumption and override its control system:
- Build digital twins of existing DCs.
- Generate sensor/control data for training and architecture design.
- Acts as a safe experimentation platform.
Simulated DC floor
Integration with DCIM
- Runs in cloud, on a DC workstation, or as a DCIM plugin.
- Requires API access to DCIM data (or a portal for periodic data uploads).
Launch
Before launch, we collect operational constraints (e.g., temperature at rack < 22 °C). All policies enforce constraints at all times.
- Stage 1 – Assisted generation (slow):
Generate policies every 6 hours, reviewed by SMEs, deployed via DCIM, continuously monitored.
- Stage 2 – Faster assisted:
Generate policies ~hourly; still manual approval required.
- Stage 3 – Auto-deploy with daily verification:
If ≥ 99% pass in Stage 2, move to automatic DCIM deployment; daily policy verification and continuous monitoring.
- Stage 4 – Full autonomy with fallback:
System takes control; team retains revert capability for events.
- Performance monitoring:
Track all systems/constraints and quantify energy savings.
Data Required for Modelling
The list below is illustrative. Missing or extra parameters do not materially affect savings potential.
Sensory data
- Rack temperature & humidity (necessary)
- Outside air temperature & humidity (necessary, can be sourced externally if not logged)
Cooling system data
Overall metrics
- PUE
- IT load
- Power distribution across AHUs, chillers, cooling towers, etc.