
Modeling Energy Consumption in Deep Learning Architectures Using Power Laws

Paper explained · Green AI · Energy-aware deep learning · ECAI 2025
Mansour Zoubeirou a Mayaki and Victor Charpenay · 28th European Conference on Artificial Intelligence (ECAI 2025)
Energy estimation · Power laws · Hardware efficiency · Transformers · LSTM / GRU

In one sentence

This paper proposes a practical way to estimate the training energy consumption of deep learning architectures from their computational operations, using hardware efficiency factors modeled by power-law-like saturation curves.

Main idea. Rather than being measured only after a model has been trained, energy can be estimated earlier from the model design: the number of operations, the GPU throughput, and how efficiently each operation uses the hardware.

Why this problem matters

Modern deep learning models are increasingly expensive to train. Architectures such as Transformers, LSTMs, and GRUs can require large amounts of computation, which translates into energy consumption, financial cost, and environmental impact.

A central difficulty is that FLOPs alone do not fully explain energy use. Two operations with similar FLOP counts may consume different amounts of energy because the GPU may execute them with different efficiency.

Key question. Can the energy consumption of a model be estimated before training, using only architectural information and a small set of hardware-dependent efficiency laws?

Method overview

The proposed approach decomposes a neural architecture into elementary operations. Each operation is associated with a computational cost and a hardware efficiency factor. These quantities are then combined to estimate the operation duration and the total energy consumption.

[Figure: energy estimation workflow. Model architecture and workload → Operations (QKV, scores, projection, gates) → HEF law (efficiency as a function of FLOPs) → Duration (estimated for each operation) → Energy.]
From architecture to estimated energy, without running a full benchmark for every candidate model.
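
To make the first step of the workflow concrete, the sketch below counts FLOPs for three of the operation names that appear in it (QKV, scores, projection). It is an illustration only: the counting convention (2·m·k·n FLOPs for an m×k by k×n matrix product), the function names, and the input sizes are assumptions, not the paper's own accounting.

```python
# Hypothetical FLOP counts for matrix products in one self-attention layer.
# Assumed convention: an (m x k) @ (k x n) product costs 2*m*k*n FLOPs.

def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

def attention_op_flops(batch: int, seq_len: int, d_model: int) -> dict:
    tokens = batch * seq_len
    return {
        # Three projections (Q, K, V), each (tokens x d_model) @ (d_model x d_model).
        "qkv": 3 * matmul_flops(tokens, d_model, d_model),
        # Q @ K^T, summed over all heads, for each sequence in the batch.
        "scores": batch * matmul_flops(seq_len, d_model, seq_len),
        # Output projection: (tokens x d_model) @ (d_model x d_model).
        "projection": matmul_flops(tokens, d_model, d_model),
    }

# Illustrative sizes only (not a configuration from the paper).
ops = attention_op_flops(batch=32, seq_len=128, d_model=512)
for name, flops in ops.items():
    print(f"{name}: {flops:.3e} FLOPs")
```

Other operations (for example the product of attention weights with the values, or recurrent gates) would be counted the same way and are left out here for brevity.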

The core equations

1. Hardware efficiency factor

The hardware efficiency factor (HEF) compares the actual throughput of an operation with the maximum theoretical throughput of the GPU:

\[ \eta_{\Theta} = \frac{c/t}{v_{\max}} \]

Here, \(c\) is the computational cost in FLOPs, \(t\) is the operation duration, and \(v_{\max}\) is the maximum theoretical throughput of the GPU.
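
As a minimal sketch of this definition, the snippet below computes the efficiency factor from a cost, a duration, and a peak throughput. The function name and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def hardware_efficiency(flops: float, duration_s: float, v_max: float) -> float:
    """eta = (c / t) / v_max: achieved throughput divided by peak throughput."""
    if duration_s <= 0 or v_max <= 0:
        raise ValueError("duration and peak throughput must be positive")
    return (flops / duration_s) / v_max

# Hypothetical values for illustration.
V_MAX = 10e12   # assumed peak throughput: 10 TFLOP/s
c = 2e9         # operation cost: 2 GFLOPs
t = 1e-3        # measured duration: 1 ms

eta = hardware_efficiency(c, t, V_MAX)
print(f"achieved throughput: {c / t:.2e} FLOP/s, efficiency: {eta:.2f}")
```

With these numbers the operation reaches 2 TFLOP/s out of an assumed 10 TFLOP/s peak, so it uses one fifth of the hardware's theoretical capacity.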

2. Duration of an operation

Once the efficiency is known, the duration of an operation follows by solving the definition above for \(t\):

\[ t_i(c_i) = \frac{c_i}{v_{\max}\,\eta_{\Theta_i}} \]
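
The following sketch applies this formula to two operations with the same FLOP count but different efficiency, which illustrates why FLOPs alone do not determine runtime. The function name, the peak throughput, and the two efficiency values are assumptions made for the example.

```python
def operation_duration(flops: float, v_max: float, eta: float) -> float:
    """t = c / (v_max * eta), in seconds."""
    if not 0.0 < eta <= 1.0:
        raise ValueError("efficiency must lie in (0, 1]")
    return flops / (v_max * eta)

# Hypothetical values: same cost, two different efficiency levels.
V_MAX = 10e12   # assumed peak throughput in FLOP/s
c = 2e9         # cost shared by both operations

t_low = operation_duration(c, V_MAX, eta=0.2)
t_high = operation_duration(c, V_MAX, eta=0.8)
print(f"eta=0.2: {t_low * 1e3:.3f} ms, eta=0.8: {t_high * 1e3:.3f} ms")
```

In this example the less efficient operation takes four times longer despite having an identical FLOP count.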

3. Power-law-like efficiency model

The paper models hardware efficiency as a saturation curve depending on the operation FLOP count:

\[ \eta_{\Theta}(c) \approx \eta_{\max}\left(1 - e^{-k c^{\alpha}}\right) \]

The parameter \(\eta_{\max}\) represents the maximum efficiency level, while \(k\) and \(\alpha\) control how fast the efficiency approaches this saturation level.
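
A short sketch of how this curve behaves is given below. The parameter values are hypothetical and were picked only so that the saturation is visible over a few orders of magnitude of FLOPs; they are not fitted values from the paper.

```python
import math

def hef(flops: float, eta_max: float, k: float, alpha: float) -> float:
    """Saturating efficiency law: eta_max * (1 - exp(-k * c**alpha))."""
    return eta_max * (1.0 - math.exp(-k * flops ** alpha))

# Hypothetical parameters for illustration.
ETA_MAX, K, ALPHA = 0.8, 1e-4, 0.5

for c in (1e4, 1e6, 1e8, 1e10):
    print(f"c = {c:.0e} FLOPs -> efficiency = {hef(c, ETA_MAX, K, ALPHA):.4f}")
```

For positive \(k\) and \(\alpha\), the efficiency starts near zero for very small operations and rises monotonically toward \(\eta_{\max}\) without exceeding it.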

4. Energy estimation

Total energy is modeled as a linear combination of the estimated operation durations, with a constant term \(h\) and one coefficient \(h_i\) per operation:

\[ e \approx h + \sum_{i=1}^{n} h_i\,t_i(c_i) \]
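
The four equations can be chained as in the sketch below: FLOPs give an efficiency, efficiency gives a duration, and durations are combined linearly into an energy estimate. Everything numeric here is an assumption for illustration (peak throughput, curve parameters, coefficients, and the use of one shared efficiency law for all operations); in practice these would be estimated per operation type and per GPU.

```python
import math

V_MAX = 10e12  # assumed peak GPU throughput in FLOP/s (hypothetical)

def hef(flops, eta_max, k, alpha):
    """Saturating efficiency law: eta_max * (1 - exp(-k * c**alpha))."""
    return eta_max * (1.0 - math.exp(-k * flops ** alpha))

def duration(flops, eta_max, k, alpha):
    """t = c / (v_max * eta(c)), in seconds; requires flops > 0."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return flops / (V_MAX * hef(flops, eta_max, k, alpha))

def estimate_energy(op_flops, hef_params, coeffs, intercept):
    """e ~ h + sum_i h_i * t_i(c_i); joules if the h_i are in watts."""
    return intercept + sum(
        coeffs[name] * duration(c, *hef_params[name])
        for name, c in op_flops.items()
    )

# Hypothetical inputs: FLOPs per operation, (eta_max, k, alpha) per operation,
# per-operation coefficients h_i, and the constant term h.
op_flops = {"qkv": 6.0e9, "scores": 5.0e8, "projection": 2.0e9}
hef_params = {name: (0.8, 1e-4, 0.5) for name in op_flops}  # shared law, for simplicity
coeffs = {"qkv": 250.0, "scores": 180.0, "projection": 250.0}

energy = estimate_energy(op_flops, hef_params, coeffs, intercept=0.5)
print(f"estimated energy: {energy:.4f} J")
```

Because each \(h_i\) multiplies a duration, it plays the role of a power (energy per unit time) attributed to operation \(i\).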

What was evaluated?

Architectures. Transformers, LSTMs and GRUs, covering both attention-based and recurrent models.
Hardware. Experiments include NVIDIA A100 80GB PCIe and GeForce RTX 2080 Ti GPUs.
Measurements. Energy, runtime, FLOPs and operation-level efficiency were analyzed across many configurations.

The experiments vary model parameters such as number of layers, hidden dimensions, sequence length, embedding dimension, batch size, and attention heads.
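
Operation-level measurements of this kind are what the parameters of the efficiency law would be estimated from. The sketch below shows one simple way to do so, and it is not claimed to be the authors' fitting procedure: assuming \(\eta_{\max}\) is already known, the law can be linearized as \(\ln(-\ln(1 - \eta/\eta_{\max})) = \ln k + \alpha \ln c\) and solved by ordinary least squares. The data are synthetic and noise-free, generated from known parameters purely to check that the fit recovers them.

```python
import math

def hef(flops, eta_max, k, alpha):
    """Saturating efficiency law: eta_max * (1 - exp(-k * c**alpha))."""
    return eta_max * (1.0 - math.exp(-k * flops ** alpha))

def fit_k_alpha(flops, etas, eta_max):
    """Least-squares fit of k and alpha for a fixed, assumed-known eta_max.

    Requires 0 < eta < eta_max for every measurement.
    """
    xs = [math.log(c) for c in flops]
    ys = [math.log(-math.log(1.0 - eta / eta_max)) for eta in etas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                       # slope of the fitted line
    k = math.exp(mean_y - alpha * mean_x)   # intercept is ln k
    return k, alpha

# Synthetic "measurements" from known parameters (not data from the paper).
TRUE_ETA_MAX, TRUE_K, TRUE_ALPHA = 0.8, 1e-4, 0.5
costs = [1e4, 1e5, 1e6, 1e7, 1e8]
etas = [hef(c, TRUE_ETA_MAX, TRUE_K, TRUE_ALPHA) for c in costs]

k_hat, alpha_hat = fit_k_alpha(costs, etas, TRUE_ETA_MAX)
print(f"k = {k_hat:.3e}, alpha = {alpha_hat:.3f}")
```

With real, noisy measurements \(\eta_{\max}\) would typically have to be estimated as well, for instance with a nonlinear least-squares routine.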

Main findings

Example: energy-aware model selection

The paper compares two Transformer configurations under the same input conditions. Model A has fewer layers and a smaller model dimension, while Model B is deeper and wider.

Model A. 6 layers, \(d_{\mathrm{model}}=512\), 8 attention heads. Estimated energy: about \(36\) J per epoch.
Model B. 12 layers, \(d_{\mathrm{model}}=768\), 12 attention heads. Estimated energy: about \(79\) J per epoch.
Interpretation. The larger model consumes more than twice as much energy per epoch as the smaller one.

Practical message. The method supports early architectural comparison: several candidate models can be compared from their design parameters before investing in full training runs.
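
The comparison between the two models follows directly from the two rounded estimates quoted above. Writing \(E_A\) and \(E_B\) for the estimated energy per epoch of Model A and Model B (symbols introduced here for convenience):

\[ \frac{E_B}{E_A} \approx \frac{79}{36} \approx 2.19, \qquad E_B - E_A \approx 79 - 36 = 43\ \text{J per epoch} \]

Since both inputs are approximate, the ratio should be read as "a little over two" rather than as a precise figure.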

Why this contribution is useful

The method is useful for Green AI because it moves energy estimation earlier in the model design process. Researchers and engineers can compare architectures using estimated energy consumption, rather than relying only on accuracy, parameter count, or FLOPs.

This is particularly relevant when training resources are limited, when deployment is constrained by energy, or when sustainability objectives are part of model selection.

Limitations and future directions

How to cite

Mansour Zoubeirou a Mayaki and Victor Charpenay. Modeling Energy Consumption in Deep Learning Architectures Using Power Laws. ECAI 2025. DOI: 10.3233/FAIA250900.