Modeling Energy Consumption in Deep Learning Architectures Using Power Laws

Paper explained · Green AI · Energy-aware deep learning · ECAI 2025

Mansour Zoubeirou a Mayaki and Victor Charpenay · 28th European Conference on Artificial Intelligence (ECAI 2025)

Energy estimation · Power laws · Hardware efficiency · Transformers · LSTM / GRU


In one sentence

This paper proposes a practical way to estimate the training energy consumption of deep learning architectures from their computational operations, using hardware efficiency factors modeled by power-law-like saturation curves.

Main idea. Rather than measuring energy only after a model has been trained, the method estimates it earlier, from the model design: the number of operations, the GPU's maximum throughput, and how efficiently each operation uses the hardware.

Why this problem matters

Modern deep learning models are increasingly expensive to train. Architectures such as Transformers, LSTMs, and GRUs can require large amounts of computation, which translates into energy consumption, financial cost, and environmental impact.

A central difficulty is that FLOPs alone do not fully explain energy use. Two operations with similar FLOP counts may consume different amounts of energy because the GPU may execute them with different efficiency.

Key question. Can the energy consumption of a model be estimated before training, using only architectural information and a small set of hardware-dependent efficiency laws?

Method overview

The proposed approach decomposes a neural architecture into elementary operations. Each operation is associated with a computational cost and a hardware efficiency factor. These quantities are then combined to estimate the operation duration and the total energy consumption.

From architecture to estimated energy, without running a full benchmark for every candidate model.
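To make the decomposition step concrete, here is a minimal sketch that breaks one self-attention block into matrix products and counts the FLOPs of each. The list of operations, the function names, and the convention of counting a multiply-add as 2 FLOPs are assumptions made for illustration; the paper's own operation list may differ.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs of an (m x k) @ (k x n) product, counting each multiply-add as 2."""
    return 2 * m * k * n


def attention_block_ops(batch: int, seq_len: int, d_model: int, n_heads: int) -> dict:
    """Rough forward-pass FLOP counts for one self-attention block (illustrative)."""
    d_head = d_model // n_heads
    tokens = batch * seq_len
    return {
        # Three projections (queries, keys, values), each d_model -> d_model.
        "qkv_projection": 3 * matmul_flops(tokens, d_model, d_model),
        # Q @ K^T for every head and every sequence in the batch.
        "attention_scores": batch * n_heads * matmul_flops(seq_len, d_head, seq_len),
        # Attention weights @ V for every head and every sequence in the batch.
        "attention_values": batch * n_heads * matmul_flops(seq_len, seq_len, d_head),
        # Final projection back to d_model.
        "output_projection": matmul_flops(tokens, d_model, d_model),
    }


# Hypothetical configuration, chosen only for the example.
ops = attention_block_ops(batch=32, seq_len=128, d_model=512, n_heads=8)
for name, flops in ops.items():
    print(f"{name}: {flops:.3e} FLOPs")
```

Each entry of the resulting dictionary plays the role of one elementary operation with its own cost, to which an efficiency factor can then be attached.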

The core equations

1. Hardware efficiency factor

The hardware efficiency factor compares the actual throughput of an operation with the maximum theoretical throughput of the GPU:

\[ \eta_{\Theta} = \frac{c/t}{v_{\max}} \]

Here, \(c\) is the computational cost in FLOPs, \(t\) is the operation duration, and \(v_{\max}\) is the maximum theoretical throughput of the GPU.
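As a minimal sketch of this definition, the snippet below computes the efficiency factor from a measured duration. The function name `hardware_efficiency` and the numbers used are illustrative assumptions, not values from the paper.

```python
def hardware_efficiency(flops: float, duration_s: float, peak_flops_per_s: float) -> float:
    """Return eta = (c / t) / v_max, the fraction of peak throughput achieved."""
    if duration_s <= 0 or peak_flops_per_s <= 0:
        raise ValueError("duration and peak throughput must be positive")
    achieved = flops / duration_s        # actual throughput c / t, in FLOP/s
    return achieved / peak_flops_per_s   # dimensionless ratio


# Hypothetical measurement: 2e12 FLOPs executed in 0.5 s on a GPU
# whose theoretical peak is assumed to be 1e13 FLOP/s.
eta = hardware_efficiency(2e12, 0.5, 1e13)
print(f"efficiency = {eta:.2f}")
```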

2. Duration of an operation

Rearranging the definition of the efficiency factor for \(t\) gives the duration of an operation once its efficiency is known:

\[ t_i(c_i) = \frac{c_i}{v_{\max}\,\eta_{\Theta_i}} \]
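The duration formula can be sketched in a few lines. The function name `operation_duration` and the example values are hypothetical and serve only to show the calculation.

```python
def operation_duration(flops: float, peak_flops_per_s: float, efficiency: float) -> float:
    """Return t = c / (v_max * eta), the estimated duration in seconds."""
    if peak_flops_per_s <= 0 or efficiency <= 0:
        raise ValueError("peak throughput and efficiency must be positive")
    return flops / (peak_flops_per_s * efficiency)


# Hypothetical operation: 2e12 FLOPs on a GPU with an assumed peak of
# 1e13 FLOP/s, running at an assumed efficiency of 0.4.
t = operation_duration(2e12, 1e13, 0.4)
print(f"estimated duration = {t:.3f} s")
```

Note that a lower efficiency lengthens the estimated duration for the same FLOP count, which is why two operations with equal FLOPs can take different amounts of time.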

3. Power-law-like efficiency model

The paper models hardware efficiency as a saturation curve depending on the operation FLOP count:

\[ \eta_{\Theta}(c) \approx \eta_{\max}\left(1 - e^{-k c^{\alpha}}\right) \]

The parameter \(\eta_{\max}\) represents the maximum efficiency level, while \(k\) and \(\alpha\) control how fast the efficiency approaches this saturation level.
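To illustrate the shape of this curve, the sketch below evaluates it at a few workload sizes. The parameter values are hypothetical and were picked only so that the saturation is visible; they are not fitted values from the paper.

```python
import math


def saturating_efficiency(flops: float, eta_max: float, k: float, alpha: float) -> float:
    """Evaluate eta(c) = eta_max * (1 - exp(-k * c**alpha))."""
    return eta_max * (1.0 - math.exp(-k * flops ** alpha))


# Hypothetical parameters, chosen only to show the shape of the curve.
ETA_MAX, K, ALPHA = 0.6, 1e-4, 0.5

workloads = (1e6, 1e8, 1e10, 1e12)
curve = [saturating_efficiency(c, ETA_MAX, K, ALPHA) for c in workloads]
for c, eta in zip(workloads, curve):
    print(f"c = {c:.0e} FLOPs -> eta = {eta:.3f}")
```

For small workloads, where \(k c^{\alpha} \ll 1\), the expression reduces to approximately \(\eta_{\max}\,k\,c^{\alpha}\), which is a pure power law in \(c\); for large workloads the exponential term vanishes and the efficiency levels off at \(\eta_{\max}\).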

4. Energy estimation

Total energy is modeled as a linear combination of the estimated operation durations, with a constant term \(h\) and one coefficient \(h_i\) per operation:

\[ e \approx h + \sum_{i=1}^{n} h_i\,t_i(c_i) \]
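A minimal sketch of this linear model follows. It assumes that \(h\) acts as a constant offset and that each \(h_i\) is a per-operation coefficient; with energy in joules and durations in seconds, the coefficients then have the dimension of power. All numeric values are made up for the example.

```python
def estimate_energy(durations_s, coefficients, intercept):
    """Return e = h + sum_i h_i * t_i for per-operation durations t_i."""
    if len(durations_s) != len(coefficients):
        raise ValueError("need exactly one coefficient per operation")
    return intercept + sum(h_i * t_i for h_i, t_i in zip(coefficients, durations_s))


# Hypothetical values for three operations.
durations = [0.5, 0.2, 0.1]      # t_i, in seconds
coeffs = [200.0, 150.0, 100.0]   # h_i, in joules per second (assumed)
e = estimate_energy(durations, coeffs, intercept=5.0)
print(f"estimated energy = {e:.1f} J")
```

In practice the coefficients would be obtained by regression against measured energy, which this sketch does not attempt.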

What was evaluated?

Architectures

Transformers, LSTMs and GRUs, covering both attention-based and recurrent models.

Hardware

Experiments include NVIDIA A100 80GB PCIe and GeForce RTX 2080 Ti GPUs.

Measurements

Energy, runtime, FLOPs and operation-level efficiency were analyzed across many configurations.

The experiments vary model parameters such as number of layers, hidden dimensions, sequence length, embedding dimension, batch size, and attention heads.

Main findings

Hardware efficiency is essential. FLOPs alone are not sufficient because different operations use the GPU with different efficiency.
Efficiency saturates. Larger workloads often use the GPU more efficiently, but the improvement eventually reaches a plateau.
The proposed model is accurate. For Transformer models, the reported regression achieves a high coefficient of determination (\(R^2 \approx 0.96\)). A test evaluation across heterogeneous GPU platforms reports \(R^2 \approx 0.98\).
Model depth and dimensionality matter strongly. For Transformers, the number of layers and the model dimension have a much larger influence on energy than the number of attention heads.
Energy-aware selection becomes possible. In the example discussed in the paper, a larger Transformer configuration is estimated to consume more than twice the energy per epoch of a smaller configuration under the same input settings.
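For readers unfamiliar with the coefficient of determination quoted above, the following sketch shows how \(R^2\) is computed from measured and predicted energies. The data points are invented for illustration and do not come from the paper's experiments.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))  # unexplained variation
    ss_tot = sum((y - mean) ** 2 for y in measured)                  # total variation
    return 1.0 - ss_res / ss_tot


# Invented measured and predicted energies (in joules) for four configurations.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 17.0, 31.0, 41.0]
score = r_squared(measured, predicted)
print(f"R^2 = {score:.2f}")
```

A value close to 1 means the predictions account for almost all of the variation in the measurements.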

Example: energy-aware model selection

The paper compares two Transformer configurations under the same input conditions. Model A has fewer layers and a smaller model dimension, while Model B is deeper and wider.

Model A

6 layers, \(d_{\mathrm{model}}=512\), 8 attention heads.

Estimated energy: about \(36\) J per epoch.

Model B

12 layers, \(d_{\mathrm{model}}=768\), 12 attention heads.

Estimated energy: about \(79\) J per epoch.

Interpretation

Model B is estimated to consume about 2.2 times the energy of Model A per epoch (79 J versus 36 J).
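The comparison can be reproduced with simple arithmetic on the two estimates quoted above. The snippet below is only a sketch of how candidate configurations might be ranked by estimated energy; the dictionary layout and variable names are assumptions.

```python
# Estimated energy per epoch, in joules, as quoted in the example above.
candidates = {
    "Model A (6 layers, d_model=512, 8 heads)": 36.0,
    "Model B (12 layers, d_model=768, 12 heads)": 79.0,
}

# Rank candidates from cheapest to most expensive and report each one
# relative to the cheapest configuration.
ranked = sorted(candidates.items(), key=lambda item: item[1])
cheapest_energy = ranked[0][1]
for name, energy in ranked:
    print(f"{name}: {energy:.0f} J/epoch, {energy / cheapest_energy:.2f}x the cheapest")

ratio = candidates["Model B (12 layers, d_model=768, 12 heads)"] / cheapest_energy
```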

Practical message. The method supports early architectural comparison: several candidate models can be compared from their design parameters before investing in full training runs.

Why this contribution is useful

The method is useful for Green AI because it moves energy estimation earlier in the model design process. Researchers and engineers can compare architectures using estimated energy consumption, rather than relying only on accuracy, parameter count, or FLOPs.

This is particularly relevant when training resources are limited, when deployment is constrained by energy, or when sustainability objectives are part of model selection.

Limitations and future directions

The experiments are conducted on single-GPU setups, while very large models often use distributed training.
Multi-GPU communication, load balancing, memory bandwidth, and data sharding can affect energy consumption.
Future extensions may include mixed precision training, more hardware platforms, and emerging architectures such as Mixture-of-Experts models.

How to cite

Mansour Zoubeirou a Mayaki and Victor Charpenay. Modeling Energy Consumption in Deep Learning Architectures Using Power Laws. ECAI 2025. DOI: 10.3233/FAIA250900.