Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use

Explainable & Ethical AI
arXiv: 2601.22362v1
Authors

Julien Delavande, Régis Pierrard, Sasha Luccioni

Abstract

Large Language Models (LLMs) are increasingly deployed in production, contributing towards shifting the burden in terms of computational resources and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices - such as numerical precision, batching strategy, and request scheduling - can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats only yield energy gains in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to 100 times. We argue that sustainable LLM deployment depends not only on model internals, but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
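
As a concrete illustration of the kind of measurement involved, the sketch below estimates the GPU energy of a single generation request by reading NVML's cumulative energy counter before and after decoding. This is a minimal sketch rather than the authors' measurement code: the model id, prompt, and generation length are placeholders, and it assumes an NVIDIA GPU (Volta or newer) with the pynvml, torch, and transformers packages installed.

```python
# Minimal sketch (not the paper's code): per-request GPU energy via NVML's
# cumulative energy counter. Model id and prompt below are placeholders.
import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu_energy_mj() -> int:
    # Board energy in millijoules accumulated since the driver was loaded.
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

inputs = tok("Explain GPU energy profiling in one sentence.", return_tensors="pt").to("cuda")

start = gpu_energy_mj()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
energy_j = (gpu_energy_mj() - start) / 1000.0

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{energy_j:.1f} J total, {energy_j / new_tokens:.2f} J per generated token")
```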

Paper Summary

Problem
Large Language Models (LLMs) are increasingly deployed in production, shifting the bulk of computational and energy demands from training to inference, and the energy consumed at inference time has become a growing concern. While prior work has examined the energy cost of inference per prompt or per token, this paper highlights how system-level design choices, such as numerical precision, batching strategy, and request scheduling, can change energy consumption by orders of magnitude for the same model.
Key Innovation
The paper presents a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration on energy consumption. The study reveals that lower-precision formats only yield energy gains in compute-bound regimes, and that batching improves energy efficiency, especially in memory-bound phases like decoding. Additionally, the paper shows that structured request timing (arrival shaping) can reduce per-request energy by up to 100×.
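
Continuing the measurement sketch above (reusing its model, tokenizer, and gpu_energy_mj helper, all of which are placeholders rather than the paper's setup), a simple batch-size sweep illustrates the batching result: in the memory-bound decode phase the model weights are streamed from memory once per step regardless of batch size, so per-request energy should fall as the batch grows.

```python
# Batch-size sweep, reusing model, tok, and gpu_energy_mj from the sketch above.
# Illustrative only: absolute numbers depend on hardware, model, and lengths.
tok.pad_token = tok.pad_token or tok.eos_token  # enable padding for batched prompts
tok.padding_side = "left"                       # left-pad for decoder-only generation

prompts = ["Summarize the benefits of batching for LLM inference."] * 32

for batch_size in (1, 8, 32):
    batch = tok(prompts[:batch_size], return_tensors="pt", padding=True).to("cuda")
    start = gpu_energy_mj()
    with torch.no_grad():
        model.generate(**batch, max_new_tokens=128)
    torch.cuda.synchronize()
    joules = (gpu_energy_mj() - start) / 1000.0
    print(f"batch={batch_size:2d}  total={joules:7.1f} J  per-request={joules / batch_size:6.1f} J")
```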
Practical Impact
The findings have direct practical implications for LLM deployment. Understanding how system-level design choices affect energy use lets developers tune their models and serving configurations to lower energy consumption and carbon footprint. This matters especially as LLMs power user-facing applications such as chatbots and assistants, where aggregate inference demand is large. The results can also inform the design of more energy-efficient AI services and help mitigate the environmental impact of AI.
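
For instance, a serving stack such as Hugging Face's Text Generation Inference (TGI) batches concurrent requests automatically; quantization and batching limits are configured on the server side when it is launched. The snippet below is a hedged illustration of a client querying such a server; the endpoint URL and generation parameters are assumptions, not taken from the paper.

```python
# Hedged illustration: querying a locally running Text Generation Inference
# server, which applies continuous batching across concurrent requests.
# The endpoint URL and generation parameters are placeholders.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint
reply = client.text_generation(
    "What drives the energy cost of LLM inference?",
    max_new_tokens=128,
)
print(reply)
```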
Analogy / Intuitive Explanation
Imagine a busy restaurant where many orders arrive at once; each order is like a prompt sent to the LLM. If the kitchen prepares orders strictly one at a time, every order waits longer and the kitchen wastes effort, even though the dishes themselves are unchanged. An LLM served without effective batching and scheduling behaves the same way: it consumes more energy per request to produce the same outputs. By batching orders and shaping when they hit the kitchen, the restaurant, like a well-orchestrated serving stack, completes the same work with far less waste.
Paper Information
Categories: cs.LG
Published Date:
arXiv ID: 2601.22362v1
