Why Do We Need Warm-up? A Theoretical Perspective

Generative AI & LLMs
Published: arXiv:2510.03164v1
Authors

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi

Abstract

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.

Paper Summary

Problem
Modern deep learning models require careful hyperparameter tuning, particularly of the learning rate (LR). A common practice is to increase the LR at the beginning of training (the warm-up stage) and decrease it later (the decay stage). However, the theoretical foundations of warm-up remain poorly understood, and existing explanations of its benefits are fragmented.
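To make the warm-up-then-decay pattern concrete, here is a minimal, self-contained sketch of a generic learning-rate schedule (linear warm-up followed by cosine decay). The function name, parameters, and constants are illustrative choices, not the specific schedule analyzed in the paper.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Generic warm-up-then-decay schedule: linear warm-up to peak_lr,
    then cosine decay down to min_lr. Illustrative only; the paper does
    not prescribe this particular schedule."""
    if step < warmup_steps:
        # Warm-up stage: the learning rate grows linearly from ~0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay stage: cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: inspect the learning rate at a few points in training.
for t in [0, 50, 100, 500, 999]:
    print(t, lr_schedule(t, total_steps=1000, warmup_steps=100, peak_lr=3e-4))
```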
Key Innovation
This research provides a principled explanation for why warm-up improves training. The authors rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature by a linear function of the loss sub-optimality. They demonstrate, both theoretically and empirically, that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses.
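A plausible formalization of this condition, stated here as an assumption consistent with the abstract rather than as the paper's exact definition, is

$\|\nabla^2 f(x)\| \le L_0 + L_1 \big(f(x) - f^*\big)$, where $f^* = \inf_x f(x)$,

in contrast with the classical $(L_0, L_1)$-smoothness condition $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, which bounds curvature by the gradient norm rather than by the sub-optimality. Under a bound of this form, the safe step size is small when the loss is far from optimal and grows as the loss decreases, which is precisely a warm-up behavior.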
Practical Impact
The findings have practical implications for training deep learning models. The authors establish upper and lower complexity bounds showing that Gradient Descent with a warm-up schedule converges faster than with a fixed step-size. This gives practitioners a theoretical justification for warm-up and indicates that well-chosen warm-up schedules lead to faster, more efficient training.
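As an illustration of how such a curvature bound can translate into a step-size rule, the sketch below runs gradient descent with a step size inversely proportional to $L_0 + L_1 (f(x_t) - f^*)$, so the step size automatically "warms up" as the loss decreases. The toy objective, constants, and function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def gd_with_warmup(grad, loss, x0, L0, L1, f_star=0.0, steps=200):
    """Gradient descent with step size eta_t = 1 / (L0 + L1 * (f(x_t) - f_star)).
    The step size grows as the loss approaches its infimum -- an automatic
    warm-up. Illustrative sketch; the paper's schedule and constants may differ."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Local curvature is assumed bounded by L0 + L1 * (f(x) - f_star),
        # so its reciprocal is a natural conservative step size.
        eta = 1.0 / (L0 + L1 * (loss(x) - f_star))
        x = x - eta * grad(x)
    return x

# Toy objective whose curvature grows with sub-optimality: f(x) = 0.25 * sum(x^4).
# One can check that ||Hess f(x)|| <= 1.5 + 6 * (f(x) - 0) for this choice.
loss = lambda x: 0.25 * np.sum(x ** 4)
grad = lambda x: x ** 3
x_final = gd_with_warmup(grad, loss, x0=[2.0, -1.5], L0=1.5, L1=6.0)
print(x_final, loss(x_final))
```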
Analogy / Intuitive Explanation
Imagine a runner starting a marathon: they warm up their muscles before settling into race pace. Similarly, when training a deep learning model, the learning rate is increased gradually at the start, while the loss is still far from optimal and the effective curvature is high, and decreased later so that training converges smoothly to a good solution. This analogy captures the intuition behind warm-up and the theoretical justification provided by this research.
Paper Information
Categories: cs.LG math.OC stat.ML
Published Date:
arXiv ID: 2510.03164v1