Why Do We Need Warm-up? A Theoretical Perspective

Generative AI & LLMs
Published: arXiv:2510.03164v1
Authors

Foivos Alimisis, Rustem Islamov, Aurelien Lucchi

Abstract

Learning rate warm-up - increasing the learning rate at the beginning of training - has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.

Paper Summary

Problem
Modern deep learning models require careful hyperparameter tuning, particularly of the learning rate (LR). A common practice is to increase the LR at the beginning of training (the warm-up stage) and decrease it later (the decay stage). However, the theoretical foundations of warm-up remain poorly understood, and existing explanations of its benefits are fragmented.
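To make the warm-up-then-decay pattern concrete, here is a minimal, self-contained sketch of a generic learning-rate schedule (linear warm-up followed by cosine decay). The function name, parameters, and constants are illustrative choices, not the specific schedule analyzed in the paper.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Generic warm-up-then-decay schedule: linear warm-up to peak_lr,
    then cosine decay down to min_lr. Illustrative only; the paper does
    not prescribe this particular schedule."""
    if step < warmup_steps:
        # Warm-up stage: the learning rate grows linearly from ~0 to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay stage: cosine decay from peak_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: inspect the learning rate at a few points in training.
for t in [0, 50, 100, 500, 999]:
    print(t, lr_schedule(t, total_steps=1000, warmup_steps=100, peak_lr=3e-4))
```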
Key Innovation
This research provides a principled explanation for why warm-up improves training. The authors rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature by a linear function of the loss sub-optimality. They demonstrate, both theoretically and empirically, that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses.
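A plausible formalization of this condition, stated here as an assumption consistent with the abstract rather than as the paper's exact definition, is

$\|\nabla^2 f(x)\| \le L_0 + L_1 \big(f(x) - f^*\big)$, where $f^* = \inf_x f(x)$,

in contrast with the classical $(L_0, L_1)$-smoothness condition $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, which bounds curvature by the gradient norm rather than by the sub-optimality. Under a bound of this form, the safe step size is small when the loss is far from optimal and grows as the loss decreases, which is precisely a warm-up behavior.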
Practical Impact
The findings have practical implications for training deep learning models. The authors establish upper and lower complexity bounds showing that Gradient Descent with a warm-up schedule converges faster than with a fixed step-size. This gives practitioners a theoretical justification for warm-up and indicates that well-chosen warm-up schedules lead to faster, more efficient training.
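As an illustration of how such a curvature bound can translate into a step-size rule, the sketch below runs gradient descent with a step size inversely proportional to $L_0 + L_1 (f(x_t) - f^*)$, so the step size automatically "warms up" as the loss decreases. The toy objective, constants, and function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def gd_with_warmup(grad, loss, x0, L0, L1, f_star=0.0, steps=200):
    """Gradient descent with step size eta_t = 1 / (L0 + L1 * (f(x_t) - f_star)).
    The step size grows as the loss approaches its infimum -- an automatic
    warm-up. Illustrative sketch; the paper's schedule and constants may differ."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # Local curvature is assumed bounded by L0 + L1 * (f(x) - f_star),
        # so its reciprocal is a natural conservative step size.
        eta = 1.0 / (L0 + L1 * (loss(x) - f_star))
        x = x - eta * grad(x)
    return x

# Toy objective whose curvature grows with sub-optimality: f(x) = 0.25 * sum(x^4).
# One can check that ||Hess f(x)|| <= 1.5 + 6 * (f(x) - 0) for this choice.
loss = lambda x: 0.25 * np.sum(x ** 4)
grad = lambda x: x ** 3
x_final = gd_with_warmup(grad, loss, x0=[2.0, -1.5], L0=1.5, L1=6.0)
print(x_final, loss(x_final))
```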
Analogy / Intuitive Explanation
Imagine a runner starting a marathon: they warm up their muscles before settling into race pace. Similarly, when training a deep learning model, the learning rate is increased gradually at the start, while the loss is still far from optimal and the effective curvature is high, and decreased later so that training converges smoothly to a good solution. This analogy captures the intuition behind warm-up and the theoretical justification provided by this research.
Paper Information
Categories: cs.LG math.OC stat.ML
Published Date:
arXiv ID: 2510.03164v1