Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates

Published: arXiv:2511.11466v1
Authors

Dmitry Kovalev, Ekaterina Borodich

Abstract

Recently, several instances of non-Euclidean SGD, including SignSGD, Lion, and Muon, have attracted significant interest from the optimization community due to their practical success in training deep neural networks. Consequently, a number of works have attempted to explain this success by developing theoretical convergence analyses. Unfortunately, these results cannot properly justify the superior performance of these methods, as they could not beat the convergence rate of vanilla Euclidean SGD. We resolve this important open problem by developing a new unified convergence analysis under the structured smoothness and gradient noise assumption. In particular, our results indicate that non-Euclidean SGD (i) can exploit the sparsity or low-rank structure of the upper bounds on the Hessian and gradient noise, (ii) can provably benefit from popular algorithmic tools such as extrapolation or momentum variance reduction, and (iii) can match the state-of-the-art convergence rates of adaptive and more complex optimization algorithms such as AdaGrad and Shampoo.
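To make this concrete, the snippet below is a minimal, illustrative sketch of the simplest non-Euclidean instance mentioned above: SignSGD, i.e., stochastic steepest descent with respect to the l-infinity norm, whose update depends only on the signs of the stochastic gradient. The toy quadratic objective, noise model, and step size are our own assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sign_sgd_step(x, grad, lr):
    """One SignSGD step: steepest descent with respect to the l-infinity norm.
    The update uses only the signs of the stochastic gradient, which is what
    makes the method memory- and communication-efficient."""
    return x - lr * np.sign(grad)

# Toy example (our assumption): minimize f(x) = 0.5 * x^T A x from noisy gradients.
rng = np.random.default_rng(0)
A = np.diag([10.0, 1.0, 0.1])     # ill-conditioned quadratic
x = np.ones(3)

for step in range(200):
    grad = A @ x + 0.01 * rng.standard_normal(3)   # stochastic gradient
    x = sign_sgd_step(x, grad, lr=0.01)

print(x)  # iterates approach the origin, up to noise-level oscillation
```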

Paper Summary

Problem
Deep learning models have grown increasingly large and complex, driving a corresponding growth in the memory footprint of the algorithms used to train them. As a result, the optimization community has shifted its attention to more memory-efficient, non-Euclidean variants of stochastic gradient descent (SGD). However, despite their practical effectiveness, the theoretical understanding of these methods remains limited: existing convergence analyses could not beat the rate of vanilla Euclidean SGD and therefore cannot explain the observed speedups.
Key Innovation
The paper develops a new unified convergence analysis of non-Euclidean SGD (formalized as Algorithm 1 in the paper) under structured smoothness and gradient-noise assumptions. The analysis shows that the method can exploit sparsity or low-rank structure in the upper bounds on the Hessian and the gradient noise, provably benefits from popular algorithmic tools such as extrapolation and momentum variance reduction, and matches the state-of-the-art convergence rates of adaptive and more complex optimizers such as AdaGrad and Shampoo.
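To illustrate what "non-Euclidean" means algorithmically, the template below sketches a steepest-descent step parameterized by a choice of norm: the l-infinity norm recovers SignSGD, while the spectral norm on a weight matrix gives a Muon-style orthogonalized update. This is our own simplified template, not the paper's Algorithm 1, and the SVD is used here for clarity (Muon in practice approximates this step with a Newton-Schulz iteration).

```python
import numpy as np

def linf_direction(g):
    """Normalized steepest-descent direction for the l-infinity norm
    (recovers the SignSGD update)."""
    return np.sign(g)

def spectral_direction(G):
    """Normalized steepest-descent direction for the spectral norm on a
    matrix parameter (a Muon-style update): the semi-orthogonal factor
    of the gradient, computed here via an SVD for clarity."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def non_euclidean_sgd_step(x, grad, lr, direction_fn):
    """Generic non-Euclidean SGD step: x <- x - lr * d(grad), where d maps
    the stochastic gradient to a norm-dependent update direction."""
    return x - lr * direction_fn(grad)

# One step on a 4x3 weight matrix with each choice of norm (toy data).
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))   # stand-in for a stochastic gradient
W_sign = non_euclidean_sgd_step(W, G, 0.01, linf_direction)
W_muon = non_euclidean_sgd_step(W, G, 0.01, spectral_direction)
```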
Practical Impact
In practice, these results support the training of deep neural networks, and large language models (LLMs) in particular, with non-Euclidean SGD methods such as SignSGD, Lion, and Muon: these methods reduce optimizer memory requirements, and the new analysis shows that they can also achieve improved convergence rates, enabling faster and more efficient training of large models.
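The abstract also highlights momentum variance reduction as one of the tools the analysis covers. The sketch below shows the standard recursive (STORM-style) gradient estimator combined with a sign step on a toy quadratic; the objective, noise model, and parameter values are our own assumptions rather than the paper's setup.

```python
import numpy as np

def mvr_estimator(g_curr, g_prev_same_sample, d_prev, beta):
    """Momentum-variance-reduction (STORM-style) gradient estimator:
        d_k = g(x_k; xi_k) + (1 - beta) * (d_{k-1} - g(x_{k-1}; xi_k)),
    where both stochastic gradients are evaluated with the same sample xi_k."""
    return g_curr + (1.0 - beta) * (d_prev - g_prev_same_sample)

# Toy problem (our assumption): f(x) = 0.5 * x^T A x with additive gradient noise.
rng = np.random.default_rng(2)
A = np.diag([5.0, 1.0, 0.2])
grad = lambda x, xi: A @ x + xi            # stochastic gradient oracle

x_prev = x = np.ones(3)
d = grad(x, 0.1 * rng.standard_normal(3))  # initialize with a plain stochastic gradient
lr, beta = 0.01, 0.1

for step in range(200):
    xi = 0.1 * rng.standard_normal(3)      # one sample, reused at both points
    d = mvr_estimator(grad(x, xi), grad(x_prev, xi), d, beta)
    x_prev, x = x, x - lr * np.sign(d)     # sign step on the reduced-variance estimate

print(x)
```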
Analogy / Intuitive Explanation
Imagine you're trying to find your way to the lowest point of a complex landscape. A Euclidean method measures every step with the same ruler in all directions, while a non-Euclidean method uses a ruler adapted to the terrain and can therefore take more efficient steps. Here the landscape is the objective function being minimized, and its structure (for example, sparse or low-rank curvature) is exactly what non-Euclidean SGD exploits to reach the solution faster.
Paper Information
Categories: math.OC, cs.LG
arXiv ID: 2511.11466v1
