DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Published: arXiv:2510.15110v1
Authors

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

Abstract

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
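As a rough, non-authoritative sketch of two ingredients named in the abstract, the Python snippet below shows how a hard truncation penalty followed by batch-wise (rather than per-prompt) reward normalization could be wired into advantage estimation. The function name, the 0/1 correctness reward, and the exact normalization details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def truncation_penalty_advantages(rewards, lengths, max_len, eps=1e-8):
    """Illustrative sketch (not the paper's code):
    - rewards: 0/1 correctness reward per sampled response
    - lengths: response length in tokens per sampled response
    - max_len: generation budget; over-length responses earn zero reward
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths)

    # Simplest length penalty: truncation. Responses that exceed the
    # budget receive no reward, regardless of correctness.
    rewards = np.where(lengths <= max_len, rewards, 0.0)

    # Batch-wise reward normalization: statistics are computed over the
    # whole batch instead of each prompt's small sample group, which is
    # the ingredient the paper pairs with truncation to combat biased
    # advantage estimates.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

On a toy batch, `truncation_penalty_advantages([1, 1, 0, 1], [120, 480, 150, 90], max_len=300)` zeroes out the second reward (over budget) before normalizing, so correct-but-overlong responses are treated like incorrect ones.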

Paper Summary

Problem
The paper addresses how to make large language models (LLMs) reason more efficiently without sacrificing accuracy. Current reasoning models rely on long chains of thought to achieve strong performance, but this comes at the cost of heavy token usage, higher latency, and redundant output. The goal is to maximize intelligence per token, i.e., accuracy relative to response length.
Key Innovation
The paper introduces Doing Length pEnalty Right (DLER), a reinforcement learning training recipe that incentivizes more intelligence per token. Rather than designing a sophisticated length penalty, DLER uses the simplest one, truncation, and fixes the RL optimization issues that previously made such penalties hurt accuracy: biased advantage estimation, entropy collapse, and sparse reward signals. The recipe combines batch-wise reward normalization, higher clipping, and dynamic sampling with the truncation penalty, achieving state-of-the-art accuracy-to-length trade-offs and significantly outperforming prior approaches.
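The "higher clipping" ingredient is sketched below under the assumption that it means an asymmetric PPO-style clipping range whose upper bound is loosened (in the spirit of clip-higher), leaving more room to up-weight low-probability but useful tokens and thereby pushing back against entropy collapse. The bound values and function signature are illustrative, not taken from the paper.

```python
import torch

def clip_higher_surrogate(logp_new, logp_old, advantages,
                          clip_low=0.2, clip_high=0.28):
    """Illustrative PPO-style policy loss with an asymmetric clipping
    range: the upper bound (1 + clip_high) is looser than the lower
    bound (1 - clip_low), so tokens with positive advantage can still
    have their probability increased."""
    ratio = torch.exp(logp_new - logp_old)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    # Pessimistic (min) objective, negated so it can be minimized.
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```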
Practical Impact
The DLER approach has several practical implications. First, it can produce accurate and efficient reasoning models that are accessible to a wider range of users, including those without high-quality proprietary training data. Second, it improves test-time scaling: a DLER model can generate multiple concise responses in parallel with higher accuracy and lower latency than its long-output baseline. Finally, the results suggest that reasoning efficiency depends more on the RL optimization strategy than on complex penalty designs, pointing to new directions for building efficient and accessible reasoning models.
Analogy / Intuitive Explanation
Imagine trying to solve a puzzle. A traditional approach might involve many steps of trial and error before reaching the solution. A DLER-trained model is instead incentivized to reach the same solution with fewer steps and less "thinking" overall, without sacrificing the quality of the outcome. In this sense, DLER does not change what the model can solve; it trains the model to get there more directly.
Paper Information
Categories: cs.LG, cs.AI, cs.CL
Published Date:
arXiv ID: 2510.15110v1
