MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

Generative AI & LLMs
arXiv: 2508.13148v1
Authors

Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger

Abstract

Masked diffusion language models (MDLMs), a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored during training, where tokens are masked at random. Although this discrepancy can lead to suboptimal performance, it has been largely overlooked in previous work, leaving the gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose Masked Diffusion Policy Optimization (MDPO), a novel method that exploits the Markov property of the diffusion process and explicitly trains the model under the same progressive refinement schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60× fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained with the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot flexibly refine tokens it has already decoded. This simple yet effective training-free strategy, which we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings establish the great potential of investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.

Paper Summary

Problem
The main problem addressed in this paper is the "training-inference divide" of Masked Diffusion Language Models (MDLMs). During training, tokens are masked at random, but during inference the model progressively reveals the structure of the generated sequence, producing fewer and fewer masked tokens at each step. The model is therefore never explicitly trained on the kinds of partially denoised states it actually visits at inference, and this discrepancy can lead to suboptimal performance.
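To make the divide concrete, the minimal PyTorch sketch below (not taken from the paper's code) contrasts the two regimes: training corrupts a sequence by masking positions at random at a sampled ratio, while inference visits a trajectory of states in which the masked set shrinks monotonically step by step. The mask id, toy sequence, and reveal order are placeholders; in a real MDLM the reveal order would come from the model's confidence ranking at each step.

```python
import torch

MASK_ID = 0  # placeholder [MASK] id; real MDLMs define their own special token
seq = torch.arange(1, 13)  # a toy 12-token sequence standing in for real token ids

# Training-style corruption: sample a masking ratio, then mask positions at
# random. Each training example is an isolated, unstructured state.
ratio = torch.rand(())
train_state = seq.clone()
train_state[torch.rand(seq.shape) < ratio] = MASK_ID

# Inference-style trajectory: start fully masked and reveal a few positions per
# step. The reveal order here is random purely for illustration; a real MDLM
# would rank masked positions by its own confidence at every step. The key
# point is that the masked set shrinks monotonically and each state builds on
# the previous one.
reveal_order = torch.randperm(len(seq))
state = torch.full_like(seq, MASK_ID)
trajectory = []
for step in range(4):
    chosen = reveal_order[step * 3:(step + 1) * 3]
    state = state.clone()
    state[chosen] = seq[chosen]  # stand-in for the model's predicted tokens
    trajectory.append(state)

print(train_state)               # one random-mask training state
print(torch.stack(trajectory))   # the structured states inference actually visits
```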
Key Innovation
The key innovation is Masked Diffusion Policy Optimization (MDPO), a reinforcement learning framework that treats denoising as a sequential decision-making problem and optimizes denoising trajectories with intermediate rewards. By exploiting the Markov property of the diffusion process, MDPO explicitly trains the model under the same progressive refinement schedule used at inference, directly targeting the training-inference divide overlooked by previous work.
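The paper's exact MDPO objective is not reproduced here. As a rough illustration only, the sketch below runs a denoising rollout under a progressive unmasking schedule and applies a plain REINFORCE update with intermediate, reward-to-go returns; it is a deliberately simplified stand-in for MDPO, with a tiny placeholder network and a toy reward function in place of a real MDLM and task verifier.

```python
import torch
import torch.nn as nn

MASK_ID, VOCAB, SEQ_LEN, STEPS = 0, 50, 12, 4

# Tiny stand-in denoiser; a real MDLM would be a bidirectional transformer.
model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def toy_reward(tokens: torch.Tensor) -> float:
    """Placeholder intermediate reward. A real setup would score partially
    denoised sequences, e.g. with a verifier for MATH500 or Countdown."""
    return (tokens != MASK_ID).float().mean().item()

def rollout_and_update() -> float:
    """One policy-gradient update over a full denoising trajectory, using the
    same progressive unmasking schedule the model follows at inference."""
    tokens = torch.full((SEQ_LEN,), MASK_ID)
    per_step = SEQ_LEN // STEPS
    log_probs, rewards = [], []
    for _ in range(STEPS):
        logits = model(tokens)                           # (SEQ_LEN, VOCAB)
        dist = torch.distributions.Categorical(logits=logits)
        sampled = dist.sample()
        conf = logits.softmax(-1).max(-1).values
        masked = (tokens == MASK_ID).nonzero(as_tuple=True)[0]
        # action: unmask the currently most confident masked positions
        chosen = masked[conf[masked].argsort(descending=True)[:per_step]]
        tokens = tokens.clone()
        tokens[chosen] = sampled[chosen]
        log_probs.append(dist.log_prob(sampled)[chosen].sum())
        rewards.append(toy_reward(tokens))               # intermediate reward
    # REINFORCE with reward-to-go: each step is credited with the reward it
    # can still influence, so intermediate signal reaches early steps.
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(rollout_and_update())
```

The point of the sketch is only that the gradient flows through states produced by the inference-time schedule, rather than through randomly masked training examples; MDPO's actual objective and reward design differ in detail.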
Practical Impact
The practical impact is significant: MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60× fewer gradient updates, and achieves average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained with the same number of weight updates. The proposed Running Confidence Remasking (RCR) strategy, a training-free plug-in replacement for the remasking step at inference, also consistently improves performance and yields additional gains when combined with MDPO.
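The paper defines exactly how RCR tracks confidence; the sketch below only illustrates the general idea of confidence-based remasking at inference time: previously decoded tokens are re-scored under the model's current belief, and positions that fall out of the most-confident set are masked again and can be revised later. The stand-in model, budget schedule, and scoring rule are assumptions for illustration, not the paper's exact RCR procedure.

```python
import torch

MASK_ID = 0  # placeholder [MASK] id

def decode_with_confidence_remasking(model, seq_len: int, num_steps: int) -> torch.Tensor:
    """Inference loop where already-decoded tokens are not frozen: at every
    step their confidence is re-evaluated, and only the overall most confident
    positions stay decoded. A token that loses confidence later is remasked
    and can be replaced -- an illustrative rule in the spirit of RCR."""
    tokens = torch.full((seq_len,), MASK_ID)
    per_step = max(1, seq_len // num_steps)
    for step in range(num_steps):
        probs = model(tokens).softmax(-1)                # (seq_len, vocab)
        new_conf, proposal = probs.max(-1)
        decoded = tokens != MASK_ID
        # decoded positions: current probability of the committed token;
        # masked positions: probability of the best new proposal
        committed_conf = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
        conf = torch.where(decoded, committed_conf, new_conf)
        budget = min(seq_len, (step + 1) * per_step)     # how many may stay decoded
        keep = conf.argsort(descending=True)[:budget]
        next_tokens = torch.full((seq_len,), MASK_ID)
        next_tokens[keep] = torch.where(decoded[keep], tokens[keep], proposal[keep])
        tokens = next_tokens
    return tokens

# Stand-in model purely so the sketch runs; a real MDLM forward pass goes here.
toy = lambda toks: torch.randn(toks.shape[0], 64)
print(decode_with_confidence_remasking(toy, seq_len=16, num_steps=4))
```

Because only the inference loop changes, this kind of remasking rule stays training-free, which is why the summary calls RCR a plug-in replacement.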
Analogy / Intuitive Explanation
Think of an MDLM as solving a fill-in-the-blanks puzzle. During training, random blanks are punched into complete solutions and the model learns to fill each one in isolation. During inference, however, the model starts from an entirely blank puzzle and fills it in a few cells at a time, so every step builds on what was placed before; that is a situation it was never directly trained for. MDPO trains the model on this same step-by-step solving process, using intermediate rewards to steer the denoising trajectory, and RCR additionally lets the model erase an entry it has lost confidence in. Note: the analogy is not perfect, but it gives a rough idea of how MDLMs generate text and what MDPO and RCR change.
Paper Information
Categories: cs.LG
arXiv ID: 2508.13148v1