Normalizing Trajectory Models

Generative AI & LLMs
Published: arXiv: 2605.08078v1
Authors

Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind

Abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.
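One way to read "exact likelihood over the generative trajectory" is as a sum of per-step conditional log-likelihoods, each evaluated with the change-of-variables formula. The block below is a sketch under assumptions, not the paper's stated objective: it assumes the reverse process is Markov over time points t_0 < ... < t_K, and the symbols f_θ (invertible per-step map), μ_θ and Σ_θ (Gaussian predictor outputs) are notation introduced here for illustration.

```latex
\log p_\theta(x_{t_0}, \dots, x_{t_K})
  = \log p(x_{t_K}) + \sum_{k=1}^{K} \log p_\theta\big(x_{t_{k-1}} \mid x_{t_k}\big),
\qquad
\log p_\theta(x_s \mid x_t)
  = \log \mathcal{N}\big(f_\theta(x_s; x_t);\ \mu_\theta(x_t),\ \Sigma_\theta(x_t)\big)
  + \log \left| \det \frac{\partial f_\theta(x_s; x_t)}{\partial x_s} \right|
```

Under this reading, four-step sampling corresponds to K = 4, and every term on the right-hand side can be computed exactly for a given trajectory.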

Paper Summary

Problem
The paper addresses the loss of sample quality in diffusion-based models when the number of sampling steps is reduced. Diffusion models approximate each reverse transition with a single Gaussian, and that approximation becomes inaccurate when each transition spans a larger interval.
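A toy calculation can make the single-Gaussian problem concrete. The sketch below is an illustration, not taken from the paper: it assumes one-dimensional data that is either -1 or +1 with equal probability, and a simple forward process x_t = x_0 + sigma_t * noise. Under those assumptions the exact reverse conditional p(x_s | x_t) is a two-component Gaussian mixture, and the function names are hypothetical.

```python
import math

def reverse_conditional(x_t, sigma_s, sigma_t):
    """Exact p(x_s | x_t) for data x_0 in {-1, +1} (equal prior) under the
    toy forward process x_t = x_0 + sigma_t * noise, with 0 < sigma_s < sigma_t.

    Returns (weights, means, std) of a two-component Gaussian mixture.
    """
    r = sigma_s**2 / sigma_t**2
    # Posterior probability that the clean sample was +1, given x_t.
    w_plus = 1.0 / (1.0 + math.exp(-2.0 * x_t / sigma_t**2))
    # Each component is the Gaussian p(x_s | x_0, x_t) for one value of x_0.
    means = [(1.0 - r) * x0 + r * x_t for x0 in (-1.0, 1.0)]
    std = sigma_s * math.sqrt(1.0 - r)
    return [1.0 - w_plus, w_plus], means, std

def mode_separation(x_t, sigma_s, sigma_t):
    """Distance between the two component means, in units of their shared std."""
    _, means, std = reverse_conditional(x_t, sigma_s, sigma_t)
    return (means[1] - means[0]) / std

# Same starting noise level, two different step sizes (toy values).
small_step = mode_separation(0.0, sigma_s=0.95, sigma_t=1.0)
large_step = mode_separation(0.0, sigma_s=0.10, sigma_t=1.0)
print(f"small step: {small_step:.2f} std apart, large step: {large_step:.2f} std apart")
```

An equal-weight mixture of two Gaussians with the same variance is bimodal only when the means are more than 2 standard deviations apart. In this toy setup the small step stays below that threshold, so a single Gaussian is a reasonable fit, while the large step is far above it and the true conditional has two well-separated modes.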
Key Innovation
The key innovation is Normalizing Trajectory Models (NTM), a framework that models the non-Gaussian reverse conditional p(x_s | x_t) as a conditional normalizing flow with exact log-likelihood. An invertible transporter maps x_s into a learned latent space in which the reverse conditional becomes simple enough to be modeled by a Gaussian predictor.
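The transporter-plus-Gaussian idea can be sketched with a single conditional affine coupling layer. This is a minimal illustration under stated assumptions, not the paper's architecture: the coupling design, the fixed random weights standing in for learned networks, the dimension D, and all function names are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4          # toy data dimension (assumption)
H = D // 2     # size of each coupling half

# Fixed random weights standing in for learned networks (hypothetical).
W_s = 0.1 * rng.standard_normal((H + D, H))   # log-scale head
W_t = 0.1 * rng.standard_normal((H + D, H))   # shift head
W_mu = 0.1 * rng.standard_normal((D, D))      # Gaussian predictor mean

def _scale_shift(a, x_t):
    # Log-scale and shift depend on the untouched half `a` and the condition x_t.
    h = np.concatenate([a, x_t])
    return np.tanh(h @ W_s), h @ W_t

def transport(x_s, x_t):
    """Invertible map x_s -> z conditioned on x_t. Returns (z, log|det J|)."""
    a, b = x_s[:H], x_s[H:]
    log_scale, shift = _scale_shift(a, x_t)
    z = np.concatenate([a, b * np.exp(log_scale) + shift])
    return z, log_scale.sum()

def transport_inverse(z, x_t):
    """Exact inverse of `transport` for the same condition x_t."""
    a, z_b = z[:H], z[H:]
    log_scale, shift = _scale_shift(a, x_t)
    return np.concatenate([a, (z_b - shift) * np.exp(-log_scale)])

def log_prob(x_s, x_t, sigma=1.0):
    """Exact log p(x_s | x_t): Gaussian log-density in latent space plus log|det J|."""
    z, log_det = transport(x_s, x_t)
    mu = x_t @ W_mu
    log_gauss = (-0.5 * np.sum(((z - mu) / sigma) ** 2)
                 - D * np.log(sigma) - 0.5 * D * np.log(2.0 * np.pi))
    return log_gauss + log_det

def sample(x_t, sigma=1.0):
    """Draw x_s ~ p(x_s | x_t): sample the Gaussian latent, then invert the map."""
    z = x_t @ W_mu + sigma * rng.standard_normal(D)
    return transport_inverse(z, x_t)

x_t = np.array([0.3, -1.2, 0.5, 2.0])
x_s = np.array([1.0, -0.5, 0.25, 0.7])
print("log p(x_s | x_t) =", log_prob(x_s, x_t))
```

Because the scale applied to the second half depends nonlinearly on the first half, the resulting density over x_s is not Gaussian even though the latent is, and the log-likelihood stays exact because the Jacobian of a coupling layer is triangular. A real model would stack several such blocks and learn the weights by maximizing `log_prob`.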
Practical Impact
The NTM framework has the potential to make image generation more efficient while maintaining high-quality results. Because it supports few-step generation with an exact likelihood model of the reverse process, NTM could be applied in settings such as:
* Real-time image generation for video games or virtual reality
* Efficient image generation for large-scale data processing or scientific simulations
* High-fidelity image generation for domains such as medical imaging or remote sensing
Analogy / Intuitive Explanation
Imagine a route from a starting point to a destination, traveled in legs. Traditional diffusion-based models treat each leg as a straight line, which is accurate when the legs are short. When the trip is compressed into a few long legs, each leg becomes a complex curve that a straight line cannot represent. NTM is like a GPS that learns the actual shape of each long leg, so a few legs are enough for accurate navigation (image generation), while keeping an exact likelihood for the whole route.
---
* The NTM framework builds on existing diffusion-based models and flow-matching techniques, making it a natural extension of current research in image generation.
* The paper explains the NTM framework and its key innovations clearly and concisely, making it accessible to a broad audience.
* The experiments show that NTM achieves high-quality generation in four sampling steps, making it a promising approach for real-world applications.
Paper Information
Categories: cs.CV, cs.LG
arXiv ID: 2605.08078v1
