Mode Seeking meets Mean Seeking for Fast Long Video Generation

Generative AI & LLMs
Published: arXiv:2602.24289v1
Authors

Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

Abstract

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student with a frozen short-video teacher, resulting in a fast few-step long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion quality, and long-range consistency. Project website: https://primecai.github.io/mmm/.

Paper Summary

Problem
The main challenge in this research paper is scaling up video generation from short clips to long, coherent sequences that can last for minutes. This is a difficult task because there is a scarcity of high-quality, long-form video data, making it hard to train models that can generate such videos.
Key Innovation
The authors propose a new training paradigm called "Mode Seeking meets Mean Seeking," which decouples local fidelity from long-term coherence. This is achieved using a Decoupled Diffusion Transformer (DDT) that consists of two separate heads: a global Flow Matching head and a local Distribution Matching head. The Flow Matching head learns to capture narrative structure from long videos, while the Distribution Matching head aligns generated segments with a frozen short-video teacher to preserve local realism.
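The decoupled objective described above can be illustrated with a toy sketch: a mean-seeking flow-matching regression term for the global head, plus a mode-seeking reverse-KL term applied to every sliding window for the local head. All function names, the Gaussian per-window summaries, and the weighting parameter `lam` are illustrative assumptions, not the authors' implementation, which operates on video latents with learned score estimators.

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """Supervised flow-matching (mean-seeking) term on long videos.

    For a linear interpolation path x_t = (1 - t) * x0 + t * x1, the
    target velocity is x1 - x0; the global head regresses it with MSE.
    """
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

def reverse_kl_gaussian(mu_q, var_q, mu_p, var_p):
    """Reverse KL divergence KL(q || p) between 1-D Gaussians.

    Minimizing this over q is mode-seeking: q concentrates on a single
    high-density mode of the teacher distribution p rather than
    averaging over all of its modes.
    """
    return 0.5 * (np.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p
                  - 1.0)

def sliding_windows(frames, win, stride):
    """Split a long frame sequence into overlapping short-clip windows."""
    return [frames[i:i + win]
            for i in range(0, len(frames) - win + 1, stride)]

def decoupled_loss(v_pred, x0, x1, student_stats, teacher_stats, lam=0.5):
    """Total loss: global FM term plus a mode-seeking term per window.

    student_stats / teacher_stats are per-window (mean, variance)
    summaries standing in for the student's and frozen teacher's
    distributions over each sliding-window segment.
    """
    fm = flow_matching_loss(v_pred, x0, x1)
    dm = np.mean([reverse_kl_gaussian(mq, vq, mp, vp)
                  for (mq, vq), (mp, vp) in zip(student_stats, teacher_stats)])
    return fm + lam * float(dm)
```

For example, `sliding_windows(list(range(10)), win=4, stride=2)` yields four overlapping 4-frame windows, each of which would be scored against the short-video teacher in this sketch.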
Practical Impact
This research has the potential to enable the generation of long, coherent videos for applications such as interactive world modeling, long-form story and film generation, and controllable video editing or animation. By closing the fidelity-horizon gap between short and long video generation, this work makes it possible to create high-quality, minute-scale videos.
Analogy / Intuitive Explanation
Think of long video generation as assembling a story from short clips with two separate guides: one watches the big picture to keep the narrative structure coherent, while the other checks each short segment to make sure the details look realistic. By combining these two guides, the model produces a video that is both coherent over minutes and sharp frame by frame.
Paper Information
Categories:
cs.CV cs.LG
arXiv ID:

2602.24289v1
