Epipolar Geometry Improves Video Generation Models

Computer Vision & MultiModal AI
arXiv: 2510.21615v1
Authors

Orest Kupyn, Fabian Manhardt, Federico Tombari, Christian Rupprecht

Abstract

Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.

Paper Summary

Problem
The paper addresses geometric inconsistencies in videos produced by modern video diffusion models. These models struggle to maintain 3D consistency, yielding videos with unstable motion, perspective flaws, and visual artifacts. This limits the quality of generated videos across applications such as animation, virtual world generation, and novel view synthesis.
Key Innovation
The key innovation is using classical epipolar geometry constraints as preference signals in a preference-based optimization framework. The constraints of projective geometry are used to score the 3D consistency between frame pairs, and the model is aligned toward samples with lower geometric error. The researchers also release a large dataset of generated videos annotated with 3D scene consistency metrics, enabling further research in geometry-aware video generation.
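To make the scoring idea concrete, here is a minimal sketch of how pairwise epipolar consistency can be measured. The standard Sampson distance penalizes deviations from the epipolar constraint x2ᵀ F x1 = 0 for matched points between two frames; a lower mean error would mark the geometrically "preferred" sample. This is an illustrative reconstruction, not the paper's implementation: the synthetic camera rig, the `sampson_error` helper, and the scoring threshold are all assumptions.

```python
import numpy as np

def sampson_error(F, pts1, pts2):
    """Per-correspondence Sampson distance for fundamental matrix F.
    pts1, pts2: (N, 2) arrays of matched image coordinates."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # homogeneous coordinates, (N, 3)
    x2 = np.hstack([pts2, ones])
    Fx1 = x1 @ F.T                        # F @ x1, row-wise
    Ftx2 = x2 @ F                         # F^T @ x2, row-wise
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

# Synthetic rig (assumed for illustration): identity intrinsics, camera 2
# translated by t = (1, 0, 0). For pure translation F = [t]_x.
F = np.array([[0., 0., 0.],
              [0., 0., -1.],
              [0., 1., 0.]])

rng = np.random.default_rng(0)
X = rng.uniform([-2, -2, 4], [2, 2, 8], size=(100, 3))  # random 3D points
pts1 = X[:, :2] / X[:, 2:]                               # projection, camera 1
pts2 = (X[:, :2] - [1, 0]) / X[:, 2:]                    # projection, camera 2

clean = sampson_error(F, pts1, pts2).mean()
noisy = sampson_error(F, pts1, pts2 + rng.normal(0, 0.01, pts2.shape)).mean()
# clean is ~0 (perfect 3D consistency); noisy is larger, so in a
# preference pair the lower-error sample would be chosen as "winner".
```

In the paper's setting the matched points would come from correspondences across generated video frames, and the aggregate epipolar error serves as the preference signal rather than a differentiable loss, which is why end-to-end differentiability is not required.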
Practical Impact
This research has significant practical implications for video generation. The results show that classical epipolar geometry constraints provide more stable optimization signals than modern learned metrics, which produce noisy preference targets. Videos generated with this approach therefore exhibit fewer geometric inconsistencies and more stable camera trajectories. The resulting models can be applied in downstream tasks including animation, virtual world generation, and novel view synthesis.
Analogy / Intuitive Explanation
Imagine photographing a building from different angles. If the viewpoints are inconsistent with one another, the building appears distorted or uneven across the shots. Modern video diffusion models face a similar problem: their generated frames may disagree geometrically, producing unstable motion and artifacts. The researchers use classical epipolar geometry constraints as a "compass" that steers the model toward geometrically consistent generation, so the frames fit together like photographs of one coherent 3D scene.
Paper Information
Categories: cs.CV
Published Date:
arXiv ID: 2510.21615v1
